STAC-A2™ Benchmark on POWER8

Bishop Brock, Frank Liu, Karthick Rajamani

IBM Research, 11501 Burnet Road, Austin, Texas 78758
{bcbrock,frankliu,karthick}@us.ibm.com

ABSTRACT

The STAC-A2™ benchmark is an emerging standard designed to evaluate the speed and quality of computational platforms for performing financial risk analytics in the capital markets industry. The problem posed by the benchmark is the computation of several types of Greeks for an exotic option under an American exercise model. We recently reported record-setting performance for a STAC-A2 benchmark solution developed for an IBM POWER8 S824 server. We explain the high performance of our solution in terms of the architecture, scalability and high memory bandwidth provided by POWER8 based systems. Developing the benchmark application also led us to investigate and perfect several techniques that are generally applicable to the simulation of complex options and their sensitivities. We describe several of these techniques in detail, along with the performance impacts we observed when compared with other approaches. We focus on two areas in particular, namely cache-efficient data management for Monte Carlo simulation of American-exercise options, and a parallel implementation of the Longstaff-Schwartz algorithm.

Keywords

POWER8, STAC-A2, Heston model, Longstaff-Schwartz, Matrix transpose, Parallel SVD, OpenPOWER

1. INTRODUCTION

Pricing and sensitivity analysis of options are important problems in computational finance. The Securities Technology Analysis Center (STAC), through the STAC Benchmark Council, has chosen the computation of exotic option sensitivities as the basis of a benchmark to evaluate the performance, scalability and quality of platforms for computational finance. The STAC-A2 benchmark has rapidly become an industry standard in this area, with leading financial firms looking to these results as they evaluate new hardware platforms. The STAC-A2 benchmark specification does not restrict the technologies used in the solution, and benchmark results have been published for traditional servers, GPU-based solutions and other heterogeneous computing platforms [1, 2, 3].

We recently reported the first-ever benchmark results for a STAC-A2 solution implemented on an IBM POWER8 S824 server [4], and our implementation holds several of the public benchmark performance and scalability records across all platforms as of September, 2015. The 2-socket S824 system outperforms all 2-socket Intel Xeon configurations published to date, including servers with Intel Xeon Phi acceleration [1, 2], and even maintains several records over the latest 4-socket Intel Xeon system [5]. Furthermore, the performance of the S824 is within 10% of a recent NVIDIA Tesla K80 GPU result on a key benchmark [3], while the memory space advantage of a traditional platform allows the S824 to solve large problems much more efficiently than current GPU-based solutions.

A review of the historical benchmark results [12] shows that performance has been advancing at a much greater pace than the raw capabilities of the underlying hardware, highlighting the importance of both high-performance hardware and efficient software to computational finance. The technical contributions of this paper are the more general algorithms and system-level optimizations we implemented in our STAC-A2 solution.

We begin the paper by providing a brief overview of the benchmark problem. We then discuss the core, cache and memory subsystems of the POWER8 architecture as they relate to high-performance computing (HPC) workloads. We also describe the software operating environment and structure of our parallel application. The benchmark option valuation method requires all Monte Carlo path data to be stored for later analysis, effectively requiring the transposition of numerous large matrices. We discuss several solutions including a non-obvious transposition algorithm, along with heuristics that improve the performance of the transpose. Our benchmark application also includes a fast parallel implementation of the Longstaff-Schwartz algorithm, providing this inherently serial problem an almost "embarrassingly parallel" solution under certain conditions, even on a traditional platform. We conclude the paper by relating our performance results to the inherent scalability and high memory bandwidth of POWER8 based systems.

2. BENCHMARK OVERVIEW

The STAC-A2 benchmark, as well as other implementations, has been introduced in many references [17, 20, 12, 11]. Therefore we only provide a brief operational overview focusing on the performance benchmarks. STAC-A2 benchmark public reports (e.g., [4]) also include detailed descriptions of the benchmarks along with the results.

Workload        Type   Assets   Timesteps   Paths         Scenarios   Arrays*   Memory*
Baseline        Warm   5        252         25,000        93          52        2.5 GB
Large Problem   Warm   10       1260        100,000       308         152       143 GB
Asset Scaling   Cold   78*      252         25,000        15,642*     6,476     309 GB
Path Scaling    Cold   5        252         28,000,000*   93          52        2.8 TB

Table 1: Key STAC-A2 benchmark workloads. Items marked * are specific to our solution.
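As a rough plausibility check on the Memory column (our own back-of-the-envelope arithmetic, not part of the benchmark definition), the footprint is dominated by the double-precision path data held across all arrays, ignoring padding and per-thread buffers:

    memory ≈ arrays × paths × timesteps × 8 bytes,
    e.g., Baseline: 52 × 25,000 × 252 × 8 B ≈ 2.4 GB,

in line with the 2.5 GB reported above.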

The key performance benchmarks involve the computation of several Greeks for an American-exercise call option written against a basket of one or more correlated assets. Option sensitivities are computed for changes in the risk-free rate (Rho), time to expiration (Theta), four model parameters (Model Vega), single and paired initial asset prices (Delta, Gamma, Cross Gamma) and paired correlations (Correlation Vega). The option is evaluated by a Monte Carlo simulation of Andersen's QE formulation of the Heston stochastic volatility model [13, 7, 16], followed by American-exercise pricing using the method of Longstaff and Schwartz [18]. The time to option expiration is discretized into a fixed number of timesteps, and varying numbers of Monte Carlo paths are simulated based on the particular benchmark.

As specified, the Greeks are computed using finite differences of scenario values, where each scenario corresponds to either an unmodified option simulation, or a simulation where a model parameter, or one or more of the initial asset values or correlations, has been slightly modified. For example, Theta is computed as (y_Δt − y)/Δt, where y_Δt is the scenario value obtained by changing the time to expiration by Δt, and y is the base scenario value of the unmodified option. Each scenario is conceptually a paths × timesteps array of simulated asset-basket values. The option being evaluated is a lookback, best-of option where the basket value is defined as the maximum realized value of any asset in the basket from the initiation of the deal to the current time. Each scenario array is eventually reduced to a single number, the value of the option under that scenario.

The key benchmarks measure performance while computing the full set of Greeks using various numbers of correlated assets, timesteps and paths, as shown in Table 1. The results included are for runs that either do (cold) or do not (warm) include memory allocation overhead. The number of scenarios is fixed by the number of assets, while the number of arrays and memory requirements are specific to our implementation. For the baseline and large problems the goal is to compute the Greeks as quickly as possible using all available system resources. For the scaling workloads the goal is to maximize the number of assets or paths for which the Greeks computation completes within 10 minutes. Although the S824 handles problem sizes perhaps larger than those normally found in practice, the scalability tests are very important for rounding out the coverage of the benchmark (§6.2).
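The finite-difference step from scenario values to Greeks is then simple arithmetic. A minimal illustration only (the scenarioValue callback and the choice of bumps are our assumptions, not the benchmark's prescribed increments):

#include <functional>

// Sketch: Greeks as finite differences of scenario values (illustrative only).
// scenarioValue(dT, dS) is a hypothetical callback returning the Monte Carlo
// value of the option with time-to-expiration bumped by dT and the initial
// asset price bumped by dS.
struct Greeks { double theta, delta, gamma; };

Greeks finiteDifferenceGreeks(const std::function<double(double, double)>& scenarioValue,
                              double dT, double dS) {
    const double y    = scenarioValue(0.0, 0.0);    // base (unmodified) scenario
    const double y_dT = scenarioValue(dT, 0.0);     // time to expiration changed by dT
    const double y_up = scenarioValue(0.0, +dS);    // initial price bumped up
    const double y_dn = scenarioValue(0.0, -dS);    // initial price bumped down

    Greeks g;
    g.theta = (y_dT - y) / dT;                      // forward difference, as in the text
    g.delta = (y_up - y_dn) / (2.0 * dS);           // central difference
    g.gamma = (y_up - 2.0 * y + y_dn) / (dS * dS);  // second central difference
    return g;
}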
3. HARDWARE PLATFORM

The results reported here were obtained using an IBM Power Systems S824 server with 2 POWER8 processor modules, providing 24 total POWER8 cores, all running at the model-specific maximum frequency of 3.925 GHz. Each core has a 512KB private L2 cache and an 8MB L3 cache which may be shared by other cores if needed. The 96MB total L3 cache per processor module is more than twice the amount provided by competing Intel Xeon processors, e.g., the Intel Xeon E7-8890 v3 and E5-2699 v3 only provide 45MB. Later in the paper we point out places where the large L3 cache is important to the STAC-A2 workload.

Each POWER8 core provides four fully functional double-precision floating-point pipelines supporting atomic multiply-add, square root and divide. Our STAC-A2 application makes extensive use of the Power ISA Vector-Scalar Extension (VSX) instructions which operate on 128-bit Vector-Scalar Registers. VSX instructions include scalar functions, 2-way double-precision SIMD and 4-way single-precision SIMD operations. For 2-way SIMD, pairs of double-precision pipelines operate independently, allowing two hardware threads to execute SIMD operations simultaneously.

The POWER8 core [23] supports up to 8-way simultaneous multithreading, or SMT. The core dynamically changes SMT modes based on the number of non-idle threads, entering single-threaded (ST, or SMT1), SMT2, SMT4 or SMT8 modes whenever 1, 2, 3–4 or 5–8 threads are active respectively. Each SMT mode defines a different partitioning of the core resources between the threads, e.g., dispatch logic, execution pipelines and register files. The optimal SMT mode is highly workload dependent and best determined by experiment. Our Greeks application is most efficient in SMT4 mode, where the system-level performance is 2.1 times the single-threaded performance for the Baseline workload.

It is important to distinguish between the number of hardware threads configured per core, which sets an upper bound on SMT mode, and the number of hardware threads active per core, which determines the actual SMT mode. In this work we always configure eight hardware threads per core, then bind application threads to unique hardware threads using the pthread_attr_setaffinity_np() API. Throughout the paper, when we describe a configuration as SMTn, we mean that we have bound n software threads to n of the eight available hardware threads per core, and the cores execute in SMTn mode for the majority of the run. However if the kernel activates an idle hardware thread with a periodic task or interrupt then the core may transition into a higher SMT mode until the thread idles again. The n-of-eight thread binding we use has little or no performance advantage or disadvantage over configuring only n hardware threads per core, but does allow the application to select the optimum SMT mode for each individual benchmark without reconfiguring the system.
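A minimal sketch of the binding described above, assuming the common Linux numbering in which the eight hardware threads of a core have consecutive CPU ids (the worker function and the core*8 + slot mapping are our illustrative assumptions):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

void* worker(void* arg);   // application thread body (not shown)

// Spawn a software thread bound to exactly one hardware thread (Linux CPU id).
int spawnBoundThread(pthread_t* tid, int cpuId, void* arg) {
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(cpuId, &cpus);                 // e.g., cpuId = core * 8 + smtSlot
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
    int rc = pthread_create(tid, &attr, worker, arg);
    pthread_attr_destroy(&attr);
    return rc;
}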
The system includes 1 TB of 1600 MHz DDR3 memory arranged as 16 × 64GB Centaur DIMMs (CDIMM, [24]), so named for the Centaur (half buffer, half cache) memory buffer ASIC present on each CDIMM. Each Centaur has 4 DDR ports for memory integrated on the CDIMM. The Centaur accelerates memory operations by aggregating industry-standard DRAM through a high-speed link to the processor. Each of the 16 Centaurs also includes 16 MB of level-4 cache, for a system-level total of 256MB of L4. POWER8 is specified to provide up to 192 GB/second of memory bandwidth per processor module, and others have reported achieving 85% of this figure with the STREAMS triad benchmark [6].

The system is virtualized as a single dedicated partition using the IBM PowerVM hypervisor. We run an unmodified version of Red Hat Enterprise Linux 7. We disable the transparent hugepage support provided by recent Linux kernels in favor of large (64KB) pages. Although using huge (16MB) pages does improve warm run performance slightly, allocating and coalescing huge pages inordinately penalizes the times for cold runs which are also reported as benchmarks. We use the IBM XL C/C++ compiler suite along with the IBM MASS and ESSL numerical libraries, a configuration we find to provide significantly better performance than current open-source alternatives.

POWER8 based Linux systems offer a choice between traditional big-endian operation, and little-endian environments providing compatibility and simplified porting for applications developed on Intel X86 systems. Although the results reported here are for a big-endian environment, several little-endian Linux distributions are also now available for the S824 and other POWER8 based Linux servers.

4. PROGRAM STRUCTURE

STAC-A2 is a high-level specification expressed as an R-language reference model. Each benchmark participant develops a complete, unique and proprietary solution to the problem optimized for their particular hardware and software environment. Here we describe our approach to the problem. Because other solutions are largely undisclosed we can not provide detailed comparisons.

Figure 1: Option pricing application flow.

Figure 1 illustrates the conceptual flow of our application. We use the AES-acceleration instructions introduced in POWER8 to implement the high-quality ARS5 [22] cryptographic random number generator (RNG). The RNG process includes converting uniform variates to unit-normal variates and correlation between assets. Monte Carlo simulation is implemented directly from the reference [7], and the raw path data is post-processed to create O(assets²) paths × timesteps arrays of asset-basket values under various conditions. All arrays are generated simultaneously from a single set of random numbers. The amount of data is quadratic in the number of assets because Cross Gamma and Correlation Vega measure sensitivities to changes in price or correlation respectively between all pairs of assets (the number of pairs C(N, 2) is O(N²)). Table 1 details the number of arrays and the total data storage required for each workload. Many of the arrays directly correspond to scenarios, while in other cases the scenarios are generated on-the-fly during pricing by combining arrays. Further details are beyond the scope of this paper, but are not required to understand the algorithms and performance results presented.

Figure 2: Application threading model.

STAC-A2 performance benchmarks measure the latency of computing a large set of Greeks for a particular set of parameters, therefore parallel program structure and load balancing are key to competitive results. We implemented our multithreaded C++ application using direct calls to POSIX threads APIs in order to maximize the potential for experimentation and analysis. A high-level view of the application parallel structure is presented in Figure 2. In (a), the master thread T0 performs initialization and then spawns threads T1, ..., Tn−1 which perform the Monte Carlo simulation in parallel (b). Each thread in our application simulates more-or-less equal numbers of paths for every array. Simulation data structures are carefully aligned to avoid false sharing, and the granularity of work is typically a set of 16 paths, where 16 is the number of double-precision values in the POWER architecture 128-byte wide cache line. Simulation work is scheduled statically, but threads finishing early can "steal" 16-path quanta of work from other threads.
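A minimal sketch of the 16-path work granularity with simple stealing (the names and range bookkeeping are our illustrative assumptions, not the benchmark code):

#include <atomic>
#include <cstddef>

constexpr std::size_t kQuantum = 16;   // paths per work unit = doubles per 128-byte line

struct PathRange {
    std::atomic<std::size_t> next{0};  // next path index to claim in this thread's range
    std::size_t end{0};                // one past the last path of the range
};

// Claim the next 16-path quantum from a range; false when the range is exhausted.
bool claim(PathRange& r, std::size_t& first) {
    first = r.next.fetch_add(kQuantum, std::memory_order_relaxed);
    return first < r.end;
}

// A thread that finishes its own range early can "steal" quanta from other ranges.
bool stealFrom(PathRange* ranges, std::size_t nThreads, std::size_t self, std::size_t& first) {
    for (std::size_t i = 0; i < nThreads; ++i)
        if (i != self && claim(ranges[i], first)) return true;
    return false;
}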
We do not begin American-exercise pricing until all threads have joined after Monte Carlo simulation. Once joined the threads are typically more-or-less evenly grouped into cohorts of one or more threads; Figure 2 (c) illustrates cohorts of 2 threads. Each cohort is fully responsible for pricing one or more scenarios using the Longstaff-Schwartz algorithm as discussed in §5.3. For example, the standalone Delta benchmark for 5 assets requires pricing 10 scenarios. On our 24-core system in SMT4 mode, the 96 threads are grouped into 10 cohorts of 9 or 10 threads and all scenarios are priced in parallel using a static schedule. Once the number of scenarios exceeds the number of available threads we typically treat each thread as a cohort of one. Load balancing is accomplished by having cohorts choose scenarios to price from a global pool. Once pricing completes the master thread computes the final Greeks (Figure 2 (d)).

The scheme described above is modified when simulating large numbers of paths, where the storage required to simulate and price all of the scenarios exceeds physical memory. In this case the workload is statically partitioned into multiple phases, where each phase simulates and prices a subset of the scenarios. We will also explain in §5.3 why it is advantageous to price workloads with large numbers of paths using a single cohort consisting of all threads. We refer to this as the all-threads-cohort (ATC) mode, distinguishing it from the small-threads-cohort (STC) mode where smaller numbers of threads form the cohorts.

5. ALGORITHMS

We found the STAC-A2 benchmark to be a rich source of algorithm and performance optimization problems, and we describe a few of the problems and our solutions here. Please note that unless otherwise indicated, the experimental results included here and in §6 do not define STAC benchmarks or report official STAC-A2 benchmark results, and may use code configurations that have not been audited by STAC. Comparisons with competing systems are only made with reference to the audited benchmark report [4].

5.1 Path Generation

The benchmark specifies the use of antithetic paths, a variance reduction technique used to speed convergence of Monte Carlo simulation [13]. For each set of random variates Z0, ..., Zn that define a path, the antithetic path is also simulated with the set −Z0, ..., −Zn. We simulate the original and antithetic paths in parallel using 2-way double-precision SIMD operations to create a path-pair. A path-pair is the element-wise interleaving of the original (P) and antithetic (N) paths

    P(0), N(0), ..., P(timesteps − 1), N(timesteps − 1)

as a double-precision vector. Following post-processing we deinterleave the paths during the matrix transposition discussed in §5.2. Note that in the following if path P has path index p, the antithetic path N is always path p + 1.
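A minimal sketch of one timestep of a path-pair, where the two lanes (original and antithetic) map naturally onto a single 2-way double-precision VSX operation; the log-Euler step used here is an illustrative stand-in, not Andersen's QE scheme:

#include <cmath>

// pair[0] = original path value, pair[1] = antithetic path value.
void stepPathPair(double pair[2], double drift, double vol, double sqrtDt, double z) {
    const double zLane[2] = { z, -z };     // the antithetic lane negates the variate
    for (int lane = 0; lane < 2; ++lane)   // maps onto one 2-way SIMD operation in practice
        pair[lane] *= std::exp(drift + vol * sqrtDt * zLane[lane]);
}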

5.2 Price Data Transposition

Monte Carlo simulation creates price data in time-order along each path, while American-exercise pricing analyzes the prices for all paths at a given timestep. Straightforward, cache-efficient pricing therefore requires transposing simulation results from time-major into path-major storage. Our application is too memory-constrained for large problems to implement a full out-of-place transpose, which would effectively double the storage requirements. Although in-place transposition requires little or no extra storage, the most cache-efficient algorithm known requires each array element to be copied at least four times during the permutation [15]. We need to transpose the data with minimal extra storage and data movement, while taking advantage of the fact that the path data is "hot" in the caches.

A straightforward copy of a path-pair P into a column-major paths × timesteps array A is disastrous for performance (Alg. 1). Since the data access stride is at least the number of paths, each column of A is in a unique virtual page (stressing translation mechanisms), and cache lines of A must be brought into the highest level caches multiple times before they are fully populated.

Alg. 1 Insert path-pair P into array A at path-index p.
    for t in 0, ..., timesteps − 1 do
        A[p, t] ← P[2t];  A[p+1, t] ← P[2t+1]
    end for
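Written out in C++ for a column-major array (our rendering of Alg. 1, not the production code), the large stride is explicit: consecutive writes for one path are paths doubles apart, so each iteration touches a different page and cache line:

#include <cstddef>

// Naive insertion of an interleaved path-pair P (length 2*timesteps) into the
// column-major paths x timesteps array A at path index p.
void insertPathPairNaive(double* A, std::size_t paths, std::size_t timesteps,
                         const double* P, std::size_t p) {
    for (std::size_t t = 0; t < timesteps; ++t) {
        A[p     + t * paths] = P[2 * t];       // original path
        A[p + 1 + t * paths] = P[2 * t + 1];   // antithetic path
    }
}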

One solution is to use a time-linear buffer to store path data for a small set of paths, then copy the data into path-linear storage once the buffer is full. Recall from §4 that each thread is simultaneously generating paths for each row of every array to amortize the cost of random number generation. Since there are 16 double-precision values in a POWER8 cache line, we can allocate an 8 × (2 × timesteps) buffer for each thread for each array. Once 8 path-pairs have been inserted the copy is done such that each path-linear destination cache line is fully populated whenever touched (Alg. 2). In our system with 96 threads this requires storage for 1536 extra paths per array, which is a reasonable overhead.

Alg. 2 Insert path buffer B into array A at path-index p.
    for t in 0, ..., timesteps − 1 do
        for j in 0, ..., 7 do
            A[p+2j, t] ← B[j, 2t];  A[p+2j+1, t] ← B[j, 2t+1]
        end for
    end for

However in many cases a more efficient solution is to use a blocked, in-place matrix transposition. We pad the destination array if necessary and treat it as a ⌈paths/16⌉ × ⌈timesteps/16⌉ array of 16 × 16 blocks. Interleaved path-pairs are deinterleaved and inserted in a way that fully populates destination cache lines, but creates each block as the transpose of its final form (Alg. 3). Once all paths have been inserted a straightforward square matrix transpose of each block completes the operation. Although each datum is still copied twice, the source and destination cache line sets are identical here in the second copy, often making this approach more efficient than the buffered insertion.

Alg. 3 Blocked insertion of (padded) path-pair P into array A at path-index p, for block size S.
    k ← p mod S;  t ← 0
    for b in 0, ..., ⌈timesteps/S⌉ − 1 do
        for i in 0, ..., S − 1 do
            A[p+i, Sb+k] ← P[2t];  A[p+i, Sb+k+1] ← P[2t+1]
            t ← t + 1
        end for
    end for

Figure 3: Illustration of the blocked matrix transpose for a block-row of an m × 12 array with 4 × 4 blocks, for 10 underlying timesteps.

Figure 3 illustrates the operation of Alg. 3 for hypothetical 4 × 4 blocks and 10 timesteps. In (A), data for interleaved paths a and b, a0, b0, ..., a9, b9, is deinterleaved and inserted into a row of blocks. In (B), data for interleaved paths c and d is inserted. Once all paths have been inserted, transposing the 4 × 4 blocks brings the path data into the correct orientation (C). Note that this algorithm requires the symmetry of a square block. A similar technique has been suggested in the context of matrix transposition in GPU memories [21].

The transposition of the square blocks can either be done immediately after the last path of a set is inserted, or deferred until the arrays are fully populated. Immediately transposing a row of blocks suffers from large-stride issues, but could be advantageous if the data resides in the L3 cache. Deferring transposition largely eliminates the problems of large strides in this phase because blocks can be transposed block-column by block-column, following the natural layout of column-major memory (Alg. 4).

Alg. 4 Deferred transposition of blocked array, for block size S.
    for j in 0, ..., ⌈timesteps/S⌉ − 1 do
        for i in 0, ..., ⌈paths/S⌉ − 1 do
            Transpose 16 × 16 block B[i, j]
        end for
    end for

The buffered and blocked transpose algorithms presented here both benefit from manual prefetching, i.e., inserting instructions that provide hints to the hardware about memory access patterns [19]. Prefetch instructions (DCBT) are included by adding inline assembler calls to the otherwise straightforward C++ code. Manual prefetching improves the performance of Monte Carlo simulation by approximately 5% for the blocked algorithm. Referring to Figure 3 for blocked insertion, prior to inserting data for a0, ..., b3, we issue prefetch instructions for the two cache lines that will be populated by a4, ..., b7, and so on until the final block of a row. For buffered insertion and the block transpose we prefetch the 16 lines to be updated or transposed next.
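A minimal sketch of the deferred per-block transpose with a DCBT touch hint (the prefetch target and helper names are our illustrative assumptions; the inline assembly is meaningful only on POWER):

#include <cstddef>
#include <utility>

constexpr std::size_t BLOCK = 16;

static inline void prefetch(const void* addr) {
#if defined(__powerpc__) || defined(__powerpc64__)
    __asm__ volatile("dcbt 0,%0" : : "r"(addr));   // data cache block touch
#else
    (void)addr;                                    // no-op elsewhere
#endif
}

// In-place transpose of one 16x16 block of the column-major array A (leading
// dimension lda); (r0, c0) is the block origin. Optionally hint the next block.
void transposeBlockInPlace(double* A, std::size_t lda, std::size_t r0, std::size_t c0,
                           const double* nextBlock /* may be nullptr */) {
    if (nextBlock) prefetch(nextBlock);
    for (std::size_t i = 0; i < BLOCK; ++i)
        for (std::size_t j = i + 1; j < BLOCK; ++j)
            std::swap(A[(r0 + i) + (c0 + j) * lda],
                      A[(r0 + j) + (c0 + i) * lda]);
}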
Table 2 compares the performance of Monte Carlo simulation for the workloads defined in Table 1 and other parameterizations denoted as assets:timesteps:paths (warm runs). Buffered and blocked insertion performance includes optimized prefetch. The table also shows the SMT4 active working set (WS) of a POWER8 core during the simulation, i.e., the storage required to hold 16 paths for every array generated by each of the 4 software threads.

Workload         WS, MB   a      b      c     d
Baseline         6.4      0.63   0.94   1*    0.91
Large Problem    93.5     0.46   1.00   1     1.04*
Asset Scaling    797      0.85   0.95   1     1.06*
Path Scaling     6.4      0.63   1.01*  1     0.93
5:126:25,000     3.2      0.75   0.97   1*    0.88
5:1260:25,000    32.0     0.37   1.07*  1     1.02

Table 2: Relative performance of Monte Carlo simulation for different transpose methods, higher is better, best marked *: a) simple insertion; b) buffered insertion; c) blocked insertion, immediate transpose; d) blocked insertion, deferred transpose.

Any technique other than the simple insertion reduces the overhead of transposition such that the final differences between the best performing methods are small, but generally repeatable. We prefer the blocked insertion due to its overall performance and simplicity. Although we have not done exhaustive studies, our data suggests choosing whether or not to defer the final block transpose based on whether the per-core working set appears to fit in the 8MB L3 cache. (Since we don't control placement of the numerous arrays involved we can't be sure the working set is actually fully resident.) This heuristic is easy to compute from the workload parameters. Although blocked insertion has no memory overhead and provides the best performance for many workloads, there are also cases where the buffered copy performs best and we continue to investigate this area.

5.3 Longstaff-Schwartz Pricing

Longstaff and Schwartz published their least-squares Monte Carlo (LSMC) approach to valuing early-exercise options in 2001 [18]. The original reference is a well-written description of this popular numerical algorithm and its mathematics. Next we describe an implementation of the Longstaff-Schwartz algorithm that was motivated by the STAC-A2 benchmark problem but can be generalized to other situations. We speculate NVIDIA used a similar approach [11].

5.3.1 Least Squares Fitting via SVD

The challenge in valuing an early-exercise option is that the holder must continuously decide whether to exercise the option immediately, or in the future. On the exercise date the option holder exercises the option if it is in the money. At a time prior to expiration the holder will exercise an in-the-money option if the future discounted cash flow from holding the option is expected to be less than the current value, and the value of the option is maximized if the exercise happens as soon as this is true. LSMC estimates the future value by least squares regression using a cross section of simulated data, approximating this value as a linear combination of basis functions.

We assume a timestep t (other than the final timestep). Let n be the number of basis functions, and A be the paths × n design matrix where each A_ij is the application of basis function j to the underlying value on path i at time t. Let b be a column vector where each b_i is the future discounted cash flow along path i. A and b are restricted to only consider paths where the option is in-the-money at time t. The regression problem is to find a set of coefficients x that minimizes ‖Ax − b‖₂. Once the coefficients are known the estimated continuation value is computed from the current path value and the basis functions.

Compared to the pseudo-inverse method, the singular value decomposition (SVD) is preferred to implement the LSMC regression due to its superior numerical stability [8]. Let the SVD of A be UΣVᵀ, where Σ is diagonal and both U and V are unitary. The solution to the regression problem is then x = VΣ⁻¹Uᵀb, where Σ⁻¹ is the inverse of Σ.

Computing the coefficients x at time t requires the future value b computed at time t + 1, and so on. Thus LSMC is inherently a serial algorithm operating backwards in time. One way of partially parallelizing the algorithm is to notice that the SVD of A can be computed in advance at every timestep, since the SVD only depends on the path values, which are all available at the end of Monte Carlo simulation [8]. Finishing the regression can be deferred until the future value is known. However there are issues with scaling this technique. If a thread computes an SVD in advance it must either wait for the future value to be available before continuing, or store the SVD until that thread or another thread can complete the regression. Load balancing is also an issue since any number of threads can be computing SVDs for a number of timesteps in parallel, but the final regression is still ultimately serial in time.
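For small n the SVD-based solve above is just two small matrix-vector products. A naive illustration only (dense loops with column-major storage; the production code relies on tuned library routines):

#include <vector>
#include <cstddef>

// Given the thin SVD A = U * diag(sigma) * V^T with U (m x n, column-major),
// sigma (n values) and V (n x n, column-major), return x = V * inv(Sigma) * U^T * b.
std::vector<double> svdSolve(const std::vector<double>& U, const std::vector<double>& sigma,
                             const std::vector<double>& V, const std::vector<double>& b,
                             std::size_t m, std::size_t n) {
    std::vector<double> c(n, 0.0), x(n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {          // c = inv(Sigma) * U^T * b
        double dot = 0.0;
        for (std::size_t i = 0; i < m; ++i) dot += U[i + j * m] * b[i];
        c[j] = dot / sigma[j];
    }
    for (std::size_t i = 0; i < n; ++i)            // x = V * c
        for (std::size_t j = 0; j < n; ++j) x[i] += V[i + j * n] * c[j];
    return x;
}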
5.3.2 Parallel SVD of Vandermonde Matrices

The computation of the SVD for "tall and skinny" matrices A_{m×n} (m ≫ n) is usually by means of the QR decomposition, as in:

    A_{m×n} = Q_{m×n} R_{n×n}

where Q is unitary and R is upper triangular. The QR decomposition is usually handled by Householder reflection or Givens rotation [14]. Since R = UΣVᵀ is the SVD of R, then A = (QU)ΣVᵀ is the SVD of A.

LSMC has been found to be robust in the choice of basis functions, and the STAC-A2 specification requires using polynomial basis functions. If p_i is the current price on path i, then each row of A has the form of:

    [ 1   p_i   p_i²   ...   p_i^(n−1) ]

hence A is a row Vandermonde matrix. Although the QR decomposition is parallelizable in general [10], for this type of Vandermonde matrix there is a lesser-known QR factorization method which has excellent data locality and a simple parallel implementation [9]. A derivative of the Lanczos method can be applied to perform the QR factorization, such that A = QΣBᵀ, where Q_{m×n} is unitary, Σ_{n×n} is diagonal and positive definite, and B_{n×n} is lower triangular (hence Bᵀ is upper triangular) with values of one (1.0) at diagonal entries. The technical details of the method can be found in [9]. However we want to point out that the algorithm outlined in the original paper has a few typographic errors. The correct algorithm is outlined in Alg. 5.

Alg. 5 QR factorization of row Vandermonde matrix A
    procedure VandermondeQR(p, n)               ⊲ m = length(p), n = polynomial order + 1
        σ₁ ← m
        for j in 2, ..., 2n − 1 do
            B[j, 1] ← ‖p^(j−1)‖₁ / σ₁
        end for
        μ₁ ← B[2, 1];  ν₁ ← B[3, 1]
        σ₂ ← σ₁ (ν₁ − μ₁²)
        for j in 3, ..., 2n − 2 do
            B[j, 2] ← (σ₁/σ₂) (B[j+1, 1] − μ₁ B[j, 1])
        end for
        Q[:, 1] ← 1 / √σ₁
        Q[:, 2] ← (p − μ₁) / √σ₂
        for k in 3, ..., n do
            μ[k−1] ← B[k, k−1];  ν[k−1] ← B[k+1, k−1]
            σ[k] ← σ[k−1] (ν[k−1] − ν[k−2] + μ[k−1] (μ[k−2] − μ[k−1]))
            for j in k + 1, ..., 2n − k do
                B[j, k] ← (σ[k−1]/σ[k]) (B[j+1, k−1] − B[j, k−2] + (μ[k−2] − μ[k−1]) B[j, k−1])
            end for
            Q[:, k] ← √(σ[k−1]/σ[k]) · ((p + μ[k−2] − μ[k−1]) · Q[:, k−1] − √(σ[k−1]/σ[k−2]) · Q[:, k−2])
        end for
        Σ ← diag(√σ₁, √σ₂, ..., √σ[n])
        B ← B[1:n, :]
        return B, Σ, Q
    end procedure
In practice we implement Alg. 5 in parallel in two phases. The first phase of this QR decomposition method is the calculation of the "moments" for the in-the-money paths

    M_j = Σ_{i=0}^{paths−1} p_i^(j−1),   for j = 1, ..., 2n − 1

used to generate B. Each thread first computes and shares partial moments for a set of paths, and then computes the final sums once all partial sums are available. For the benchmark case Bᵀ is a small matrix whose SVD can then be computed in a few microseconds. Since each Q_ij only depends on p_i and the moments, the parallel generation of Q can be deferred until computing the regression coefficients.

Consider the case where the design matrix A is a paths × n array and the paths are partitioned into sets 1, ..., N for N threads. If Bᵀ = UΣ₂Vᵀ is the SVD of Bᵀ, it is easy to show that the least squares solution with a given vector b can be computed as:

    xᵀ = bᵀ Q U Σ⁻¹ Σ₂⁻¹ Vᵀ = bᵀ_{paths} W_{paths×n}

Note that (U Σ⁻¹ Σ₂⁻¹ Vᵀ)_{n×n} is a small constant matrix that is easily multiplied into Q_{paths×n} as Q is generated during a second pass over the path data. Partitioning bᵀ and W by thread,

    xᵀ = [ bᵀ(1)  bᵀ(2)  ...  bᵀ(N) ] × [ W(1); W(2); ...; W(N) ],   where W(j) = [ W(j)₁  W(j)₂  ...  W(j)ₙ ]

shows that each of the coefficients x_i can be computed as the simple sum of partial coefficients x_i = Σ_{j=1}^{N} bᵀ(j) W(j)_i, where each thread j computes n independent product terms.

The result is the parallel Longstaff-Schwartz algorithm for call options using simple polynomial basis functions of degree n − 1, sketched as Alg. 6. Given the risk-free rate r, time to expiration T and strike price K, a partition p of price data P at each timestep is used to compute a partition f of the future value F.

Alg. 6 Parallel Longstaff-Schwartz (call option) for polynomial basis functions, executed by each thread.
    d ← e^(−rT/timesteps)                               ⊲ Per-timestep discount
    for t in timesteps − 1, ..., 0 do
        p ← P[firstpath:lastpath, t]                    ⊲ Price vector at time t
        if t = timesteps − 1 then
            f ← d ∗ max(p − K, 0)                       ⊲ Initial future value
        else
            p̂ ← p : p > K                               ⊲ In-the-money paths
            f̂ ← f : p > K                               ⊲ and future value
            M_j ← ‖p̂^(j−1)‖₁  for j = 1, ..., 2n − 1    ⊲ Moments
            Thread Join; Exchange partial moments
            SVD and QR using M and p̂, producing W
            x_i ← f̂ᵀ × W_i  for i = 1, ..., n           ⊲ Partial coefficients
            Thread Join; Exchange partial coefficients
            ŝ ← [1  p̂  p̂²  ...  p̂^(n−1)] × x            ⊲ Least-squares fit
            f ← d ∗ (p̂ − K  where p̂ − K > ŝ;  f otherwise)    ⊲ New future value
        end if
    end for
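A minimal sketch of the two per-thread reductions used above (partial moments, then partial coefficients), with the "Thread Join" synchronization of Alg. 6 omitted; names and data layouts are our illustrative assumptions:

#include <vector>
#include <cstddef>

// Phase 1: partial moments over this thread's in-the-money prices pHat.
// M[j] holds sum_i pHat_i^j (the paper's M_{j+1}), for j = 0 .. 2n-2.
std::vector<double> partialMoments(const std::vector<double>& pHat, std::size_t n) {
    std::vector<double> M(2 * n - 1, 0.0);
    for (double p : pHat) {
        double pk = 1.0;
        for (std::size_t j = 0; j < 2 * n - 1; ++j) { M[j] += pk; pk *= p; }
    }
    return M;
}

// Combine per-thread contributions (moments or coefficients) by element-wise sum.
void accumulate(std::vector<double>& global, const std::vector<double>& part) {
    for (std::size_t j = 0; j < global.size(); ++j) global[j] += part[j];
}

// Phase 2: partial coefficients x_i = fHat^T * W_i over this thread's rows of W
// (column-major, leading dimension = number of this thread's in-the-money paths).
std::vector<double> partialCoefficients(const std::vector<double>& fHat,
                                        const std::vector<double>& W, std::size_t n) {
    const std::size_t rows = fHat.size();
    std::vector<double> x(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t r = 0; r < rows; ++r) x[i] += fHat[r] * W[r + i * rows];
    return x;
}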
Once the algorithm terminates the scenario value is simply the average cash flow at time 0, ‖F‖₁/paths. The amount of data exchanged between threads is trivial for reasonable values of n: n + 2 partial moments and n partial coefficients per thread. We actually compute ŝ by row, and p̂ and f̂ are not computed as separate vectors. Instead we use index vectors and other techniques to partition p and f between in- and out-of-the-money paths in place, giving the implementation the character of a sparse vector algorithm.

5.3.3 Discussion

We originally implemented Longstaff-Schwartz in parallel following [8], but found that performance for pricing a scenario did not improve for cohorts of more than two threads!

In contrast, for the Baseline case (25K paths) Alg. 6 performance continues to improve up to 16 or 20 SMT4 threads per cohort, at which point synchronization overheads become excessive. Thus our approach is very effective for problems with fewer scenarios to price than available threads.

Once the number of paths exceeds about 500K, very significant performance advantages begin to accrue from pricing all scenarios serially using a cohort of all threads (ATC mode), primarily due to eliminating the imbalance from the variations in the times required to price individual scenarios. ATC mode naturally load balances scenario pricing since the threads move together in lock step through each timestep in every scenario, and statistically do equal amounts of work if the paths are partitioned equally. ATC mode also benefits from better cache utilization and ideal NUMA locality. The more threads per cohort, the smaller the per-thread partitions p and f, increasing the chances that data remains L3 cache-resident. In ATC mode a thread prices the same partition of path data that it created during simulation. Thus the majority of data will be held in memory allocated "close" to each core-pinned thread, in contrast to single-thread cohorts where each thread processes a cross section of data created by every other thread.

ATC mode improves the end-to-end performance of the Path Scaling workload by over 20% when simulating 28M paths. As future work we would like to reduce the synchronization overhead such that we could gain the benefits of ATC mode for the Large Problem (100K paths), whose size is more representative of real-world problems.

6. PERFORMANCE ANALYSIS

6.1 Core- and System-Level Scalability

The STAC-A2 benchmark includes a scaling experiment, and the audited results for the POWER8 S824 are summarized in Table 3 (data courtesy STAC). The table lists relative speedups observed when scaling the Baseline problem from a single thread on a single core to all threads on 24 cores. SMT4 mode provides an approximate 2X speedup over SMT1 at each scale. Performance rolls off in SMT8 mode, which is not unusual for an HPC application [6].

Cores   SMT1   SMT2   SMT4   SMT8
1       1      1.6    2.0    2.0
12      11.1   18.7   22.7   21.9
24      20.9   34.6   44.3   41.4

Table 3: STAC-A2 Baseline workload speedup.

System-level scalability is an important factor in the end-to-end performance of parallel applications like STAC-A2. Table 3 also shows that this POWER8 based system achieves (44.3/2.0)/24 = 92% of the theoretical scalability when scaling the workload from 1 to 24 SMT4 cores. The scalability of POWER8 stands in contrast to a recent STAC-A2 result for a 2-socket Intel Xeon E5-2699 v3 based system that showed only 73% of theoretical scalability for the same workload when scaling from 1 to 36 cores with Intel Hyper-Threading Technology enabled [1].

6.2 Impact of Memory Bandwidth

When the S824 benchmark audit set a new record for Path Scaling by a wide margin, we were interested in whether we could explain this result in terms of memory bandwidth. The POWER8 processor in the S824 system has eight memory channels per socket, each supporting a single CDIMM (§3). The number of populated memory channels has a linear impact on available memory bandwidth regardless of CDIMM size. We created full (1.0, 1TB), half (0.5, 512GB) and quarter (0.25, 256GB) bandwidth configurations by populating 8, 4 and 2 slots per socket respectively with 64GB CDIMMs to test this hypothesis. (Note full bandwidth is also available starting at 256GB with smaller CDIMMs.)

Figure 4: Memory bandwidth and performance.

Figure 4 lists the performance of different runs, each at three different levels of available memory bandwidth normalized to the run with maximum memory bandwidth. The Baseline and Large Problem workloads are standard. For asset scaling we use 65 assets. For path scaling we look at 1M, 8M and 16M paths, for both the small-threads-cohort (STC) and the all-threads-cohort (ATC) modes (§4).
Scaling beyond 8M paths or 65 assets requires more than 256GB of memory and so does not provide comparison points for lower levels of memory capacity/bandwidth. The 16M paths can be run with 512GB of memory but not with 256GB.

The Asset Scaling problem is purely compute bound and available memory bandwidth appears to have negligible impact on performance. This workload is completely dominated by the cache-contained O(assets²) triangular matrix multiplication used to correlate random numbers, and the O(assets²) Monte Carlo simulation process.

As the number of paths increases from 25K (Baseline) to 100K (Large) to millions (Path Scaling), there is increasing impact on performance from reducing memory bandwidth. Since the parallel Longstaff-Schwartz algorithm (§5.3) requires three passes over the price data for each timestep, memory bandwidth is critical as cache capacities are exceeded with larger numbers of paths. The ATC version significantly reduces the cache pressure by parallelizing each scenario across all the threads, lowering the cache requirement per thread. Consequently, it is less memory bound and sees a smaller performance reduction for a given bandwidth reduction compared to the STC version. Even so, memory bandwidth remains very important to top performance on the Path Scaling workload.

The POWER8 S824 currently still holds the record in the STAC-A2 Path Scaling benchmark, in part because of its high bandwidth. The only comparable reported throughput on this benchmark is for a 4-socket Intel Xeon E7-8890 v3 based server which has system memory bandwidth roughly equivalent to the S824 [5]; however, it requires 2X the number of processors (3X the number of cores) as the POWER8 based system for the similar result.

7. CONCLUSIONS

STAC-A2 is a well-rounded HPC benchmark that stresses a system at scale. It includes dense matrix and sparse vector algorithms, CPU-bound Monte Carlo simulation and memory-intensive American-exercise pricing. In this paper, we discussed different algorithmic optimizations and demonstrated their benefits on the IBM POWER8 S824 platform. We also discussed different system characteristics of the platform and their importance to various aspects of the benchmark. Our results demonstrate that POWER8 systems are highly competitive platforms for computational finance.

IBM, NVIDIA and other major technology companies recently announced the formation of the OpenPOWER Foundation, an organization dedicated to system designs centered on the POWER microprocessor. OpenPOWER systems include unique support for heterogeneous computing, including the Coherent Attached Processor Interface (CAPI) and the NVIDIA NVLink interconnect. We expect even more interesting optimization opportunities for heterogeneous financial workloads with future OpenPOWER systems.

8. ACKNOWLEDGMENTS

We would like to thank Kenneth Hill of the University of Florida and Julien Demouth of NVIDIA for their technical assistance and insights.

9. REFERENCES

[1] STAC-A2 results, SUT ID: INTC140814, 2014. http://www.stacresearch.com/INTC140814.
[2] STAC-A2 results, SUT ID: INTC140815, 2014. http://www.stacresearch.com/INTC140815.
[3] STAC-A2 results, SUT ID: NVDA141116, 2014. http://www.stacresearch.com/NVDA141116.
[4] STAC-A2 results, SUT ID: IBM150305, 2015. http://www.stacresearch.com/IBM150305.
[5] STAC-A2 results, SUT ID: INTC150811, 2015. http://www.stacresearch.com/INTC150811.
[6] A. V. Adinetz et al. Performance evaluation of scientific applications on POWER8. In High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, volume 8966 of LNCS, pages 24–45. Springer, 2015.
[7] L. B. Andersen. Efficient simulation of the Heston stochastic volatility model. http://ssrn.com/abstract=946405, Jan 2007.
[8] A. Choudhury et al. Optimizations in financial engineering: The least-squares Monte Carlo method of Longstaff and Schwartz. In Proc. 2008 IEEE Intl. Par. Dist. Proc. Symp. (IPDPS), pages 1–11, April 2008.
[9] C. J. Demeure. Fast QR factorization of Vandermonde matrices. Linear Algebra and its Applications, 122/123/124:165–194, 1989.
[10] J. Demmel et al. Communication-optimal parallel and sequential QR and LU factorizations. Technical Report UCB/EECS-2008-89, Aug 2008. http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-89.html.
[11] J. Demouth. Monte-Carlo simulation of American options with GPUs. http://on-demand.gputechconf.com/gtc/2014/presentations/S4784-monte-carlo-sim-american-options-gpus.pdf, Apr 2014.
[12] E. Fiksman and S. Salahuddin. STAC-A2 on Intel architecture: From scalar code to heterogeneous application. In Proc. the 7th Workshop on High Perf. Comp. Finance (WHPCF), pages 53–60, Nov 2014.
[13] P. Glasserman. Monte Carlo Methods in Financial Engineering. Applications of Mathematics: Stochastic Modelling and Applied Probability. Springer, 2003.
[14] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[15] F. Gustavson, L. Karlsson, and B. Kågström. Parallel and cache-efficient in-place matrix storage format conversion. ACM Trans. Math. Softw., 38(3):17:1–17:32, Apr. 2012.
[16] S. Heston. A closed-form solution for options with stochastic volatility with applications to bond and currency options. Review of Financial Studies, 6(2):327–343, 1993.
[17] P. Lankford, L. Ericson, and A. Nikolaev. End-user driven technology benchmarks based on market-risk workloads. In High Perf. Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pages 1171–1175, Nov 2012.
[18] F. Longstaff and E. Schwartz. Valuing American options by simulation: a simple least-squares approach. Review of Financial Studies, 14(1):113–147, 2001.
[19] G. Mateescu, G. H. Bauer, and R. A. Fiedler. Optimizing matrix transposes using a POWER7 cache model and explicit prefetching. SIGMETRICS Perform. Eval. Rev., 40(2):68–73, Oct. 2012.
[20] A. Nikolaev, I. Burylov, and S. Salahuddin. Intel version of STAC-A2 benchmark: toward better performance with less effort. In Proc. the 6th Workshop on High Perf. Comp. Finance (WHPCF), page 7. ACM, 2013.
[21] G. Ruetsch and P. Micikevicius. Optimizing matrix transpose in CUDA. http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf.
[22] J. Salmon et al. Parallel random numbers: As easy as 1, 2, 3. In Proc. 2011 Intl. Conf. High Perf. Computing, Networking, Storage and Analysis (SC), pages 1–12, Nov 2011.
[23] B. Sinharoy et al. IBM POWER8 processor core microarchitecture. IBM Journal of Research and Development, 59(1):2:1–2:21, Jan 2015.
[24] W. Starke et al. The cache and memory subsystems of the IBM POWER8 processor. IBM Journal of Research and Development, 59(1):3:1–3:13, Jan 2015.
All URLs listed above were valid as of October 15, 2015.