Using HPC Computational Physics Tools for Advanced Engineering Simulations and Production Deployment

David Hinz, Sr. Director, IT Engineering Services / HPC

Byron Lengsfield III, Technologist, Recording Subsystems

March 26, 2018

©2017 Western Digital Corporation or its affiliates. All rights reserved. Confidential.

Disclaimer

• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.

Agenda
• Western Digital's History with AWS and GPUs for HPC

• Current Uses of AWS GPUs

• Recommendations for Success

Western Digital: A Heritage of Innovation
14,000+ active patents worldwide

1956 | IBM: Invents First Hard Drive (RAMAC)
1976 | WD: First Sub-System Patent
1988 | SunDisk: Introduces Flash as a Storage Medium
1991 | IBM: Introduces MR Heads
1995 | M-Systems: First Flash Drive
2000 | IBM & M-Systems: First USB Flash Drive
2000 | SanDisk: First MLC NAND
2006 | HGST: First 2.5-inch HDD PMR Technology
2008 | G-Technology: First 1TB Portable External HDD
2009 | SanDisk: First 64GB X4
2009 | Fusion-io: First Enterprise PCIe Flash Accelerators
2013 | HGST: First Helium HDD (6TB)
2013 | HGST: First 12Gb/s SAS SSD
2015 | SanDisk: First 48-layer 3D NAND Technology
2016 | SanDisk: First 64-layer 3D NAND Technology
2017 | SanDisk: First 96-layer 3D NAND Technology

Portfolio of Leading

Leading Portfolio Breadth and Depth
Solutions to Capture, Preserve, Access and Transform Data

(Diagram: the portfolio spans Data Center Devices & Solutions, Client Devices, and Client Solutions, layered from Platforms/Systems/Solutions down through Devices to core technologies: media, heads, controllers, device software, system software, and RISC-V.)

Fueling the Next Decade of Big Data
New technology will enable 40TB drives by 2025, and higher capacities beyond that.

Western Digital Cloud HPC Architecture

• Dedicated AWS account for HPC
  – DirectConnect = main; VPN = backup
• Clusters for the usage queue
  – By organization
  – By application
  – By user
• Autoscaling cluster
  – Autoscales based on the user queue
  – CycleCloud by CycleComputing
  – Master node = job scheduler + NFS
  – Instance store for NFS = i3 instance
  – Optimal instance type for the workload = c4, r4, p2
(Diagram: users attach over SSH & VNC to the master node; client data I/O goes to NFS shared storage; computing nodes run in the CycleCloud cluster in AWS Oregon, connected back to the WDC network.)
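The queue-driven autoscaling described above can be sketched as a simple policy function. This is a hypothetical illustration of the scale-out/scale-in decision, not the CycleCloud API; the function name, `jobs_per_node`, and the node limits are all assumptions.

```python
def desired_node_count(queued_jobs, jobs_per_node=1, min_nodes=0, max_nodes=64):
    """Target cluster size for a queue-driven autoscaler: scale out when
    jobs are waiting, scale back in as the queue drains, and clamp the
    result to [min_nodes, max_nodes]. Hypothetical policy, not CycleCloud's."""
    needed = -(-queued_jobs // jobs_per_node)  # ceiling division
    return min(max(needed, min_nodes), max_nodes)

# 10 queued jobs at 2 jobs per node -> 5 nodes; an empty queue idles the cluster.
print(desired_node_count(10, jobs_per_node=2))  # 5
print(desired_node_count(0))                    # 0
```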

Shape Compute To Match Work To Be Done
• Molecular Dynamics: 1.67x initial overall throughput gain
• Fluid Dynamics: 1.4x overall throughput gain

Molecular Dynamics: Parameter Sweeps
  Model 1: 1.23x – 1.78x throughput increase
  Model 2: 1.01x – 1.67x
  Model 3: 1.23x – 1.69x
  Model 4: up to 2.7x

MicroMagnetics
  Head Drive Interface Vacuum Gaps: 1.99x throughput gain
  Vacuum Gap "collection": 4.00x
  Media Grains for HAMR (FePt/C): 2.03x
  4 Carbon Molecule Clusters: 5.67x

Western Digital's Innovations + the Cloud Advantage = Faster Business Value

INNOVATIONS
• HDD areal density: 4TB to 6TB
• Mechanical innovations: number of disks, air/helium, lubes

CLOUD ADVANTAGE
• Cloud speed: HPC is key to scalability, enabling and driving innovation
• 85,000+ cores, 3+ data centers, 6,000+ servers
• High Performance Computing (HPC): on-premise HPC CPU cluster and HPC GPU cluster

HGST runs the world's largest F500 cloud HPC simulation run, reducing simulation time from 30 days to 8 hours.

Western Digital: AWS and GPUs for HPC

• Western Digital has partnered with AWS for 3+ years to understand workloads and use cases for HPC GPU computing

• Western Digital tested AWS GPU products and provided detailed performance benchmarks

• Western Digital directly engaged GPU suppliers to highlight use cases and requirements

• Actively using AWS GPUs for Machine Learning / Artificial Intelligence and High Performance Computing

HDD Recording System Modeling: Outline

• Modeling work flow
• GPU/AWS realization of the work flow
• Model details
  – Parallelization yields large gains
  – Larger problems: 511-bit patterns, which allow us to use a common set of signal processing tools in experiment and in modeling
  – Optimization in a large parameter space
  – Describing recording physics at a smaller length scale
• Conclusions
  – Seamless migration to AWS
  – Realization of the capabilities of a new generation of GPUs

(HDD) Magnetic Recording

… 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 …

1. Write step (write head)
2. Read step (read head, head motion)
3. Signal processing step

• Input: 2-level write current
• Output: noisy readback waveform
• Goal: reliably store data at the highest possible areal density
(Figure: write and read heads over tracks of data written on the disk; the read sensor recovers the bit stream.)

Work Flow

• Write current waveforms: current waveforms are input to the head model to produce the write field in the media as a function of time (1500 3D head fields).
• Writer model: a micromagnetic or FEM writer model for head fields produces the write field.
• Multi-layer media model (GPUs): multilayer micromagnetic media models generate 400 repeats of a pseudo-random bit sequence (PRBS), which is written by the write field, read back, and fed into a software channel (SWC) for BER and other metrics.
  – 3D media model
  – Many coupled layers
  – Thermal field
• Reader sensitivity function: readback of the written magnetization.
• Readback signal analysis: bit error rate (BER), signal-to-noise ratio (SNR), areal density capability (ADC), distortions.

Fields Generated from an FEM or Micromagnetic Writer Model
Head motion; many critical dimensions and angles.

• Wrap-Around Shield (WAS)
• Write pole; gap between the write pole and the trailing portion of the WAS
• Head-Under-layer Spacing (HUS)
• Soft Under Layer (SUL)
• Write field magnitude and angle vary in all three dimensions
• Competing magnetic flux paths: flux can flow to the SUL or to the shields
• Magnetic field produced in each media layer
FEM = finite-element method, magnetostatic model

Readback Process

3D Reader Sensitivity Function (RSF): the readback signal is the convolution of the RSF with the magnetization (Signal = RSF * Magnetization).

1. An analytic or micromagnetic program generates the RSF.
2. Readback is very fast:
   – All cross-track positions are obtained at one time using a 2D FFT.
   – It is a post-processing step; all magnetization data is on a separate server.
   – The 3D RSF is convolved with the magnetization.
3. Full micromagnetic readback is available but is more compute-intensive.

A different RSF is used for each media layer.
(Figure: 2D cut of the RSF, cross-track vs. down-track; MRW = 26.9923.)
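The 2D-FFT readback step can be sketched in NumPy: convolving a per-layer RSF with the magnetization map yields the readback signal at every cross-track and down-track offset from a single transform pair. Shapes and units here are hypothetical; this is a stand-in for the production post-processing code, not it.

```python
import numpy as np

def readback(rsf, mag):
    """Readback = 2D linear convolution of the reader sensitivity
    function (RSF) with the magnetization, computed with FFTs so all
    cross-track positions come out at once."""
    # Zero-pad to the full linear-convolution size to avoid wrap-around.
    shape = (rsf.shape[0] + mag.shape[0] - 1,
             rsf.shape[1] + mag.shape[1] - 1)
    return np.fft.irfft2(np.fft.rfft2(rsf, shape) * np.fft.rfft2(mag, shape),
                         shape)

# A different RSF per media layer: convolve each layer and sum the signals,
# e.g. signal = sum(readback(rsf_l, mag_l) for rsf_l, mag_l in zip(rsfs, mags))
```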

Multi-layer Media Model: Exchange Between Layers and Grains

At the top surface of the media, the magnetization is aligned perpendicular to the surface along the anisotropy axis |k⟩.

Uniaxial anisotropy energy: Ea = K sin²θ, where θ is the angle between the magnetization and the anisotropy axis.

Macro-spin model (each layer is a macro-spin):
• Long-range magnetic field computed using the CUDA FFT library
• Exchange interaction parallelized over 3D regions

Batch System Job Flow

Linux scripts generate input from templates, which are submitted to the batch system.
• The input template contains hardwired input values plus variables that are specified in the submit script.
• Submit scripts generate an ensemble of inputs on the master node (CPU).
• The batch system distributes command files to the compute nodes (GPUs, coarse-grain parallel).
• Output is moved back to the master node.
• A post-processing step (readback) on the post-processing node generates BER, SNR, ADC, …

(The batch system and all nodes run in AWS.)
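The template-to-ensemble step can be sketched with Python's `string.Template`. The field names (`hus`, `seed`) and the deck format are made-up illustrations; the real submit scripts are Linux shell scripts feeding a batch scheduler.

```python
from string import Template

# Hypothetical input template: hardwired values plus $-variables that
# the submit script fills in for each ensemble member.
TEMPLATE = Template("""\
media_layers = 7
bit_pattern  = PRBS
hus_nm       = $hus
seed         = $seed
""")

def generate_inputs(hus_values, repeats):
    """Expand the template into one input deck per (HUS, seed) pair,
    mimicking the submit scripts that build an ensemble of batch jobs."""
    decks = []
    for hus in hus_values:
        for seed in range(repeats):
            decks.append(TEMPLATE.substitute(hus=hus, seed=seed))
    return decks

decks = generate_inputs(hus_values=[30.0, 36.6], repeats=3)
print(len(decks))  # 2 HUS values x 3 seeds = 6 input decks
```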

GPU Code Implementation
• Example: anisotropy field (a one-body |k⟩ field term)
  – Grain-wise anisotropy field computations are assigned to different cores on the GPU
  – Each layer in a grain runs independently

• CUDA random numbers for the thermal field
• CUDA FFT for the magnetic field
• Parallel 3D group exchange
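The one-body anisotropy term maps naturally onto one grain/layer per GPU thread because each row is independent. Below is a NumPy stand-in for that kernel; the uniaxial field form and parameter names are assumptions for illustration, not the production code.

```python
import numpy as np

def anisotropy_field(m, k_axis, K, Ms, mu0=4e-7 * np.pi):
    """One-body uniaxial anisotropy field per grain/layer:
        H_anis = (2K / (mu0 * Ms)) * (m . k) * k
    m, k_axis: (n, 3) unit vectors; K, Ms: (n,) per-grain constants.
    Every row is independent, which is what makes this term trivially
    parallel (one grain/layer per core in the CUDA version)."""
    proj = np.sum(m * k_axis, axis=1)        # (m . k) for each grain
    scale = 2.0 * K / (mu0 * Ms) * proj      # per-grain field magnitude
    return scale[:, None] * k_axis
```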


Analysis of Small Bit Pattern: ~620 Runs Needed

(Figure: average readback, bit-center averages, and average magnetization via a linear model.)

Noise terms:
1. Jitter from the standard deviation of the transition position
2. DCsnr from the standard deviation of the signal at the bit center
3. Head noise: HEsnr from the amplitude and head noise (size of the smallest signal relative to head noise)
4. T50 (signal width) from the readback

Noise autocorrelation:
• Jitter noise power is dominant at low frequency.
• Non-transition noise is pattern dependent:
  – it increases as the length of the bit increases
  – it can be seen in the average magnetization and as smearing in the autocorrelation matrix

Run time: ~2–3 hours per CPU core vs. 1–3 minutes on a GPU (K80, single precision).
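Noise terms 1 and 2 above (jitter from the spread of the transition position, DCsnr from the spread at the bit center) can be sketched as ensemble statistics. The zero-crossing estimator and waveform shapes below are assumptions for illustration, not the production channel code.

```python
import numpy as np

def transition_jitter(waveforms, x):
    """Jitter = standard deviation of the transition position across
    ensemble members, taking the first zero crossing of each readback
    as the transition (hypothetical estimator)."""
    positions = []
    for w in waveforms:
        i = np.where(np.diff(np.sign(w)) != 0)[0][0]  # first zero crossing
        # Linear interpolation between the bracketing samples.
        x0, x1, y0, y1 = x[i], x[i + 1], w[i], w[i + 1]
        positions.append(x0 - y0 * (x1 - x0) / (y1 - y0))
    return np.std(positions)

def dc_snr(waveforms, center_index):
    """DCsnr in dB from the mean and standard deviation of the signal
    sampled at the bit center."""
    s = np.asarray([w[center_index] for w in waveforms])
    return 20.0 * np.log10(np.abs(s.mean()) / s.std())
```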

Windowing Algorithm for 511-bit PRBS

• CUDA FFT is used in each window.
• The head moves within the window.
• Then the window moves:
  – The head is always near the center.
  – The problem is re-started in each window.

(Figure: windows N−1, N, N+1; motion of the head = local iteration; motion of the window = global iteration.)

Triple-track 511-bit PRBS simulations are now practical with GPUs.
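The window sweep can be sketched as a driver loop: the window advances along the pattern (global iteration), and the solver is restarted inside each window with the head near its center (local iteration). `simulate_window` is a hypothetical callback standing in for the micromagnetic solve over one window.

```python
def windowed_simulation(n_bits, window_bits, step_bits, simulate_window):
    """Sweep a fixed-size simulation window along a long (e.g. 511-bit)
    pattern. Each call to simulate_window restarts the problem on one
    window; the window then advances by step_bits."""
    results = []
    start = 0
    while start < n_bits:
        stop = min(start + window_bits, n_bits)
        results.append(simulate_window(start, stop))  # restart per window
        start += step_bits                            # window moves
    return results

# Example: 511 bits, 64-bit window advancing 32 bits per global step.
spans = windowed_simulation(511, 64, 32, lambda a, b: (a, b))
print(len(spans), spans[0], spans[-1])  # 16 (0, 64) (480, 511)
```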

Discretized Media Model

GPUs give us the ability to simulate recording physics at a smaller length scale.

Many variables:
• Jsii: surface exchange between cells in the internal region
• Jsis/Jssi: surface exchange between cells at the boundary of the internal and shell regions
• Jss: exchange between boundary cells
• Jsb: exchange between the shell and the magnetic boundary (lateral exchange)

Grain properties are partitioned into regions: internal, shell, and boundary.

'7-Layer' Model (capped soft-hard media): alternating Cap / ECL / Media layers on a cubic grid, where ECL = exchange coupling layer.
• Media thickness and vertical exchange are varied.
• The limit of "metallic" vertical exchange between the layers can be studied (thin ECL and ~"metallic" vertical exchange).
• The model can therefore represent a range of physical (sputtered) layers.

Magnetization in each cell is independently propagated in the LLG+T simulation.

Macro-spin Model vs. Discretized Models

• Closure domain formation affects resolution performance:
  – Independent spins in the discretized model can describe the gradual change of magnetization in the cap (top) layer.
  – Closure domains can form in the discretized model.
  – Closure domains lead to poorer resolution than found in the macro-spin model, in agreement with experiment.

In-plane magnetization shows the presence or absence of closure domains.

Improved Domain Wall (DW) Media Performance Under Squeeze

• PMR squeezed: jitter ~2.03 nm, DCsnr ~19.1 dB
• Old DW soft-hard structure squeezed: jitter ~2.30 nm, DCsnr ~13.4 dB
• Improved DW squeezed: jitter ~1.93 nm, DCsnr ~16.1 dB

GPUs make multi-dimensional optimization practical in these systems.
• Optimization algorithms are implemented at the script level (outer loop).
• DW 'optimization' yielded much improved squeeze performance.
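A script-level "outer loop" optimization can be sketched as a grid sweep that launches one simulation per parameter point and keeps the best metric. Here `simulate` and the parameter names (`ecl_nm`, `k_soft`) are hypothetical stand-ins for batch submissions that return jitter.

```python
import itertools

def outer_loop_optimize(simulate, grids):
    """Script-level outer loop: exhaustively sweep a small grid of
    media/head parameters, run one simulation per point, and keep the
    parameter set with the lowest jitter."""
    names = sorted(grids)
    best = None
    for values in itertools.product(*(grids[n] for n in names)):
        params = dict(zip(names, values))
        jitter = simulate(**params)          # one batch job per point
        if best is None or jitter < best[0]:
            best = (jitter, params)
    return best

# Toy objective with its minimum at ecl_nm=1.0, k_soft=0.5.
sim = lambda ecl_nm, k_soft: (ecl_nm - 1.0) ** 2 + (k_soft - 0.5) ** 2 + 1.9
best = outer_loop_optimize(sim, {"ecl_nm": [0.5, 1.0, 1.5],
                                 "k_soft": [0.25, 0.5, 0.75]})
print(best)  # (1.9, {'ecl_nm': 1.0, 'k_soft': 0.5})
```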

Technical Conclusions

• Large throughput gains were seen and new capabilities realized by moving models to GPUs:
  – Enhanced performance: rapid turn-around, optimization in a larger parameter space, minimized CPU-GPU communications
  – Ability to handle larger problems and use the same tools employed in experiments
  – Ability to treat recording physics on a smaller length scale
• AWS realization:
  – Seamless migration from existing GPU clusters
  – Realization of the capabilities of a new generation of GPUs: we saw a large improvement in performance going from K80s to P3s

Recommendations For Success

• Workload expertise is imperative: not all workloads are the same with respect to GPU vs. CPU dependence and parallelization.

• Complete proofs of concept / trials before committing to large deployments: 30-day PoCs allow different configurations and workflows to be evaluated.

• Work with AWS and NVIDIA if the results are not meeting expectations: remove blocking problems through strong partnerships, which are developed in meetings, not in email exchanges.

Realistic Media Distributions

(Figure: exchange strength and volume-dependence distribution; anisotropy and volume distribution; the 2D distribution is binned into a (σK, σV) matrix; mean-field exchange.)

• A large number of parameters requires many simulations to converge signal-to-noise ratio (SNR) and other performance metrics.
• Typically 600 runs are needed to converge transition-position jitter and estimate areal density capability (ADC).
• Ensemble members are run in parallel.
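Running ensemble members in parallel and pooling their statistics can be sketched as follows. A thread pool stands in for the batch system fanning members out to compute nodes, and `ensemble_member` is a made-up placeholder for one micromagnetic run (the 2.0 nm center and 0.1 nm spread are invented numbers, not measured data).

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor

def ensemble_member(seed):
    """Stand-in for one micromagnetic run: a real member would launch
    the GPU solver with this seed and return its jitter estimate (nm)."""
    rng = random.Random(seed)
    return 2.0 + 0.1 * rng.gauss(0.0, 1.0)

def converge_jitter(n_members):
    """Run ensemble members in parallel and pool their statistics, the
    pattern used to converge jitter and ADC over many (~600) runs."""
    with ThreadPoolExecutor() as pool:
        values = list(pool.map(ensemble_member, range(n_members)))
    return statistics.mean(values), statistics.stdev(values)
```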

Noise Sources

(Figure: noise in ensemble members vs. the average readback signal; good vs. poor resolution.)

1. Transition position jitter
2. DCsnr: noise at the center of the bit
   – poor writing
   – erasure due to writing an adjacent track
3. Electronic noise
   – poor resolution
   – head and electronics noise (HEsnr)

Media Model: Discretization

The media model resolves grain boundaries (~1 nm).

Macro-spin model: each layer is a spin. Many opportunities for parallelization:
1. One-body terms (anisotropy)
2. Two-body terms (exchange in 3D)
3. Long-range magnetic interactions (CUDA FFT)
4. Stochastic (thermal) field (CUDA random numbers)

Magnetization propagates via the LLG + thermal-field equation:

dM/dt = -(γ/(1+α²)) M × H_eff - (γα/((1+α²) Ms)) M × (M × H_eff)

α = damping constant, γ = gyromagnetic ratio
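A minimal explicit-Euler sketch of the deterministic LLG step for a single unit macro-spin m = M/Ms (the Ms factor then drops out of the damping term). No thermal field is included, and the renormalization after each step is a common fixed-step convention, not necessarily the production integrator.

```python
import numpy as np

def llg_step(m, h_eff, dt, alpha, gamma=1.76e11):
    """One explicit-Euler step of the LLG equation above for a unit
    moment m:
      dm/dt = -(g/(1+a^2)) m x h - (g*a/(1+a^2)) m x (m x h)
    gamma in rad/(s*T), h_eff in tesla. Renormalizes m after the step."""
    pre = gamma / (1.0 + alpha ** 2)
    mxh = np.cross(m, h_eff)                       # precession direction
    dm = -pre * mxh - pre * alpha * np.cross(m, mxh)  # + damping toward h
    m_new = m + dt * dm
    return m_new / np.linalg.norm(m_new)           # keep |m| = 1
```

With a field along z and nonzero damping, repeated steps precess the moment while relaxing it toward the field direction, which is the qualitative behavior the media model propagates per cell.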
