What Does It Take to Accelerate SPICE on the GPU?

M. Naumov, F. Lannutti, S. Chetlur, L.S. Chien and P. Vandermersch

What is SPICE?

. Simulation Program with Integrated Circuit Emphasis
  - First version was developed by Laurence Nagel in 1973
  - http://en.wikipedia.org/wiki/SPICE

. There exist many variations (including, but not limited to)
  - Academic: ngspice, spice3 (UC Berkeley), XSPICE (Georgia Tech)
  - Industrial: HSPICE (Synopsys), PSpice (Cadence), Eldo (Mentor), EEsof (Agilent)

What does SPICE do?


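Circuit (diagram):

  [Figure: voltage source Vs (branch current Ixs) drives node 1; R1 connects nodes 1 and 2; R2 connects node 2 to ground; R3 connects nodes 2 and 3; R4 connects node 3 to ground]

Netlist (text file):

  name  node i  node j  value
  R1    1       2       1k
  R2    2       0       1k
  R3    2       3       0.4k
  R4    3       0       0.1k
  V1    1       0       PWL (0 0 1n 0 1.1n 5 2n 5)

Physics (Kirchhoff + Ohm + ...):

  Resistor Stamp (R between nodes i and j), sparse matrix entries:

             col Vi   col Vj
    row i     1/R     -1/R
    row j    -1/R      1/R

  Voltage Source Stamp (Vs between nodes i and j, branch current Ixs), sparse matrix entries and RHS:

             col Vi   col Vj   col Ixs  |  RHS
    row i                        1      |
    row j                       -1      |
    row xs     1       -1               |  Vs

Linear system (sparse):

             V1        V2                  V3             Ixs  |  RHS
    node 1    1/R1     -1/R1                               1   |
    node 2   -1/R1     (1/R1+1/R2+1/R3)    -1/R3               |
    node 3             -1/R3               (1/R3+1/R4)         |
    source    1                                                |  Vs

SPICE Details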

. Input
  - Parse netlist and set up internal data structures

. DC Analysis (Newton-Raphson iteration)
  - Device model evaluation
  - Linear system solution

. Transient Analysis (for each time step)
  - Newton-Raphson iteration
    . Device model evaluation
    . Linear system solution
  - Truncation error + time step correction
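The input and linear-solve phases can be sketched for the example netlist shown earlier. This is a minimal illustration under stated assumptions: the helper names (`parse_value`, `pwl`, `build_system`, `solve`) are hypothetical, only the `k` and `n` suffixes are handled, and dense Gaussian elimination stands in for the sparse solver. It is not how any particular SPICE implementation is organized.

```python
# Minimal MNA sketch: parse the example netlist, stamp it, solve at one time point.
# All helper names are illustrative assumptions, not SPICE internals.

NETLIST = """\
R1 1 2 1k
R2 2 0 1k
R3 2 3 0.4k
R4 3 0 0.1k
V1 1 0 PWL (0 0 1n 0 1.1n 5 2n 5)
"""

SUFFIX = {"k": 1e3, "n": 1e-9}  # only the suffixes used in this netlist


def parse_value(tok):
    """Convert a token such as '0.4k' or '1.1n' to a float."""
    if tok[-1] in SUFFIX:
        return float(tok[:-1]) * SUFFIX[tok[-1]]
    return float(tok)


def pwl(points, t):
    """Piecewise-linear source value at time t; points = [(t0, v0), (t1, v1), ...]."""
    if t <= points[0][0]:
        return points[0][1]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return points[-1][1]


def build_system(netlist, t):
    """Stamp every device into a dense matrix A and right-hand side f."""
    devices = [line.replace("(", " ").replace(")", " ").split()
               for line in netlist.splitlines() if line.strip()]
    n_nodes = max(int(n) for d in devices for n in d[1:3])
    n_src = sum(1 for d in devices if d[0][0] == "V")
    size = n_nodes + n_src
    A = [[0.0] * size for _ in range(size)]
    f = [0.0] * size
    xs = n_nodes  # index of the next source-current unknown
    for d in devices:
        i, j = int(d[1]) - 1, int(d[2]) - 1  # ground (node 0) becomes -1
        if d[0][0] == "R":  # resistor stamp
            g = 1.0 / parse_value(d[3])
            for r, c, v in ((i, i, g), (j, j, g), (i, j, -g), (j, i, -g)):
                if r >= 0 and c >= 0:
                    A[r][c] += v
        elif d[0][0] == "V":  # voltage source stamp
            vals = [parse_value(x) for x in d[4:]]  # d[3] is the 'PWL' keyword
            points = list(zip(vals[0::2], vals[1::2]))
            for node, sign in ((i, 1.0), (j, -1.0)):
                if node >= 0:
                    A[node][xs] += sign
                    A[xs][node] += sign
            f[xs] = pwl(points, t)
            xs += 1
    return A, f


def solve(A, f):
    """Dense Gaussian elimination with partial pivoting (stand-in for sparse LU)."""
    n = len(f)
    M = [row[:] + [rhs] for row, rhs in zip(A, f)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            m = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= m * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x


A, f = build_system(NETLIST, t=2e-9)
v1, v2, v3, i_xs = solve(A, f)
```

At t = 2 ns the source has reached 5 V, and the resistor network divides it to V2 = 1.25 V and V3 = 0.25 V. A transient analysis would repeat `build_system` and `solve` for every time step, which is why the linear solve matters so much for total runtime.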

. Device Model Evaluation
  - Takes between 30% and 60% of the simulation time


. Linear System Solution
  - Takes between 30% and 60% of the simulation time

Device Model Evaluation

. Basic models
  - Resistor, Capacitor, Inductor, Voltage and Current Source

. Transistor models (focus of this presentation)
  - MOSFET transistor (BSIM4v7, PSP, etc.)
  - Bipolar transistor (Ebers–Moll, Gummel-Poon, etc.)

. Other models
  - Diodes, etc.

Device Model Evaluation

. Key Idea (Transistor - BSIM4v7)
  - Many branches (nested if() ... else ...) are related to fixed parameters
    . Temperature
    . Operation Regime
  - Reorganize the code (slightly)
    . Minimize thread divergence
    . Maximize memory coalescing
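The reorganization can be illustrated with a toy example. It assumes, as the slide states, that the branch conditions depend only on per-instance parameters that stay fixed during the simulation. The two-variant formula below is made up for illustration and is not BSIM4v7; the field names `hot`, `mode` and `k` are hypothetical.

```python
# Toy illustration: hoist fixed-parameter branches out of the per-instance loop.
# The "model" is an invented two-variant formula, NOT BSIM4v7.
from collections import defaultdict


def eval_branchy(instances, vgs):
    """Naive form: every instance re-tests its fixed parameters inside the loop."""
    out = []
    for inst in instances:
        if inst["hot"]:                # fixed parameter: temperature bin
            k = inst["k"] * 0.8
        else:
            k = inst["k"]
        if inst["mode"] == "sat":      # fixed parameter: operation regime flag
            out.append(0.5 * k * vgs * vgs)
        else:
            out.append(k * vgs)
    return out


def group_instances(instances):
    """One-time setup: bucket instances by their fixed branch parameters and
    store 'k' contiguously per bucket (structure-of-arrays layout)."""
    groups = defaultdict(lambda: {"idx": [], "k": []})
    for n, inst in enumerate(instances):
        g = groups[(inst["hot"], inst["mode"])]
        g["idx"].append(n)
        g["k"].append(inst["k"])
    return groups


def eval_grouped(groups, n_instances, vgs):
    """Each bucket runs one straight-line formula over a contiguous array."""
    out = [0.0] * n_instances
    for (hot, mode), g in groups.items():
        scale = 0.8 if hot else 1.0    # branch decided once per bucket
        if mode == "sat":
            vals = [0.5 * (k * scale) * vgs * vgs for k in g["k"]]
        else:
            vals = [(k * scale) * vgs for k in g["k"]]
        for n, v in zip(g["idx"], vals):
            out[n] = v
    return out


instances = [{"k": 1e-3 * (n + 1), "hot": n % 2 == 0,
              "mode": "sat" if n % 3 == 0 else "lin"} for n in range(12)]
groups = group_instances(instances)    # done once, before the simulation loop
currents = eval_grouped(groups, len(instances), 0.7)
```

On a GPU, each bucket would correspond to a set of threads that all follow the same code path (no divergence) and read neighbouring array elements (coalesced access). The Python version only shows the data reorganization, not the performance effect.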

[Figure: threads T1, T2, T3, T4, ..., T10K, ..., T100K, ..., Tn mapped to BSIM4v7 instances]

Basic Device Model Evaluation

[Figure: speedup (y-axis, 0 to 40) vs. number of instances of models* (x-axis: 8192, 16384, 32768, 65536, 131072) for the Resistor, Capacitor and Inductor netlists]

1) Resistor Netlist: all resistors;
2) Capacitor Netlist: half capacitors and half resistors;
3) Inductor Netlist: half resistors, quarter capacitors and quarter inductors;

*NGSPICE
*NVIDIA C2070, ECC on
*Intel X5690 (Nehalem, 6 cores) @ 3.47GHz
Performance may vary based on OS version and motherboard configuration

Transistor (BSIM4v7) Device Model Evaluation

[Figure: BSIM4v7 evaluation time in ms (y-axis, 0 to 60) on the ISCAS85 Benchmark Suite, CPU (1 core) vs. GPU; annotated speedup: 6.67x]

*NGSPICE
*NVIDIA C2070, ECC on
*Intel X5690 (Nehalem, 6 cores) @ 3.47GHz
Performance may vary based on OS version and motherboard configuration

Solution of Linear Systems


. Solve a set of (sparse) linear systems

    A_i x_i = f_i   for i = 1,...,k

  where the coefficient matrices A_i have the same sparsity pattern

. Matrix properties
  - Nonsymmetric
  - Ill-conditioned

. Different methods
  - Direct Methods (LU factorization + triangular solve) (focus of this presentation)
  - Iterative Methods (GMRES, BiCGStab, etc.)

Sparse Direct Methods

. Original linear system
    A x = f
. Reordering (to minimize fill-in)
    (A Q) (Q^T x) = f             where Q^T Q = Q Q^T = I
. Pivoting
    (P^T A Q) (Q^T x) = P^T f     where P^T P = P P^T = I
. LU factorization
    (P^T A Q) = L U
. Forward and backward (triangular) solve
    L (U y) = b                   where y = Q^T x and b = P^T f
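The steps above can be traced on a small dense example. This is only a sketch: the permutations `p` and `q` are picked by hand here, whereas a real sparse solver chooses the column ordering to limit fill-in and the row ordering for numerical stability. The function names are hypothetical.

```python
# Dense stand-in for the reorder / pivot / factor / solve pipeline.
# p and q are hand-picked permutations; function names are illustrative.

def lu_nopivot(B):
    """Doolittle LU without pivoting: unit-lower L and upper U with B = L U."""
    n = len(B)
    L = [[float(r == c) for c in range(n)] for r in range(n)]
    U = [row[:] for row in B]
    for k in range(n):
        for r in range(k + 1, n):
            L[r][k] = U[r][k] / U[k][k]
            for c in range(k, n):
                U[r][c] -= L[r][k] * U[k][c]
    return L, U


def solve_direct(A, f, p, q):
    """Solve A x = f via (P^T A Q) = L U, L z = P^T f, U y = z, x = Q y.
    p[r] is the original row moved to position r; q[c] likewise for columns."""
    n = len(f)
    B = [[A[p[r]][q[c]] for c in range(n)] for r in range(n)]  # P^T A Q
    b = [f[p[r]] for r in range(n)]                            # P^T f
    L, U = lu_nopivot(B)
    z = [0.0] * n
    for r in range(n):                                         # forward solve
        z[r] = b[r] - sum(L[r][c] * z[c] for c in range(r))
    y = [0.0] * n
    for r in reversed(range(n)):                               # backward solve
        y[r] = (z[r] - sum(U[r][c] * y[c] for c in range(r + 1, n))) / U[r][r]
    x = [0.0] * n
    for c in range(n):                                         # x = Q y
        x[q[c]] = y[c]
    return x


# A has a zero in position (0, 0), so factoring it without a row permutation fails.
A = [[0.0, 2.0, 1.0],
     [1.0, 1.0, 0.0],
     [4.0, 0.0, 3.0]]
f = [7.0, 3.0, 13.0]          # chosen so that the exact solution is x = (1, 2, 3)
x = solve_direct(A, f, p=[1, 0, 2], q=[0, 2, 1])
```

With these permutations the pivots of P^T A Q are 1, 1 and -10, so the factorization runs without further row exchanges.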

Sparse Direct Methods

. Recall
  - Solving a set of linear systems
  - Coefficient matrices have the same sparsity pattern
. Assume
  - Reordering (to minimize fill-in) is the same
  - Pivoting is also constant
. LU factorization (i=1)
    (P^T A Q) = L U
. LU re-factorization (i=2,...,k) (focus of this presentation)
  - Sparsity (the required memory) of L and U is known ahead of time
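The re-factorization idea can be sketched as a split into two passes. A symbolic pass computes the pattern of L + U (including fill-in) once, and every later matrix with the same pattern is factored by a numeric sweep restricted to that pattern. The sketch assumes that no pivoting is needed and uses a dictionary keyed by (row, column); the function names are hypothetical, and this is not the GLU implementation.

```python
# Sketch of LU re-factorization on a fixed sparsity pattern (no pivoting).
# Sparse matrices are stored as {(row, col): value}; names are illustrative.

def to_sparse(D):
    """Dense list-of-lists to {(row, col): value}, dropping zeros."""
    return {(r, c): v for r, row in enumerate(D) for c, v in enumerate(row) if v != 0.0}


def symbolic_pattern(pattern, n):
    """Elimination on positions only: the pattern of L + U including fill-in."""
    filled = set(pattern) | {(k, k) for k in range(n)}
    for k in range(n):
        rows = [r for r in range(k + 1, n) if (r, k) in filled]
        cols = [c for c in range(k + 1, n) if (k, c) in filled]
        filled.update((r, c) for r in rows for c in cols)
    return filled


def refactor(A, filled, n):
    """Numeric sweep that only touches positions in 'filled'. Because 'filled'
    already contains every fill-in position, the result is the exact LU of A."""
    M = {pos: A.get(pos, 0.0) for pos in filled}   # zeroed L/U pattern plus A
    for k in range(n):
        for r in range(k + 1, n):
            if (r, k) not in M:
                continue
            M[r, k] /= M[k, k]
            for c in range(k + 1, n):
                if (k, c) in M and (r, c) in M:
                    M[r, c] -= M[r, k] * M[k, c]
    return M   # strictly lower part holds L (unit diagonal implied), the rest is U


def multiply_lu(M, n):
    """Dense product L U, used only to check a factorization."""
    def low(r, c):
        return 1.0 if r == c else (M.get((r, c), 0.0) if r > c else 0.0)

    def up(r, c):
        return M.get((r, c), 0.0) if r <= c else 0.0

    return [[sum(low(r, k) * up(k, c) for k in range(n)) for c in range(n)]
            for r in range(n)]


# Two matrices with the same sparsity pattern but different values.
A1_dense = [[4.0, 1.0, 0.0, 1.0],
            [1.0, 4.0, 0.0, 0.0],
            [0.0, 0.0, 4.0, 1.0],
            [1.0, 0.0, 1.0, 4.0]]
A2_dense = [[5.0, 2.0, 0.0, 1.0],
            [1.0, 6.0, 0.0, 0.0],
            [0.0, 0.0, 7.0, 2.0],
            [3.0, 0.0, 1.0, 8.0]]

filled = symbolic_pattern(to_sparse(A1_dense).keys(), 4)   # done once (i = 1)
factors = [refactor(to_sparse(D), filled, 4) for D in (A1_dense, A2_dense)]
```

For this pattern the symbolic pass adds two fill-in positions, (1, 3) and (3, 1), to the ten nonzeros of the input. Since the set of positions never changes, the memory for every later factorization can be allocated once up front.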

GLU: LU re-factorization on the GPU


. Key Idea
  Solving a set of systems A_i x_i = f_i (i=1,...,k)
  - LU factorization (i=1)
      A_1 = L_1 U_1
  - Incomplete-LU factorization (i=2,...,k) of
      M_i = L_1(zeroed) + U_1(zeroed) + A_i
    is equivalent to the LU factorization of A_i
  - Many parallel techniques are applicable

GLU: LU re-factorization on the GPU

. GLU
  - Developed in CUDA programming language for GPUs
  - Sparsity pattern of L and U known ahead of time
  - Memory requirements known ahead of time
. vs. KLU, which is
  - Designed specifically for circuit simulation
  - Gilbert-Peierls (single threaded)
. vs. PARDISO, which is
  - Supernodal method (multi-threaded)

Test matrices can be found at http://www.cise.ufl.edu/research/sparse/matrices/
Review of sparse direct solvers can be found at http://www.cise.ufl.edu/research/sparse/codes/

GLU Speedup (C2070)

[Figure: GLU speedup (y-axis, 0 to 4) vs. KLU (1 thread) and vs. PARDISO (6 threads) for the matrices rajat17, rajat23, trans4, G2_circuit, transient, ASIC_680ks, ASIC_680k, G3_circuit, Freescale1, circuit5M; labels on clipped bars: 14.3, 7.5, 7.0|25.2]

*NVIDIA C2070, ECC on
*Intel X5680 (Nehalem, 6 cores) @ 3.33GHz, MKL 10.3.6
Performance may vary based on OS version and motherboard configuration

GLU Speedup (K20x)

[Figure: GLU speedup (y-axis, 0 to 4) vs. KLU (1 thread) and vs. PARDISO (8 threads) for the matrices rajat17, rajat23, trans4, G2_circuit, transient, ASIC_680ks, ASIC_680k, G3_circuit, Freescale1, circuit5M; labels on clipped bars: 16.1, 8.6, 7.0|5.4]

* NVIDIA K20, ECC on
* Intel E5-2687w (Sandy Bridge, 8 cores) @ 3.1GHz, MKL 10.3.6
Performance may vary based on OS version and motherboard configuration


Average Speedup vs. KLU: 2x


Average Speedup vs. PARDISO: 2.5x

Conclusion

. SPICE simulation: the two most time-consuming parts
  - Device model evaluation
  - Solution of linear systems
. Device model evaluation
  - Speedup* of up to 6x
. Solution of linear systems
  - Speedup* (average) of 2x
. GPU (overall) acceleration
  - SPICE (overall expected) speedup of 2-3x
  - No slowdown: easy to test an iteration (and revert back if needed)

*: speedup is dependent on input parameters
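The expected overall speedup can be sanity-checked with an Amdahl-style estimate that combines the per-phase speedups with the share of runtime each phase takes. The runtime fractions below are illustrative picks from the 30%-60% ranges quoted earlier; they are assumptions, not measurements from the presentation.

```python
# Amdahl-style estimate of the overall speedup from per-phase speedups.
# The runtime fractions are assumed values, not measured data.

def overall_speedup(phases):
    """phases: list of (fraction_of_runtime, speedup); fractions must sum to 1."""
    assert abs(sum(frac for frac, _ in phases) - 1.0) < 1e-9
    return 1.0 / sum(frac / s for frac, s in phases)


# Device evaluation 50% at 6x, linear solve 50% at 2x.
balanced = overall_speedup([(0.5, 6.0), (0.5, 2.0)])

# Device evaluation 30% at 6x, linear solve 60% at 2x, 10% not accelerated.
solver_heavy = overall_speedup([(0.3, 6.0), (0.6, 2.0), (0.1, 1.0)])
```

The first split gives 1 / (0.5/6 + 0.5/2) = 3x and the second gives 1 / 0.45, roughly 2.2x. Both fall inside the 2-3x range stated in the conclusion.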

Thank you

Questions?