Trace-Penalty Minimization for Large-scale Eigenspace Computation
Yin Zhang
Department of Computational and Applied Mathematics, Rice University, Houston, Texas, USA
(Co-authors: Zaiwen Wen, Chao Yang and Xin Liu) [1]
CAAM VIGRE Seminar, January 31, 2013
[1] SJTU (Shanghai), LBL (Berkeley) and CAS (Beijing)

Yin Zhang (RICE), EIGPEN, February 2013

Outline
1. Introduction: Problem Description; Existing Methods
2. Motivation: Large Scale Redefined; Avoid the Bottleneck
3. Trace-Penalty Minimization: Basic Idea; Model Analysis; Algorithm Framework
4. Numerical Results and Conclusion: Numerical Experiments and Results; Concluding Remarks
Section I. Eigenvalue/vector Computation: Fundamental, yet still Challenging

Problem Description and Applications
Given a symmetric n × n real matrix A.

Eigenvalue Decomposition:

    AQ = QΛ    (1.1)

- Q ∈ R^{n×n} is orthogonal.
- Λ ∈ R^{n×n} is diagonal (with λ1 ≤ λ2 ≤ ... ≤ λn on the diagonal).

k-truncated decomposition (k largest/smallest eigenvalues):

    A Qk = Qk Λk    (1.2)

- Qk ∈ R^{n×k} with orthonormal columns; k ≪ n.
- Λk ∈ R^{k×k} is diagonal with the k largest/smallest eigenvalues.
Applications

- Basic problem in Numerical Linear Algebra
- Various scientific and engineering applications:
  - Lowest-energy states (Materials, Physics, Chemistry)
  - Density functional theory for electron structures
  - Nonlinear eigenvalue problems
- Singular Value Decomposition:
  - Data analysis, e.g., PCA
  - Ill-posed problems
  - Matrix rank minimization
- Increasingly large-scale sparse matrices
- Increasingly large portions of the spectrum
Some Existing Methods

Books and Surveys:
- Saad, 1992, "Numerical Methods for Large Eigenvalue Problems"
- Sorensen, 2002, "Numerical Methods for Large Eigenvalue Problems"
- Hernández et al., 2009, "A Survey of Software for Sparse Eigenvalue Problems"

Krylov Subspace Techniques:
- Arnoldi methods, Lanczos methods — ARPACK (eigs in Matlab)
  - Sorensen, 1996, "Implicitly Restarted Arnoldi/Lanczos Methods for ..."
- Krylov-Schur, ...

Optimization based, e.g., LOBPCG. Also: Subspace Iteration, Jacobi-Davidson, polynomial filtering, ...

- Keep orthogonality: XᵀX = I at each iteration
- Rayleigh-Ritz (RR): [V, D] = eig(XᵀAX); X = X*V;
Section II. Motivation: A Method for Larger Eigenspaces with Richer Parallelism

What is Large Scale?

Ordinarily Large Scale:
- A large and sparse matrix, say n = 1M
- A small number of eigen-pairs, say k = 100

Doubly Large Scale:
- A large and sparse matrix, say n = 1M
- A large number of eigen-pairs, say k = 1% * n
- A sequence of doubly large scale problems
Change of character as k jumps: for X ∈ R^{n×k}, the cost of RR/orth(X) grows relative to that of AX, and parallelism becomes a critical factor.

Low parallelism in RR/orth =⇒ opportunity for new methods?
Example: DFT, Materials Science

Kohn-Sham Total Energy Minimization:

    min E_total(X)  s.t.  XᵀX = I,    (2.1)

where, for ρ(X) := diag(XXᵀ),

    E_total(X) := tr( Xᵀ( ½L + V_ion )X ) + ½ ρᵀL†ρ + ρᵀ ε_xc(ρ) + E_rep.

Nonlinear eigenvalue problem: up to 10% smallest eigen-pairs.
A main approach: SCF — a sequence of linear eigenvalue problems.
Avoid the Bottleneck

Two types of computation: AX and RR/orth.
As k becomes large, AX is dominated by RR/orth — the bottleneck.

Parallelism:
- AX −→ Ax1 ∪ Ax2 ∪ ... ∪ Axk: higher.
- RR/orth contains sequentiality: lower.

Avoid the bottleneck? Do fewer RR/orth.
No free lunch? Do more BLAS3 (higher parallelism than AX).
Section III. Trace-Penalty Minimization: Free of Orthogonalization, BLAS3-Dominated Computation

Basic Idea

Trace Minimization:

    min_{X ∈ R^{n×k}} { tr(XᵀAX) : XᵀX = I }    (3.1)

Trace-Penalty Minimization:

    min_{X ∈ R^{n×k}} f(X) := ½ tr(XᵀAX) + (µ/4) ‖XᵀX − I‖²_F    (3.2)

It is well known that as µ → ∞, (3.2) =⇒ (3.1).

Quadratic penalty function (Courant, 1940s). This idea appears old and unsophisticated. However, ...
"Exact" Penalty

However, µ → ∞ is unnecessary.

Theorem (Equivalence in Eigenspace)
Problem (3.2) is equivalent to (3.1) if and only if

    µ > λk.    (3.3)

Under (3.3), all minimizers of (3.2) have the SVD form

    X̂ = Qk (I − Λk/µ)^{1/2} Vᵀ,    (3.4)

where Qk consists of k eigenvectors associated with a set of k smallest eigenvalues, which form the diagonal matrix Λk, and V ∈ R^{k×k} is any orthogonal matrix.
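As a sanity check, the closed form (3.4) can be probed numerically: for µ > λk, the matrix X̂ = Qk(I − Λk/µ)^{1/2}Vᵀ makes the penalty gradient ∇f(X̂) = AX̂ + µX̂(X̂ᵀX̂ − I) vanish. A small NumPy sketch; the random positive semidefinite test matrix and the sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5

# Symmetric positive semidefinite test matrix (illustrative assumption)
M = rng.standard_normal((n, n))
A = M.T @ M / n

lam, Q = np.linalg.eigh(A)               # eigenvalues in ascending order
Qk, Lk = Q[:, :k], np.diag(lam[:k])      # k smallest eigen-pairs
mu = lam[k - 1] + 1.0                    # mu > lambda_k

V, _ = np.linalg.qr(rng.standard_normal((k, k)))   # arbitrary orthogonal V
Xhat = Qk @ np.sqrt(np.eye(k) - Lk / mu) @ V.T     # minimizer form (3.4)

# Penalty gradient grad f(X) = AX + mu*X(X^T X - I) vanishes at Xhat
G = A @ Xhat + mu * Xhat @ (Xhat.T @ Xhat - np.eye(k))
print(np.linalg.norm(G))    # numerically zero
```

Note that X̂ᵀX̂ = V(I − Λk/µ)Vᵀ ≠ I: the penalty minimizer is deliberately not orthonormal, but it spans the same eigenspace.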
Fewer Saddle Points

Original model: min{ tr(XᵀAX) : XᵀX = I, X ∈ R^{n×k} }.
- One minimum/maximum subspace (discounting multiplicity).
- All other k-dimensional eigenspaces are saddle points.

However, for the penalty model:

Theorem
Let f(X) be the penalty function associated with parameter µ > 0.
1. For µ ∈ (λk, λn), f(X) has a unique minimum and no maximum.
2. For µ ∈ (λk, λk+p), where λk+p is the smallest eigenvalue > λk, a rank-k stationary point must be a minimizer, of the form defined in (3.4).

In a sense, the penalty model is much stronger.
Error Bounds between Optimality Conditions

First-order conditions:
- Penalty model: 0 = ∇f(X) := AX + µX(XᵀX − I);
- Original model: 0 = R(X) := AY(X) − Y(X)(Y(X)ᵀAY(X)), where Y(X) is an orthonormal basis of span{X}.

Lemma
Let ∇f(X) (with µ > λk) and R(X) be defined as above. Then

    ‖R(X)‖_F ≤ σmin(X)^{−1} ‖∇f(X)‖_F,    (3.5)

where σmin(X) is the smallest singular value of X. Moreover, for any global minimizer X̂ and any ε > 0, there exists δ > 0 such that whenever ‖X − X̂‖_F ≤ δ,

    ‖R(X)‖_F ≤ (1 + ε) / √(1 − λk/µ) · ‖∇f(X)‖_F.    (3.6)
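Bound (3.5) is easy to probe numerically at a generic full-rank iterate; the key observation is that X = Y(YᵀX) with ‖(YᵀX)^{-1}‖₂ = 1/σmin(X). The random test matrix and sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 4
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                         # symmetric test matrix

lam = np.linalg.eigvalsh(A)
mu = max(lam[k - 1], 0.0) + 1.0           # mu > lambda_k (and positive)

X = rng.standard_normal((n, k))           # generic full-rank iterate
G = A @ X + mu * X @ (X.T @ X - np.eye(k))    # grad f(X), penalty model
Y, _ = np.linalg.qr(X)                        # orthonormal basis Y(X)
R = A @ Y - Y @ (Y.T @ A @ Y)                 # residual of original model
smin = np.linalg.svd(X, compute_uv=False).min()

print(np.linalg.norm(R) <= np.linalg.norm(G) / smin)   # True
```

So a small penalty-gradient norm certifies a small residual for the original constrained problem, as long as X stays safely away from rank deficiency.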
Condition Number

Condition number of the Hessian at a solution:

    κ(∇²f(X̂)) := λmax(∇²f(X̂)) / λmin(∇²f(X̂))

— the determining factor for the asymptotic convergence rate of gradient methods.

Lemma
Let X̂ be a global minimizer of (3.2) with µ > λk. The condition number of the Hessian at X̂ satisfies

    κ(∇²f(X̂)) ≥ max( 2(µ − λ1), λn − λ1 ) / min( 2(µ − λk), λk+1 − λk ).    (3.7)

In particular, the above holds as an equality for k = 1.

Gradient methods may encounter slow convergence at the end.
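For k = 1, differentiating ∇f(x) = Ax + µx(xᵀx − 1) gives the Hessian ∇²f(x) = A + µ(xᵀx − 1)I + 2µxxᵀ, so the equality case of (3.7) can be checked directly. The diagonal test matrix and the value of µ below are illustrative assumptions:

```python
import numpy as np

A = np.diag([1.0, 2.0, 3.0, 5.0, 8.0])   # lambda_1, ..., lambda_n known exactly
lam = np.diag(A)
mu = 4.0                                  # lambda_1 < mu < lambda_n

# k = 1 minimizer from (3.4): xhat = q1 * sqrt(1 - lambda_1/mu)
xhat = np.zeros((5, 1))
xhat[0, 0] = np.sqrt(1 - lam[0] / mu)

# Hessian of f at xhat for k = 1
H = A + mu * (xhat.T @ xhat - 1) * np.eye(5) + 2 * mu * (xhat @ xhat.T)
ev = np.linalg.eigvalsh(H)
kappa = ev[-1] / ev[0]

rhs = max(2 * (mu - lam[0]), lam[-1] - lam[0]) / \
      min(2 * (mu - lam[0]), lam[1] - lam[0])
print(kappa, rhs)   # equal for k = 1; both are 7.0 in this example
```

The two branches in (3.7) come from the Hessian's spectrum at x̂, which here is {2(µ − λ1)} ∪ {λi − λ1 : i ≥ 2}.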
Generalizations

- Generalized eigenvalue problems: XᵀX = I → XᵀBX = I
- Keep out an undesired subspace: UᵀX = 0 (UᵀU = I)

Trace Minimization with Subspace Constraint:

    min_{X ∈ R^{n×k}} { tr(XᵀAX) : XᵀBX = I, UᵀX = 0 }

Trace-Penalty Formulation:

    min_X  ½ tr(XᵀQᵀAQX) + (µ/4) ‖XᵀQᵀBQX − I‖²_F

where Q = I − UUᵀ (so QX = X − U(UᵀX)). With a change of variables, all results still hold.
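The projector Q = I − UUᵀ never needs to be formed: applying it as QX = X − U(UᵀX) costs O(npk) for U ∈ R^{n×p}, versus O(n²k) with a dense Q. A minimal sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, p = 1000, 20, 5

U, _ = np.linalg.qr(rng.standard_normal((n, p)))   # U^T U = I
X = rng.standard_normal((n, k))

def apply_Q(U, X):
    """Apply Q = I - U U^T implicitly: QX = X - U (U^T X)."""
    return X - U @ (U.T @ X)

QX = apply_Q(U, X)
# Columns of QX are orthogonal to range(U): U^T (QX) = 0
print(np.linalg.norm(U.T @ QX))   # numerically zero
```

The same trick keeps every gradient evaluation of the constrained penalty model in BLAS3 territory.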
Algorithms for Trace-Penalty Minimization

Gradient methods:

    X ← X − α∇f(X),    ∇f(X) = AX + µX(XᵀX − I)

First-order condition:

    ∇f(X) = 0  ⇔  AX = µX(I − XᵀX)

Two types of computations for ∇f(X):
1. AX: O(k · nnz(A))
2. X(XᵀX): O(k²n) — BLAS3

(2) dominates (1) whenever k ≫ nnz(A)/n.
Gradient methods require NO RR/orth.
Gradient Method: Preserve Full Rank

Lemma
Let X^{j+1} be generated by

    X^{j+1} = X^j − α_j ∇f(X^j)

from a full-rank X^j. Then X^{j+1} is rank deficient only if 1/α_j is one of the k generalized eigenvalues of the problem

    [(X^j)ᵀ∇f(X^j)] u = λ [(X^j)ᵀ(X^j)] u.

On the other hand, if α_j < σmin(X^j)/‖∇f(X^j)‖₂, then X^{j+1} is full rank.

Combined with previous results, there is a high probability of getting a global minimizer by using gradient-type methods.
Gradient Methods (Cont'd)

    X^{j+1} = X^j − α_j ∇f(X^j)

Step size α:
- Non-monotone line search (Grippo 1986, Zhang-Hager 2004)
- Initial BB step:

    α_j = arg min_α ‖S^j − α Y^j‖²_F = tr((S^j)ᵀY^j) / ‖Y^j‖²_F

  where S^j = X^j − X^{j−1}, Y^j = ∇f(X^j) − ∇f(X^{j−1}).
- Many other choices
Current Algorithm

Framework:
1. Pre-process — scaling, shifting, preconditioning
2. Penalty parameter µ — dynamically adjusted
3. Gradient iterations — main operations: X(XᵀX) and AX
4. RR restart — computing Ritz pairs and restarting

(Further steps possible, but NOT used in the comparison:)
5. Deflation — working on desired subspaces only
6. Chebyshev filter — improving accuracy
Enhancement: RR Restarting

RR steps return Ritz pairs for given subspaces:
1. Orthogonalization: Q ∈ orth(X)
2. Eigenvalue decomposition: QᵀAQ = VΣVᵀ
3. Ritz pairs: QV and diag(Σ)

- RR steps ensure accurate termination
- RR steps can accelerate convergence
- Very few RR steps are used
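The three RR steps above map directly onto two dense kernels, a QR factorization and a small k × k eigen-decomposition; a minimal NumPy sketch:

```python
import numpy as np

def rayleigh_ritz(A, X):
    """Return Ritz pairs (vectors, values) of A over span(X)."""
    Q, _ = np.linalg.qr(X)           # 1. orthogonalization: Q in orth(X)
    H = Q.T @ A @ Q                  # project A onto the subspace
    sigma, V = np.linalg.eigh(H)     # 2. eigen-decomposition of Q^T A Q
    return Q @ V, sigma              # 3. Ritz pairs: QV and diag(Sigma)
```

If span(X) is exactly an invariant subspace, the Ritz pairs are exact eigenpairs; otherwise they are the best approximations available from that subspace.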
Section IV. Numerical Results and Conclusion

Pilot Tests in Matlab
Matrix: delsq(numgrid('S',102)); size: n = 10000; tol = 1e-3.

[Figure: CPU time in seconds vs. number of eigenvalues (50 to 500) for eigs, lobpcg, and eigpen: (a) with "-singleCompThread"; (b) without "-singleCompThread".]
Experiment Environment

Running platform:
- A single node of a Cray XE6 supercomputer (NERSC)
- Two 12-core AMD "MagnyCours" 2.1-GHz processors; 32 GB shared memory
- System and language: Cray Linux Environment version 3; Fortran + OpenMP
- All 24 cores are used unless otherwise specified

Solvers tested: ARPACK, LOBPCG, EIGPEN
Relative Error Measurements

Let x1, x2, ..., xk be the computed Ritz vectors, and θi the Ritz values.

- Eigenvectors:  res_i = ‖Axi − θi xi‖₂ / max(1, |θi|)
- Eigenvalues:   e_θ = max_i |θi − λi| / max(1, |λi|)
- Trace:         e_trace = |Σᵢ θi − Σᵢ λi| / max(1, |Σᵢ λi|)
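These three measures are one-liners once the Ritz pairs are in hand; a NumPy sketch (it assumes the reference eigenvalues λ are available, as they are for the test matrices with known or precomputed spectra):

```python
import numpy as np

def relative_errors(A, theta, Xritz, lam):
    """res_i, e_theta, e_trace for Ritz pairs (theta, Xritz) vs. true lam."""
    res = (np.linalg.norm(A @ Xritz - Xritz * theta, axis=0)
           / np.maximum(1.0, np.abs(theta)))
    e_theta = np.max(np.abs(theta - lam) / np.maximum(1.0, np.abs(lam)))
    e_trace = abs(theta.sum() - lam.sum()) / max(1.0, abs(lam.sum()))
    return res, e_theta, e_trace

# Exact eigenpairs give zero errors
A = np.diag([1.0, 2.0, 3.0, 4.0])
res, e_t, e_tr = relative_errors(A, np.array([1.0, 2.0]),
                                 np.eye(4)[:, :2], np.array([1.0, 2.0]))
print(res.max(), e_t, e_tr)   # 0.0 0.0 0.0
```

The max(1, ·) denominators switch between relative error (large eigenvalues) and absolute error (eigenvalues near zero).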
Test Matrices

UF Sparse Matrix Collection: PARSEC group (Y.K. Zhou et al.)
Matrix size n, sparsity nnz (#eigen-pairs nev for Test 1):

no.  Name            n       nnz      (nev)
1    Andrews         60000   410077   600
2    C60             17576   212390   *200*
3    c_65            48066   204247   500
4    cfd1            70656   948118   700
5    finance         74752   335872   700
6    Ga10As10H30     113081  3114357  1000
7    Ga3As3H12       61349   3016148  600
8    OPF3754         15435   82231    *200*
9    shallow_water1  81920   204800   800
10   Si10H16         17077   446500   *200*
11   Si5H12          19896   379247   *200*
12   SiO             33401   675528   400
13   wathen100       30401   251001   300
Numerical Results: Time

Table: A comparison of total wall clock time. "– –" indicates a "failure".

                    tol = 1e-2                 tol = 1e-4
Matrix            ARPACK  LOBPCG  EigPen     ARPACK  LOBPCG  EigPen
Andrews             2956     575     159       3344    1160     496
C60                   57      29      19       59**      52      44
c_65                 – –     331    3099        – –     – –   10030
cfd1                 – –     815     233        – –    2883    1547
finance            13940     968     472      19570    4629    1122
Ga10As10H30          – –    5390    1848        – –     – –    5531
Ga3As3H12           4871     771     563       6587     – –    1600
OPF3754               23       8      10       23**      28      17
shallow_water1      2528     642     215      18590    3849     951
Si10H16               73      77      24       78**     100      86
Si5H12               103      86      24      114**     114      38
SiO                  789     265      81        840    1534     287
wathen100            828     219      89        869    1103     219

"– –": abnormal termination or wall-clock time > 6 hours
"**": ARPACK time for problem with nev ≤ 200
CPU Time: EIGPEN vs. LOBPCG

[Figure: wall clock time (log scale, 10 to 10^4 seconds) for EigPen and LOBPCG at tol = 1e-2 and tol = 1e-4, per test matrix. LOBPCG reached the 6-hour limit on the last 3 problems.]
Numerical Results: Trace Error

Table: A comparison of e_trace among different solvers.

                    tol = 1e-2                       tol = 1e-4
Matrix            ARPACK    LOBPCG    EigPen       ARPACK    LOBPCG    EigPen
Andrews           1.51e-02  2.78e-04  2.31e-03     4.54e-06  4.58e-05  4.58e-05
C60               4.48e-06  1.78e-04  5.88e-04     4.48e-06  4.34e-05  4.34e-05
c_65              – –       2.43e-04  1.52e-03     – –       – –       4.79e-05
cfd1              – –       2.72e-03  6.33e-03     – –       5.37e-07  3.90e-06
finance           5.19e-02  1.28e-03  4.78e-03     7.59e-05  4.80e-05  4.80e-05
Ga10As10H30       – –       2.98e-04  3.83e-04     – –       – –       4.97e-05
Ga3As3H12         3.02e-02  1.70e-04  9.15e-04     5.50e-03  – –       4.69e-05
OPF3754           3.59e-06  6.74e-04  8.80e-04     3.59e-06  2.22e-05  2.22e-05
shallow_water1    3.96e-01  5.75e-03  5.80e-03     3.85e-04  8.64e-06  8.42e-06
Si10H16           2.83e-02  5.01e-05  9.77e-05     2.53e-02  4.33e-05  4.33e-05
Si5H12            5.52e-02  6.11e-05  3.35e-04     4.38e-06  3.86e-05  3.86e-05
SiO               2.40e-02  9.14e-05  1.42e-03     4.43e-06  4.81e-05  4.81e-05
wathen100         5.27e-03  5.74e-05  9.11e-04     3.93e-06  3.17e-05  3.17e-05
Numerical Results: Eigenvalue Error

Table: A comparison of max_i e(θi) among different solvers.

                    tol = 1e-2                       tol = 1e-4
Matrix            ARPACK    LOBPCG    EigPen       ARPACK    LOBPCG    EigPen
Andrews           1.01e-03  1.48e-05  1.06e-04     5.87e-09  1.07e-07  1.07e-07
C60               1.72e-07  4.20e-06  7.48e-06     1.72e-07  9.01e-08  9.01e-08
c_65              – –       5.73e-06  2.98e-05     – –       – –       8.20e-07
cfd1              – –       2.36e-01  5.37e-01     – –       1.43e-06  1.11e-05
finance           8.13e-03  1.18e-04  4.99e-04     2.35e-07  3.91e-07  3.91e-07
Ga10As10H30       – –       1.33e-05  1.50e-05     – –       – –       3.10e-07
Ga3As3H12         1.82e-03  8.57e-06  5.27e-05     6.38e-05  – –       1.01e-06
OPF3754           3.47e-09  4.87e-05  1.77e-05     3.47e-09  8.84e-08  8.84e-08
shallow_water1    1.30e-01  2.38e-03  1.77e-03     2.03e-05  1.70e-07  1.49e-07
Si10H16           2.97e-03  7.60e-06  1.97e-06     1.91e-03  3.16e-06  3.16e-06
Si5H12            1.58e-03  7.62e-06  1.88e-05     2.12e-07  5.50e-07  5.50e-07
SiO               7.61e-04  6.01e-06  3.89e-05     1.72e-07  1.29e-06  1.29e-06
wathen100         2.66e-04  1.18e-05  2.43e-05     6.94e-08  3.05e-07  3.05e-07
Numerical Results: Eigenvector Error

Table: A comparison of res_i among different solvers.

                    tol = 1e-2                       tol = 1e-4
Matrix            ARPACK    LOBPCG    EigPen       ARPACK    LOBPCG    EigPen
Andrews           2.14e+02  9.58e-03  9.30e-03     2.44e+01  9.20e-05  3.14e-05
C60               1.77e-04  9.83e-03  5.23e-03     8.67e-07  9.31e-05  7.59e-05
c_65              – –       9.60e-03  9.72e-03     – –       – –       7.22e-05
cfd1              – –       9.50e-03  6.22e-03     – –       9.80e-05  5.84e-05
finance           9.43e-03  9.94e-03  8.23e-03     5.86e-05  9.96e-05  7.29e-05
Ga10As10H30       – –       9.97e-03  5.99e-03     – –       – –       2.72e-05
Ga3As3H12         7.92e-03  8.61e-03  8.79e-03     7.43e-05  – –       3.73e-05
OPF3754           1.19e-04  9.21e-03  5.34e-03     8.21e-05  4.96e-05  7.55e-05
shallow_water1    1.00e-02  9.90e-03  6.70e-03     8.90e-05  9.61e-05  3.96e-05
Si10H16           2.44e-03  8.90e-03  3.44e-03     1.45e-05  9.01e-05  6.54e-05
Si5H12            1.99e-03  9.56e-03  6.51e-03     4.56e-05  9.78e-05  8.98e-05
SiO               1.37e-03  1.00e-02  9.11e-03     3.28e-06  9.46e-05  1.03e-05
wathen100         9.96e-03  9.60e-03  6.22e-03     1.13e-05  7.72e-05  9.95e-05
Time vs. Eigenspace Dimension

[Figure: wall clock time vs. nev = 500, ..., 3000 (about 5% of eigen-pairs) for EigPen and LOBPCG: (c) Andrews (tol = 1e-2); (d) Ga3As3H12 (tol = 1e-2).]
4 Types of Computation Times (%)

Andrews: n = 60000, nev = 500, ..., 3000.

[Figure: percentage of wall clock time spent in SpMV, BLAS3, RR, and DLACPY vs. nev: (a) EigPen, (b) LOBPCG.]

SpMV: sparse matrix-vector. BLAS3: matrix-matrix. RR: Rayleigh-Ritz. DLACPY: matrix copying.
Parallel Speedup Factors

Recall: a single node of a Cray XE6 with 24 computing cores (two 12-core AMD "MagnyCours" 2.1-GHz processors).

Experiment setup: the speedup factor for running on p cores is

    Speedup-Factor(p) = (run time using 1 core) / (run time using p cores)

We run the 2 solvers for p = 2, 4, 8, 16, 24.
Parallel Speedup Factors: Ga3As3H12 (n = 61349, nev = 1500)

[Figure: parallel speedup factors for Total, SpMV, BLAS3, RR, and DLACPY vs. number of cores (2, 4, 8, 16, 24): (c) EigPen, (d) LOBPCG. Total time is in red.]
Preconditioning: Proof of Concept in Matlab

Preconditioned gradient method with preconditioner M ∈ R^{n×n}:

    X^{j+1} = X^j − α_j M^{−1} ∇fµ(X^j).

Matrix: c_65, n = 48066, nev = 500. M = LLᵀ, L = ichol(A).
[Figure: gradient norm ‖∇fµ(X^j)‖_F vs. iteration: (e) without preconditioning (Iter > 1200); (f) with preconditioning (Iter < 300).]
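The same experiment shape can be sketched in Python, with SciPy's incomplete-LU factorization standing in for Matlab's ichol (an assumption: `scipy.sparse.linalg.spilu` builds an incomplete LU rather than an incomplete Cholesky, but plays the same role of a cheap approximate inverse). The 1-D Laplacian test matrix and sizes are illustrative, not the c_65 problem:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spilu

n, k, mu = 200, 5, 1.0                  # illustrative sizes; mu > lambda_k here

# 1-D Laplacian: SPD, tridiagonal (illustrative stand-in for c_65)
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")

ilu = spilu(A)                          # M ~ A via incomplete LU (ichol stand-in)
rng = np.random.default_rng(4)
X = 0.1 * rng.standard_normal((n, k))

def grad_f(X):
    return A @ X + mu * X @ (X.T @ X - np.eye(k))

# One preconditioned gradient step: X <- X - alpha * M^{-1} grad
G = grad_f(X)
D = ilu.solve(G)                        # apply M^{-1} to the whole block
X_new = X - 0.1 * D

# M^{-1} is (approximately) SPD here, so D is a descent direction
print(np.sum(G * D) > 0)               # True
```

For this tridiagonal matrix the incomplete factorization incurs no fill, so `ilu.solve` is essentially an exact solve; on harder matrices the drop tolerance trades accuracy for cost.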
Concluding Remarks

Why?
- Eigenspace dimension tips the balance of computation
- Parallelism demands different thinking

How?
- Trace-penalty model: the penalty function is "exact"
- The model yields fewer, or even no, saddle points
- Orthogonalization and RR can be greatly reduced

What?
- Efficient for moderate accuracy; numerically stable
- Preconditioning can be effectively applied
- BLAS3-rich; parallel scalability appears promising

Future: enhancements, refinements, extensions.
Thank you for your attention!