<<

Dynamical approach to random theory

L´aszl´oErd˝os,∗ Horng-Tzer Yau†

May 9, 2017

∗Partially supported by ERC Advanced Grant, RANMAT 338804 †Partially supported by the NSF grant DMS-1307444 and a Simons Investigator Award

1 AMS Subject Classification (2010): 15B52, 82B44 Keywords: Random matrix, local semicircle law, Dyson sine kernel, Wigner-Dyson-Mehta conjecture, Tracy-Widom distribution, Dyson Brownian motion.

2 Preface This book is a concise and self-contained introduction of the recent techniques to prove local spectral universality for large random matrices. Random matrix theory is a fast expanding research area and this book mainly focuses on the methods we participated in developing over the past few years. Many other interesting topics are not included, nor are several new developments within the framework of these methods. We have chosen instead to present key concepts that we believe are the core of these methods and should be relevant for future applications. We keep technicalities to a minimum to make the book accessible to graduate students. With this in mind, we include in this book the basic notions and tools for high dimensional analysis such as large deviation, entropy, Dirichlet form and logarithmic Sobolev inequality. The material in this book originates from our joint works with a group of collaborators in the past several years. Not only were the main mathematical results in this book taken from these works, but the presentation of many sections followed the routes laid out in these papers. In alphabetical order, these coauthors were Paul Bourgade, Antti Knowles, Sandrine P´ech´e,Jose Ram´ırez, Benjamin Schlein and Jun Yin. We would like to thank all of them. This manuscript was developed and continuously improved over the last five years. We have taught this material in several regular graduate courses at Harvard, Munich and Vienna, in addition to various summer schools and short courses. We are thankful for the generous support of the Institute for Advanced Studies, Princeton, where part of this manuscript was written during the special year devoted to random matrices in 2013-2014. L.E. also thanks Harvard University for the continuous support during his numerous visits. L.E. was partially supported by the SFB TR 12 grant of the German Science Foundation and the ERC Advanced Grant, RANMAT 338804 of the European Research Council. H.-T. Y. would like to thank the National Center for the Theoretic Sciences at the National Taiwan University, where part of the manuscript was written, for the hospitality and support for his several visits. H.-T. Y. gratefully acknowledges the support from NSF DMS-1307444 and a Simons Investigator award. Finally, we are grateful to the editorial support from the publisher, to Amol Aggarwal, Johannes Alt, Patrick Lopatto for careful reading of the manuscript and to Alex Gontar for his help in composing the bibliography.

3 Contents

1 Introduction 6

2 Wigner matrices and their generalizations 10

3 Eigenvalue density 11 3.1 Wigner semicircle law and other canonical densities ...... 11 3.2 The moment method ...... 12 3.3 The resolvent method and the Stieltjes transform ...... 14

4 Invariant ensembles 16 4.1 Joint density of eigenvalues for invariant ensembles ...... 16 4.2 Universality of classical invariant ensembles via ...... 18

5 Universality for generalized Wigner matrices 24 5.1 Different notions of universality ...... 24 5.2 The three-step strategy ...... 25

6 Local semicircle law for universal Wigner matrices 27 6.1 Setup ...... 27 6.2 Spectral information on S ...... 29 6.3 Stochastic domination ...... 30 6.4 Statement of the local semicircle law ...... 31 6.5 Appendix: Behaviour of Γ and Γe and the proof of Lemma 6.3 ...... 33

7 Weak local semicircle law 36 7.1 Proof of the weak local semicircle law, Theorem 7.1 ...... 36 7.2 Large deviation estimates ...... 45

8 Proof of the local semicircle law 48 8.1 Tools...... 48 8.2 Self-consistent equations on two levels ...... 50 8.3 Proof of the local semicircle law without using the spectral gap ...... 52

9 Sketch of the proof of the local semicircle law using the spectral gap 62

10 Fluctuation averaging mechanism 65 10.1 Intuition behind the fluctuation averaging ...... 65 10.2 Proof of Lemma 8.9 ...... 66 10.3 Alternative proof of (8.47) of Lemma 8.9 ...... 73

11 Eigenvalue location: the rigidity phenomenon 76 11.1 Extreme eigenvalues ...... 76 11.2 Stieltjes transform and regularized counting function ...... 76 11.3 Convergence speed of the empirical distribution function ...... 79 11.4 Rigidity of eigenvalues ...... 80

12 Universality for matrices with Gaussian convolutions 82 12.1 Dyson Brownian motion ...... 82 12.2 Derivation of Dyson Brownian motion and perturbation theory ...... 84 12.3 Strong local ergodicity of the Dyson Brownian motion ...... 85 12.4 Existence and restriction of the dynamics ...... 88

4 13 Entropy and the logarithmic Sobolev inequality (LSI) 92 13.1 Basic properties of the entropy ...... 92 13.2 Entropy on product spaces and conditioning ...... 94 13.3 Logarithmic Sobolev inequality ...... 96 13.4 Hypercontractivity ...... 102 13.5 Brascamp-Lieb inequality ...... 104 13.6 Remarks on the applications of the LSI to random matrices ...... 105 13.7 Extensions to the simplex, regularization of the DBM ...... 108

14 Universality of the Dyson Brownian Motion 113 14.1 Main ideas behind the proof of Theorem 14.1 ...... 114 14.2 Proof of Theorem 14.1 ...... 115 14.3 Restriction to the simplex via regularization ...... 120 14.4 Dirichlet form inequality for any β > 0...... 121 14.5 From gap distribution to correlation functions: proof of Theorem 12.4...... 122 14.6 Details of the proof of Lemma 14.8 ...... 123

15 Continuity of local correlation functions under matrix OU process 127 15.1 Proof of Theorem 15.2 ...... 129 15.2 Proof of the correlation function comparison Theorem 15.3 ...... 131

16 Universality of Wigner matrices in small energy window: GFT 136 16.1 The Green function comparison theorems ...... 136 16.2 Conclusion of the three step strategy ...... 139

17 Edge Universality 142

18 Further results and historical notes 150 18.1 Wigner matrices: bulk universality in different senses ...... 150 18.2 Historical overview of the three-step strategy for bulk universality ...... 151 18.3 Invariant ensembles and log-gases ...... 153 18.4 Universality at the spectral edge ...... 155 18.5 Eigenvectors ...... 155 18.6 General mean field models ...... 157 18.7 Beyond mean field models: band matrices ...... 158

5 1 Introduction “Perhaps I am now too courageous when I try to guess the distribution of the distances between successive levels (of energies of heavy nuclei). Theoretically, the situation is quite simple if one attacks the problem in a simpleminded fashion. The question is simply what are the distances of the characteristic values of a with random coefficients.” on the Wigner surmise, 1956

Random matrices appeared in the literature as early as 1928, when Wishart [138] used them in . The natural question regarding their eigenvalue statistics, however, was not raised until the pioneering work [137] of Eugene Wigner in the 1950s. Wigner’s original motivation came from nuclear physics when he noticed from experimental data that gaps in energy levels of large nuclei tend to follow the same statistics irrespective of the material. Quantum mechanics predicts that energy levels are eigenvalues of a self-adjoint operator, but the correct Hamiltonian operator describing nuclear forces was not known at that time. In addition, the computation of the energy levels of large quantum systems would have been impossible even with the full Hamiltonian explicitly given. Instead of pursuing a direct solution of this problem, Wigner appealed to a phenomenological model to explain his observation. Wigner’s pioneering idea was to model the complex Hamiltonian by a random matrix with independent entries. All physical details of the system were ignored except one, the symmetry type: systems with time reversal symmetry were modeled by real symmetric random matrices, while complex Hermitian random matrices were used for systems without time reversal symmetry (e.g. with magnetic forces). This simple-minded model amazingly reproduced the correct gap statistics, indicating a profound universality principle working in the background. Notwithstanding their physical significance, random matrices are also very natural mathematical objects and their studies could have been initiated by mathematicians driven by pure curiosity. A large number of random numbers and vectors have been known to exhibit universal patterns; the obvious examples are the law of large numbers and the central limit theorem. What are their analogues in the non-commutative setting, e.g. for matrices? Focusing on the spectrum, what do eigenvalues of typical large random matrices look like? As the first result of this type, Wigner proved a type of law of large numbers for the density of eigenvalues, which we now explain. The (real or complex) Wigner ensembles consist of N × N self-adjoint matrices H = (hij) with matrix elements having mean zero and variance 1/N that are independent up to the symmetry constraint hij = hji. The Wigner semicircle law states that the empirical density of the eigenvalues of H is 1 p 2 given by the semicircle law, %sc(x) = 2π (4 − x )+, as N → ∞, independent of the details of the distribution of hij. On the scale of individual eigenvalues, Wigner predicted that the fluctuations of the gaps are universal and their distribution is given by a new law, the Wigner surmise. This might be viewed as the random matrix analogue of the central limit theorem. After Wigner’s discovery, Dyson, Gaudin and Mehta achieved several fundamental mathematical results, in particular they were able to compute the gap distribution and the local correlation functions of the eigenvalues for Gaussian ensembles. They are called Gaussian Orthogonal Ensemble (GOE) and Gaussian Unitary Ensemble (GUE) corresponding to the two most important symmetry classes, the real symmetric and complex Hermitian matrices. The Wigner surmise turned out to be slightly wrong and the correct law is given by the Gaudin distribution. Dyson and Mehta gradually formulated what is nowadays known as the Wigner-Dyson-Mehta (WDM) universality conjecture. As presented in the classical treatise of Mehta [105], this conjecture asserts that the local eigenvalue statistics for large random matrices with independent entries are universal, i.e., they do not depend on the particular distribution of the matrix elements. In particular, they coincide with those in the Gaussian case that were computed explicitly. On a more philosophic level, we can recast Wigner’s vision as the hypothesis that the eigenvalue gap distributions for large complicated quantum systems are universal in the sense that they depend only on the symmetry class of the physical system but not on other detailed structures. Therefore, the Wigner-Dyson- Mehta universality conjecture is merely a test of Wigner’s hypothesis for a special class of matrix models, the Wigner ensembles, which are characterized by the independence of their matrix elements. The other large

6 class of matrix models is the invariant ensembles. They are defined by a Radon-Nikodym density of the form exp(− Tr V (H)) with respect to the flat Lebesgue measure on the space of real symmetric or complex Hermitian matrices H. Here V is a real valued function called the potential of the invariant ensemble. These distributions are invariant under orthogonal or unitary conjugation, but the matrix elements are not independent except for the Gaussian case. The universality conjecture for invariant ensembles asserts that the local eigenvalue gap statistics are independent of the function V . In contrast to the Wigner case, substantial progress was made for invariant ensembles in the last two decades. The key element of this success was that invariant ensembles, unlike Wigner matrices, have explicit formulas for the joint densities of the eigenvalues. These formulas express the eigenvalue correlation functions as determinants whose entries are given by functions of orthogonal polynomials. The limiting local eigenvalue statistics are thus determined by the asymptotic behavior of these formulas and in particular those of the orthogonal polynomials as the size of the matrix tends to infinity. A key important tool, the Riemann-Hilbert method [71], was brought into this subject by Fokas, Its and Kitaev, and the universality of eigenvalue statistics was established for large classes of invariant ensembles by Bleher-Its [20] and by Deift and collaborators [37]– [41]. Behind the spectacular progress in understanding the local statistics of invariant ensembles, the corner- stone of this approach lies in the fact that there are explicit formulas to represent eigenvalue correlation functions by orthogonal polynomials – a key observation made in the original work of Gaudin and Mehta for Gaussian random matrices. For Wigner ensembles, there are no explicit formulas for the eigenvalue statistics and the WDM conjecture was open for almost fifty years with virtually no progress. The first significant advance in this direction was made by Johansson [85], who proved the universality for complex Hermitian matrices under the assumption that the common distribution of the matrix entries has a substan- G tial Gaussian component, i.e., the random matrix H is of the form H = H0 + aH where H0 is a general Wigner matrix, HG is the GUE matrix, and a is a certain, not too small, positive constant independent of N. His proof relied on an explicit formula by Br´ezinand Hikami [30, 31] that uses a certain version of the Harish-Chandra-Itzykson-Zuber formula [84]. These formulas are available for the complex Hermitian case only, which restricted the method to this symmetry class. If local spectral universality is so ubiquitous as Wigner, Dyson and Mehta conjectured, there must be an underlying mechanism driving local statistics to their universal fixed point. In hindsight, the existence of such a mechanism is almost a synonym of “universality”. However, up to ten years ago there was no clear indication whether a solution of the WDM conjecture would rely on some yet to be discovered formulas or would come from some other deep insight. The goal of this book is to give a self-contained introduction to a new approach to local spectral uni- versality which, in particular, resolves the WDM conjecture. To keep technicalities to a minimum, we will consider the simplest class of matrices that we call generalized Wigner matrices (Definition 2.1) and we will focus on proving universality in a simpler form, in the averaged energy sense (see Section 5.1) and the necessary prerequisites. We stress that these are not the strongest results that are currently available, but they are good representatives of the general ideas we have developed in the past several years. We believe that these techniques are sufficiently mature to be presented in a book format. This approach consists of the following three steps: Step 1. Local semicircle law: It provides an a priori estimate showing that the density of eigenvalues of generalized Wigner matrices is given by the semicircle law at very small microscopic scales, i.e., down to spectral intervals that contain N ε eigenvalues. Step 2. Universality for Gaussian divisible ensembles: It proves that the local statistics of Gaussian divisible G G −1/2+ε ensembles H0 + aH are the same as those of the Gaussian ensembles H as long as a > N , i.e., already for very small a. Step 3. Approximation by a Gaussian divisible ensemble: It is a type of “density argument” that extends the local spectral universality from Gaussian divisible ensembles to all Wigner ensembles.

The conceptually novel point of our method is√ Step 2. The eigenvalue distributions of the Gaussian −t/2 −t G divisible ensembles, written in the form e H0 + 1 − e H , are the same as that of the solution of a matrix valued Ornstein-Uhlenbeck (OU) process Ht for any time t > 0. Dyson [45] observed half a century

7 ago that the dynamics of the eigenvalues of Ht is given by an interacting stochastic particle system, called the Dyson Brownian motion (DBM). In addition, the invariant measure of this dynamics is exactly the eigenvalue distribution of GOE or GUE. This invariant measure is also a Gibbs measure of point particles in one dimension interacting via a long range logarithmic potential. Using a heuristic physical argument, Dyson −1 remarked [45] that the DBM reaches its “local equilibrium” on a short time scale t & N . We will refer to this as Dyson’s conjecture, although it was rather an intuitive physical picture than an exact mathematical statement. Since Dyson’s work in the sixties, there was virtually no progress in proving this conjecture. Besides the limit of available mathematical tools, one main factor is the vague formulation of the conjecture involving the notion of “local equilibrium”, which even nowadays is not well-defined for a Gibbs measure with a general long range interaction. Furthermore, a possible connection between Dyson’s conjecture and the solution of the WDM conjecture has never been elucidated in the literature. In fact, “relaxation to local equilibrium” in a time scale t refers to the phenomenon that after time t the dynamics has changed the system, at least locally, from its initial state to a local equilibrium. It therefore appears counterintuitive that one may learn anything useful in this way about the WDM conjecture, since the latter concerns the initial state. The key point is that by applying local relaxation to all initial states (within a reasonable class) simultaneously, Step 2 generates a large set of random matrix ensembles for which universality holds. We prove that, for the purpose of universality, this set is sufficiently dense√ so that any −t/2 −t G Wigner matrix H is sufficiently close to a Gaussian divisible ensemble of the form e H0 + 1 − e H with a suitably chosen H0. Originally, our motivation was driven not by Dyson’s conjecture, but by the desire to prove the universality for Gaussian divisible ensembles with the Gaussian component as small as possible. As it turned out, our method used in Step 2 actually proved a precisely formulated version of Dyson’s conjecture. The applicability of our approach actually goes well beyond the original scope; in the last years the method has been extended to invariant ensembles, sample covariance matrices, adjacency matrices of sparse random graphs and certain band matrices. At the end of the book we will give a short overview of these applications. In the following, we outline the main contents of the book. In Sections 2–4 we start with an elementary recapitulation of some well known concepts, such as the moment method, the orthogonal polynomial method, the emergence of the sine kernel, the Tracy-Widom distribution and the invariant ensembles. These sections are included only to provide backgrounds for the topics discussed in this book; details are often omitted, since these issues are discussed extensively in other books. The core material starts from Section 5 with a precise definition of different concepts of universality and with the formulation of a representative theorem (Theorem 5.1) that we eventually prove in this book. We also give here a detailed overview of the three step strategy. The first step, the local version of Wigner’s semicircle law is given in Section 6. For pedagogical reasons, we give the proof on several levels. A weaker version is given in Section 7 whose proof is easier; an ambitious reader may skip this section. The proof of the stronger version is found in Section 8 which gives the optimal result in the bulk of the spectrum. To obtain the optimal result at the edge, an additional idea is needed; this is sketched in Section 9, but we do not give all the technical details. This section may be skipped at first reading. A key ingredient of the proofs, the fluctuation averaging mechanism, is presented separately in Section 10. Important consequences of the local law, such as rigidity, speed of convergence and eigenvector delocalization, are given in Section 11. The second step, the analysis of the Dyson Brownian motion and the proof of Dyson’s conjecture, is presented in Section 12 and Section 14. Before the proof in Section 14, we added a separate Section 13 devoted to general tools of very large dimensional analysis, where we introduce the concepts of entropy, Dirichlet form, Bakry-Emery´ method, and logarithmic Sobolev inequality. Some concepts (e.g. the Brascamp-Lieb inequality and hypercontractivity) are not used in this book but we included them since they are useful and related topics. The third step, the perturbative argument to compare arbitrary Wigner matrices with Gaussian divisible matrices, is given in two versions. A shorter but somewhat weaker version, the continuity of the correlation functions, is found in Section 15. A technically more involved version, the Green function comparison theorem is presented in Section 16. This completes the proof of our main result.

8 Separately, in Section 17 we prove the universality at the edge; the main input is the strong form of the local semicircle law at the edge. Finally, we give a historical overview, discuss newer results, and provide some outlook on related models in Section 18. This book is not a comprehensive monograph on random matrices. Although we summarize a few general concepts at the beginning, we mostly focus on a concrete strategy and the necessary technical steps behind it. For readers interested in other aspects, in addition to the classical book of Mehta [105], several excellent works are available that present random matrices in a broader scope. The books by Anderson, Guionnet and Zeitouni [8] and Pastur and Shcherbina [112] contain extensive material starting from the basics. Tao’s book [127] provides a different aspect to this subject and is self-contained as a graduate textbook. Forrester’s monograph [72] is a handbook for any explicit formulas related to random matrices. Finally, [6] is an excellent comprehensive overview of diverse applications of random matrix theory in mathematics, physics, neural networks and engineering. Finally, we list a few notational conventions. In order to focus on the essentials, we will not follow the dependencies of various constants on different parameters. In particular, we will use the generic letters C and c to denote positive constants, whose values may change from line to line and which may depend on some fixed basic parameters of the model. For two positive quantities A and B, we will write A . B to indicate that there exists a constant C such that A 6 CB. If A and B are comparable in the sense that A . B and B . A, then we write A  B. We introduce the notation A, B := Z ∩ [A, B] for the set of integers between any two real numbers A < B. J K

9 2 Wigner matrices and their generalizations

Consider N × N square matrices of the form   h11 h12 . . . h1N  h21 h22 . . . h2N  H = H(N) =   (2.1)  . . .   . . .  hN1 hN2 . . . hNN

with centered entries E hij = 0, i, j = 1, 2,...,N. (2.2)

The random variables hij, i, j = 1,...,N are real or complex random variables subject to the symmetry ¯ (N) constraint hij = hji so that H is either Hermitian (complex) or symmetric (real). Let S = S denote the matrix of variances   s11 s12 . . . s1N  s21 s22 . . . s2N  S :=   , s := |h |2. (2.3)  . . .  ij E ij  . . .  sN1 sN2 . . . sNN We assume that N X sij = 1, for every i = 1, 2,...,N, (2.4) j=1

i.e., the deterministic N × N matrix of variances, S = (sij), is symmetric and doubly stochastic. The size of the random matrix, N, is the fundamental limiting parameter throughout this book. Most results become sharp in the large N limit. Many quantities, such as the distribution of H, the matrix S, naturally depend on N, but for notational simplicity we will often omit this dependence from the notation. The simplest case, when 1 s = i, j = 1, 2,...,N, ij N is called the Wigner matrix ensemble. We summarize this setup in the following definition:

Definition 2.1. An N × N symmetric or Hermitian random matrix (2.1) is called a universal Wigner matrix 2 (ensemble) if the entries are centered (2.2), their variances sij = E|hij| satisfy X sij = 1, i = 1, 2,...,N, (2.5) j and {hij : i 6 j} are independent. An important subclass of the universal Wigner ensembles is called generalized Wigner matrices (ensembles) if, additionally, the variances are comparable, i.e.,

0 < Cinf 6 Nsij 6 Csup < ∞, i, j = 1, 2,...,N, (2.6)

holds with some fixed positive constants Cinf , Csup. For Hermitian ensembles, we additionally require that for each i, j the 2 × 2 is bounded by C/N in matrix sense, i.e.,

 2  E( . E(

1 In the special case sij = 1/N and Σij = 2N I2×2 we recover the original definition of the Wigner matrices or Wigner ensemble [137].

10 In this book we will focus on the generalized Wigner matrices. These are mean-field models in the sense that the typical size of the matrix elements hij are comparable, see (2.6). In Section 18.7 we will discuss models beyond the mean field regime. The most prominent Wigner ensembles are the Gaussian Orthogonal Ensemble (GOE) and the Gaussian Unitary Ensemble√ (GUE); i.e., real symmetric and complex Hermitian Wigner matrices with rescaled√ matrix elements Nhij being standard Gaussian variables.√ More precisely, in the Hermitian case, Nhij is a 2 standard complex Gaussian variable, i.e., E| Nhij| =√ 1 with the real and imaginary part independent having the same variance when i 6= j. Furthermore, Nh√ ii is a real standard√ Gaussian variable with 2 2 variance 1. For the real symmetric case, we require that E| Nhij| = 1 = E| Nhii| /2 for any i 6= j. For simplicity of the presentation, in the case of the Wigner ensembles, we will assume that not only the variances of hij, i < j are identical, but they are identically√ distributed. In this case we fix a distribution ν and we assume that the rescaled matrix elements Nhij are distributed according to ν. Depending on the symmetry type, the diagonal elements may have a slightly different distribution, but we will omit this subtlety from the discussion. The distribution ν will be called the single entry distribution of H. In the case −1/2 of generalized Wigner matrices, we assume that the rescaled matrix elements sij hij, with unit variance, all have law ν. We will need some decay property of ν, and we will consider two decay types. Either we assume that ν has subexponential decay, i.e., that there are constants C, ϑ > 0 such that for any s > 0 Z  ϑ 1 |x| > s dν(x) 6 C exp −s ; (2.7) R or we assume that ν has a polynomial decay with arbitrary high power, i.e., Z  −p 1 |x| > s dν(x) 6 Cp(1 + s) , ∀ s > 0, (2.8) R for any p ∈ N. Another class of random matrices, which even predate Wigner, are the random sample covariance ma- trices. These are matrices of the form H = X∗X, (2.9)

2 −1 where X is a rectangular M × N matrix with centered i.i.d. entries with variance E|Xij| = M . Note that the matrix elements of H are not independent, but they are generated from the independent matrix elements of X in a straightforward way. These matrices appear in statistical samples and were first considered by Wishart [138]. In the case when Xij are centered Gaussian, the random covariance matrices are called Wishart matrices or ensemble.

3 Eigenvalue density 3.1 Wigner semicircle law and other canonical densities

For real symmetric or complex Hermitian Wigner matrix H, let λ1 6 λ2 6 ... 6 λN denote the eigenvalues. The empirical distribution of eigenvalues follows a universal pattern, the Wigner semicircle law. To formulate it more precisely, note that the typical spacing between neighboring eigenvalues is of order 1/N, so in a fixed interval [a, b] ⊂ R, one expects macroscopically many (of order N) eigenvalues. More precisely, it can be shown (first proof was given by Wigner [137]) that for any fixed a 6 b real numbers, Z b 1  1 p 2 lim # i : λi ∈ [a, b] = %sc(x)dx, %sc(x) := (4 − x )+, (3.1) N→∞ N a 2π

where (a)+ := max{a, 0} denotes the positive part of the number a. Note the emergence of the universal density, the semicircle law, that is independent of the details of the distribution of the matrix elements. The

11 semicircle law holds beyond Wigner matrices: it characterizes the eigenvalue density of the universal Wigner matrices (see Definition 2.1). For the random covariance matrices (2.9), the empirical density of eigenvalues λi of H converges to the Marchenko-Pastur law [103] in the limit when M,N → ∞ such that d = N/M is fixed 0 < d 6 1: s Z b (λ − x)(x − λ ) 1  1 + − + lim # i : λi ∈ [a, b] = %MP (x)dx, %MP (x) := 2 (3.2) N→∞ N a 2πd x √ 2 with λ± := (1 ± d) being the spectral edges. Note that in case M 6 N, the matrix H has macroscopically many zero eigenvalues, otherwise the spectra of XX∗ and X∗X coincide so the Marchenko-Pastur law can be applied to all nonzero eigenvalues with the role of M and N exchanged.

3.2 The moment method The eigenvalue density is commonly approached via the fairly robust moment method (see [8] for an expos´e) that was also the original approach of Wigner to prove the semicircle law [137]. The following proof is −1 formulated for Wigner matrices with identically distributed entries, in particular with variances sij = N , but a slight modification of the argument also applies to more general ensembles satisfying (2.5). The simplest moment is the second moment, Tr H2:

N N X 2 2 X 2 λi = Tr H = |hij| . i=1 i,j=1 Taking expectation and using (2.5) we have 1 X 1 X λ2 = s = 1. N E i N ij i ij In general, we can compute traces of all even powers of H, i.e.,

2k E Tr H by expanding the product as X E hi1i2 hi2i3 . . . hi2ki1 . (3.3) i1,i2,...i2k

Since the expectation of each matrix element vanishes, each factor hxy must be paired with at least another ¯ copy hyx = hxy, otherwise the expectation value is zero. In general, there is no need that the sequence form a perfect pairing. If we identify hyx with hxy, the only restriction is that the matrix element hxy should appear at least twice in the sequence. Under this condition, the main contribution comes from the index sequences which satisfy the perfect pairing condition such that each hxy is paired with a unique hyx in (3.3). These sequences can be classified according to their complexity, and it turns out that the main contribution comes from the so-called backtracking sequences. An index sequence i1i2i3 . . . i2ki1, returning to the original index i1, is called backtracking if it can be successively generated by a substitution rule

a → aba, b ∈ {1, 2,...,N}, b 6= a, (3.4) with an arbitrary index b. For example, we represent the term

hi1i2 hi2i3 hi3i2 hi2i4 hi4i5 hi5i4 hi4i2 hi2i1 , i1, i2, . . . , i5 ∈ {1, 2,...,N}, (3.5) in the expansion of Tr H8 (k = 4) by a walk of length 2 × 4 on the set {1, 2,...,N}. This path is generated by the operation (3.4) in the following order

i1 → i1i2i1 → i1i2i3i2i1 → i1i2i3i2i4i2i1 → i1i2i3i2i4i5i4i2i1, with i1 6= i2, i2 6= i3, i3 6= i4, i4 6= i5.

12 It may happen that two non-successive labels coincide (e.g. i1 = i3), but the contribution of such terms is by a factor 1/N less than the leading term so we can neglect them. Thus we assume that all labels i1, i2, i3, i4, i5 are distinct. We may bookkeep these paths by two objects:

a) a graph on vertices labelled by 1, 2, 3, 4, 5 (i.e., by the indices of the labels ij) and with edges defined by walking over the vertices in following order 1, 2, 3, 2, 4, 5, 4, 2, 1, i.e., the order of the indices of the labels in (3.5);

b) an assignment of a distinct label ij, j = 1, 2, 3, 4, 5, to each vertex. It is easy to see that the graph generated by backtracking sequences is a (double-edged) tree on the vertices 1, 2, 3, 4, 5. Now we count the combinatorics of the objects a) and b) separately. We start with a). Lemma 3.1. The number of graphs with k double-edges, derived from backtracking sequences, is explicitly 1 2k given by the Catalan numbers, Ck := k+1 k . Proof. There is a one to one correspondence between backtracking sequences and the number of nonnegative one dimensional random walks of length 2k returning to the origin at the time 2k. This is defined by the simple rule, that a forward step a → b in the substitution rule (3.4) will be represented by a step to the right in the 1d random walk, while the backtracking step b → a in (3.4) is step to the left. Since at any moment the number of backtracking steps cannot be larger than the number of forward steps, the walk remains on the nonnegative semi-axis. From this interpretation, it is easy to show that the recursive relation

n−1 X Cn = CkCn−k−1,C0 = C1 = 1, k=0 holds for the number Cn of the backtracking sequences of total length 2n. Elementary combinatorics shows that the solution to this recursion is given by the Catalan numbers, which proves the lemma. We remark that Cn has many other definitions. Alternatively, it is also the number of planar binary trees with n + 1 vertices and one could also use this definition to verify the lemma.

The combinatorics of label-assignments in step b) for backtracking paths of length 2k is N(N −1) ... (N − k) ≈ N k+1 for any fixed k in the N → ∞ limit. It is easy to see that among all paths of length 2k, the backtracking paths support the maximal number (k + 1) independent indices. The combinatorics for all non-backtracking paths is at most CN k, i.e., by a factor 1/N less, thus they are negligible. Hence E Tr H2k can be computed fairly precisely for each finite k: 1 1 2k Tr H2k = + O (N −1). (3.6) N E k + 1 k k

Note that the number of independent labels, N k+1, after dividing by N, exactly cancels the size of the k-fold product of variances, (E|h|2)k = N −k. The expectation of the traces of odd powers of H is negligible since they can never satisfy the pairing condition. One can easily check that Z 2k x ρsc(x)dx = Ck, (3.7) R i.e., the semicircle law is identified as the probability measure on R whose even moments are the Catalan numbers and the odd moments vanish. This proves that 1 Z Tr P (H) → P (x)% (x)dx (3.8) N E sc R for any polynomial P . With standard approximation arguments, the polynomials can be replaced with any continuous functions with compact support and even with characteristic functions of intervals. This proves

13 the semicircle law (3.1) for the Wigner matrices in the “expectation sense”. To rigorously complete the proof of the semicircle law (3.1) without expecatation, we also need to control the variance of Tr P (H) in the left hand side of (3.8). Although this can also be done by the moment method, we will not get into further details here; see [8] for a detailed proof. Finally, we also remark that there are index sequences in (3.3) that require to compute higher than the second moment of h. While the combinatorics of these terms is negligible, the proof above assumed m −m/2 polynomial decay (2.8), i.e., that all rescaled moments of h are finite, E|hij| 6 CmN . This condition can be relaxed by a cutoff argument which we skip here. The interested reader should consult [8] for more details.

3.3 The resolvent method and the Stieltjes transform An alternative approach to the eigenvalue density is via the Stieltjes transform. The empirical measure of the eigenvalues is defined by N 1 X % (dx) := δ(x − λ )dx. N N j j=1 The Stieltjes transform of the empirical measure is defined by

N Z 1 1 1 X 1 d%N (x) m(z) = m (z) := Tr = = (3.9) N N H − z N λ − z x − z j=1 j R

for any z = E + iη, E ∈ R, η > 0. Notice that m is simply the normalized trace of the resolvent of the random matrix H with spectral parameter z. The real part E = Re z will often be referred to ”energy”, alluring to the quantum mechanical interpretation of the spectrum of H. An important property of the Stieltjes transform of any measure on R is that its imaginary part is positive whenever Im z > 0. For large z one can expand mN as follows

∞ 1 1 1 X H m m (z) = Tr = − , (3.10) N N H − z Nz z m=0 so after taking the expectation, using (3.6) and neglecting the error terms, we get

∞ X 1 2k m (z) ≈ − z−(2k+1), (3.11) E N k + 1 k k=0 √ 1 2 which, after some calculus, can be identified as the Laurent series of 2 (−z + z − 4). The approximation becomes exact in the N → ∞ limit. Although the expansion (3.10) is valid only for large z, given that the limit is an analytic function of z, one can extend the relation

1 p 2 lim EmN (z) = (−z + z − 4) (3.12) N→∞ 2 by analytic continuation to the whole upper half plane z = E + iη, η > 0. It is an easy exercise to see that this is exactly the Stieltjes transform of the semicircle density, i.e., Z 1 p %sc(x)dx m (z) := (−z + z2 − 4) = . (3.13) sc 2 x − z R √ The square root function is chosen with a branch cut in the segment [−2, 2] so that z2 − 4  z at infinity. This guarantees that Im msc(z) > 0 for Im z > 0. Since the Stieltjes transform identifies the measure uniquely, and pointwise convergence of Stieltjes transforms implies weak convergence of measures, we obtain

E %N (dx) *%sc(x)dx. (3.14)

14 The relation (3.12) actually holds with high probability, i.e., for any z with Im z > 0,

1 p 2 lim mN (z) = (−z + z − 4), (3.15) N→∞ 2 in probability. In the next sections we will prove this limit with an effective error term via the resolvent method. The semicircle law can be identified in many different ways. The moment method in Section 3.2 utilized the fact that the moments of the semicircle density are given by the Catalan numbers (3.7), which also emerged as the normalized traces of powers of H, see (3.6). The resolvent method relies on the fact that mN −1 approximately satisfies a self-consistent equation, mN ≈ −(z + mN ) , that is very close to the quadratic equation that msc from (3.13) satisfies: 1 msc(z) = − . z + msc(z) In other words, in the resolvent method the semicircle density emerges via a specific relation for its Stieltjes transform. It turns out that this approach allows us to perform a much more precise analysis, especially in the short scale regime, where Im z approaches to 0 as a function of N. Since the Stieltjes transform of a measure at spectral parameter z = E + iη essentially identifies the measure around E on scale η > 0, a precise understanding of mN (z) for small Im z will yield a local version of the semicircle law. This will be explained in Section 6.

15 4 Invariant ensembles

4.1 Joint density of eigenvalues for invariant ensembles There is another natural way to define probability distributions on real symmetric or complex Hermitian matrices apart from directly imposing a given probability law ν on their entries. They are obtained by defining a density function directly on the set of matrices:

−1 P(H)dH := Z exp (−N Tr V (H))dH. (4.1) Q Here dH = dHij is the flat Lebesgue measure (in case of complex Hermitian matrices and i < j, dHij i6j is the Lebesgue measure on the complex plane C). The function V : R → R is assumed to grow mildly at infinity (some logarithmic growth would suffice) to ensure that the measure defined in (4.2) is finite, and Z is the normalization factor. Probability distributions of the form (4.1) are called invariant ensembles since they are invariant under the orthogonal or unitary conjugation (in case of symmetric or Hermitian matrices, respectively). For example, in the Hermitian case, for any fixed U, the transformation

H → U ∗HU

leaves the distribution (4.1) invariant thanks to Tr V (U ∗HU) = Tr V (H). Wigner matrices and invariant ensembles form two different universes with quite different mathematical tools available for their studies. In fact, these two classes are almost disjoint, the Gaussian ensembles being the only invariant Wigner matrices. This is the content of the following lemma:

Lemma 4.1 ( [37] or Theorem 2.6.3 [105]). Suppose that the real symmetric or complex ensembles given in (4.1) have independent entries hij, i 6 j. Then V (x) is a quadratic polynomial, V (x) = ax2 + bx + c with a > 0. This means that apart from a trivial shift and normalization, the ensemble is GOE or GUE.

For ensembles that remain invariant under the transformations H → U ∗HU for any unitary matrix U (or, in case of symmetric matrices H, for any U), the joint probability density function of all the N eigenvalues can be explicitly computed. These ensembles are typically given by

 β  (H)dH = Z−1 exp − N Tr V (H) dH. (4.2) P 2 which is of the form (4.1) with a traditional extra factor β/2 that makes some later formulas nicer. The parameter β is determined by the symmetry type; β = 1 for real symmetric ensembles and β = 2 for complex Hermitian ensembles. The joint (symmetrized) probability density of the eigenvalues of H can be computed explicitly:

β PN Y β − N V (λj ) pN (λ1, λ2, . . . , λN ) = const. (λi − λj) e 2 j=1 . (4.3) i

1 2 In particular, for the Gaussian case, V (λ) = 2 λ is quadratic and thus the joint distribution of the GOE (β = 1) and GUE (β = 2) eigenvalues is given by

Y β − 1 βN PN λ2 pN (λ1, λ2, . . . , λN ) = const. (λi − λj) e 4 j=1 j . (4.4) i

In particular, the eigenvalues are strongly correlated. (In this section we neglect the ordering of the eigen- values and we will consider symmetrized statistics.) The emergence of the Vandermonde determinant in (4.3) is a result of integrating out the ”angle” variables in (4.2), i.e., the unitary matrix in the diagonalization of H = UΛU ∗. For illustration, we now show this

16 formula for a 2 × 2 matrix. Consider first the real case. By diagonalization, any real symmetric 2 × 2 matrix can be written in the form         x z cos θ − sin θ λ1 0 cos θ sin θ H = = , x, y, z ∈ R. (4.5) z y sin θ cos θ 0 λ2 − sin θ cos θ

Direct calculation shows that the Jacobian of the coordinate transformation from (x, y, z) to (λ1, λ2, θ) is

(λ1 − λ2). (4.6)

The complex case is slightly more complicated. We can write with z = u + iv and x, y ∈ R x z λ 0  H = = eiA 1 e−iA, (4.7) z¯ y 0 λ2

where A is a Hermitian matrix with trace zero, thus it can be written in the form

 a b + ic A = , a, b, c ∈ . (4.8) b − ic −a R

This parametrization of SU(2) with three real degrees of freedom is standard but for our purpose we only need two in (4.7) in addition to the two degrees of freedoms from λ’s. The reason is that the two phases of the eigenvectors are redundant and the trace zero condition only takes out one degree of freedom, leaving one more superfluous parameter. We will see that eventually a plays no role. First we evaluate the Jacobian at A = 0 from the formula (4.7), we only need to keep the leading order in A which gives

x z λ 0  = Λ + i[A, Λ] + O(kAk2), Λ = 1 . (4.9) z¯ y 0 λ2

Thus x z  λ i(b + ic)(λ − λ ) = 1 2 1 + O(kAk2), (4.10) z¯ y −i(b − ic)(λ2 − λ1) λ2

and the Jacobian of the transformation from (x, y, z) to (λ1, λ2, b, c) at b = c = 0 is of the form

1 0 0 0  2 0 1 0 0  2 C(λ2 − λ1)   = C(λ1 − λ2) (4.11) 0 0 i −i 0 0 −1 −1

with some constant C. To compute the Jacobian not at the identity, we first notice that by rotation invariance, the measure factorizes. This means that its density with respect to the Lebesgue measure can be written of the form f(Λ)g(U) with some functions f and g, in fact g(U) is constant (the marginal on the unitary part is the Haar measure). The function f may be computed at any point, in particular at U = I, 2 this was the calculation (4.11) yielding f(Λ) = C(λ1 − λ2) . This proves (4.4) for N = 2, modulo the case of multiple eigenvalue λ1 = λ2 where the parametrization of U is even more redundant. But this set has zero measure, so it is negligible (see [8] for a precise argument). The formula (4.9) is the basis for the proof for general N which we leave it to the readers. The detailed proof can be found in [8] or [37]. It is often useful to think of the measure (4.4) as a Gibbs measure on N ”particles” or ”points” λ = (λ1, λ2, . . . , λN ) of the form

N e−βNH(λ) 1 X 1 X µ (dλ) = p (λ)dλ = , H(λ) := V (λ ) − log |λ − λ | (4.12) N N Z 2 i N j i i=1 i

17 particle, in contrast to the standard statistical physics terminology where the ”Hamiltonian” refers to the total energy. This explains the unusual N factor in the exponent. Notice the emergence of the Vandermonde determinant in (4.3), which directly comes from integrating out the Haar measure and the symmetry type of the ensemble, appears through the exponent β. Only the ”classical” β = 1, 2 or 4 cases correspond to matrix ensembles of the form (4.2), namely, to the real symmetric, complex Hermitian and self-dual matrices. We will not give the precise definition of the latter (see, e.g. Chapter 7 of [105] or [64]), just mention that this is the natural generalization of symmetric or Hermitian matrices to quaternion entries and they have real eigenvalues. We remark that despite the convenience of the explicit formula (4.3) or (4.12) for the joint density, computing various statistics, such as correlation functions or even the density of a single eigenvalue, is a highly nontrivial task. For example, the density involves ”only” integrating out all but one eigenvalue, but the measure is far from being a product, so these integrations cannot be performed directly when N is large. The measure (4.12) has a strong and long range interaction, while conventional methods of statistical physics are well suited for short range interactions. In fact, from this point of view β can be any positive number and does not have to be restricted to the specific values β = 1, 2, 4. For other values of β there is no invariant matrix ensemble behind the measure (4.12), but it is still a very interesting statistical mechanical system, called the log gas or β-ensemble. If the potential V is quadratic, then (4.12) coincides with (4.4) and it is called the Gaussian β-ensemble. We will briefly discuss log gases in Section 18.3.

4.2 Universality of classical invariant ensembles via orthogonal polynomials The classical values β = 1, 2, 4 in the ensemble (4.12) are specific not only because they originate from an invariant matrix ensemble (4.3). For these specific values an extra mathematical structure emerges, namely the orthogonal polynomials with respect to the weight function e−NV (λ) on the real line. Owing to this structure, much more is known about these ensembles than about log gases with general β. This approach was originally applied by Mehta and Gaudin [105, 106] to compute the gap distribution for the Gaussian case that involved the classical Hermite orthonormal polynomials. Dyson [47] computed the local correlation functions for a related ensemble (circular ensemble) that was extended to the standard Gaussian ensembles by Mehta [104]. Later a general method using orthogonal polynomials and the Riemann-Hilbert problem has been developed to tackle a very general class of invariant ensembles (see, e.g. [20, 37, 40–42, 71, 105, 111] and references therein). For simplicity, to illustrate the connection, we will consider the Hermitian case β = 2 with a Gaussian potential V (λ) = λ2/2 (which, by Lemma 4.1, is also a Wigner matrix ensemble, namely the GUE). To simplify the presentation further, for the purpose of this subsection only, we rescale the eigenvalues √ x = Nλ, (4.13) which effectively removes the factor N from the exponent in (4.3). (This simple scaling works only in the pure Gaussian case, and it is only a technical convenience to simplify formulas.) After the rescaling and setting β = 2, the measure we will consider is given by a density which we denote by N Y 2 Y − 1 x2 pbN (x1, x2, . . . , xN ) = const. (xi − xj) e 2 j . (4.14) i

−x2/2 Let Pk(x) be the k-th orthogonal polynomial on R with respect to the weight function e with leading coefficient 1. Let −x2/4 e Pk(x) ψk(x) := 2 −x /4 2 ke PkkL (R) be the corresponding orthonormal function, i.e., Z ψk(x)ψ`(x)dx = δk,`. (4.15) R

18 In the particular case of the Gaussian weight function, Pk is given by the

k 2 d 2 P (x) = H (x) := (−1)kex /2 e−x /2, k k dxk and P (x) 2 ψ (x) = k e−x /4, (4.16) k (2π)1/4(k!)1/2 but for the following discussion we will not need these explicit formulae, only the orthonormality relation (4.15). The key observation is that, by simple properties of the Vandermonde determinant, we have

Y j−1N N ∆N (x) = (xj − xi) = det xi i,j=1 = det Pj−1(xi) i,j=1, (4.17) 16i

2 N h N i Y −x2/2 p (x , . . . , x ) =C det P (x ) e i (4.19) bN 1 N N j−1 i i,j=1 i=1 2 0 h N i 0 N =CN det ψj−1(xi) i,j=1 = CN det KN (xi, xj) i,j=1, N N where in the last step we used that the product of the matrices ψj−1(xi) i,j=1 and ψj−1(xk) j,k=1 is exactly N KN (xi, xk) i,k=1 and we did not follow the precise constants for simplicity. The determinantal structure (4.19) for the joint density functions is remarkable: it allows one to describe a function of N variables in terms of a kernel of two variables only. This structure greatly simplifies the analysis. We now define the correlation functions that play a crucial role to describe universality.

Definition 4.2. Let pN (λ1, λ2, . . . , λN ) be the joint symmetrized of the eigenvalues (without rescaling (4.13)). For any n > 1, the n-point correlation function is defined by Z (n) pN (λ1, λ2, . . . , λn) := pN (λ1, . . . , λn, λn+1, . . . λN )dλn+1 ... dλN . (4.20) N−n R (n) We also define the rescaled correlation function pbN (x1, x2, . . . , xn) in a similar way. Remark. In other sections of this book we usually label the eigenvalues in increasing order so that their probability density, denoted by peN (λ), is defined on the set (N) N Ξ := {λ1 < λ2 < . . . < λN } ⊂ R .

For the purpose of Definition 4.2, however, we dropped this restriction and we consider pN (λ1, λ2, . . . , λN ) N to be a symmetric function of N variables, λ = (λ1, . . . , λN ) on R . The relation between the ordered and (N) unordered densities is clearly peN (λ) = N! pN (λ) · 1(λ ∈ Ξ ). The significance of the correlation functions is that with their help one can compute the expectation value of any symmetrized observable. For example, for any test function O of two variables we have, directly from the definition of the correlation functions, that Z 1 X (2) O(λ , λ ) = O(λ , λ )p (λ , λ )dλ dλ , (4.21) N(N − 1)E i j 1 2 N 1 2 1 2 i6=j R×R

19 where the expectation is w.r.t. the probability density pN or in this case w.r.t. the original random matrix ensemble. Similar formula holds for observables of any number of variables. To compute the correlation functions of a determinantal joint density (4.19), we start with the following prototype calculation for N = 3, n = 2   Z K3(x1, x1) K3(x1, x2) K3(x1, x3) dx3 K3(x2, x1) K3(x2, x2) K3(x2, x3) (4.22) R K3(x3, x1) K3(x3, x2) K3(x3, x3) Z   K3(x2, x1) K3(x2, x2) = dx3 K3(x1, x3) K3(x3, x1) K3(x3, x2) R Z   K3(x1, x1) K3(x1, x2) − dx3 K3(x2, x3) K3(x3, x1) K3(x3, x2) R Z   K3(x1, x1) K3(x1, x2) + dx3 K3(x3, x3). K3(x2, x1) K3(x2, x2) R From the definition (4.18) and the orthonormality of the ψ’s we have the reproducing property Z dyKN (x, y)KN (y, z) = KN (x, z) (4.23) R and the normalization Z dxKN (x, x) = N. (4.24) R Thus (4.22) equals to

K (x , x ) K (x , x ) K (x , x ) K (x , x ) K (x , x ) K (x , x ) 3 2 1 3 2 2 − 3 1 1 3 1 2 + 3 3 1 1 3 1 2 (4.25) K3(x1, x1) K3(x1, x2) K3(x2, x1) K3(x2, x2) K3(x2, x1) K3(x2, x2) K (x , x ) K (x , x ) = 3 1 1 3 1 2 . (4.26) K3(x2, x1) K3(x2, x2)

It is easy to generalize this computation to get

(N − n)! p(n)(x , . . . , x ) = det[K (x , x )]n , (4.27) bN 1 n N! N i j i,j=1 i.e., the correlation functions continue to have a determinantal structure. Here the constant is obtained by (n) the normalization condition that pbN is a probability density. Thus we have an explicit formula for the correlation functions in terms of the kernel KN . We note that this structure is very general and it is not restricted to Hermite polynomials, it only requires a system of orthogonal polynomials. To understand the behavior of KN , first we recall a basic algebraic property of the orthogonal polynomials, the Christoffel–Darboux formula:

N−1 " # X √ ψN (x)ψN−1(y) − ψN (y)ψN−1(x) K (x, y) = ψ (x)ψ (y) = N . (4.28) N j j x − y j=0

Since the Hermite polynomials and the orthonormal functions ψN differ only by an exponential factor (4.16), and these factors in ψ(x)ψ(y) on both side of (4.28) are cancelled, so (4.28) it is just a property of the Hermite polynomials. We now sketch a proof of this identity. Multiplying both side by (x − y), we need to prove that

N−1 X √ h i ψj(x)ψj(y)(x − y) = N ψN (x)ψN−1(y) − ψN (y)ψN−1(x) . (4.29) j=0

20 The left side is a polynomial of degree N in x and degree N in y, up to common exponential factors. The multiplication of x or y can be computed by the following identity p p xψj(x) = j + 1ψj+1(x) + jψj−1(x) (4.30) that directly follows from the “three-term” relation for the Hermite polynomials. Collecting all the terms generated in this way, we obtain the right hand side of (4.29). Details can be found in Lemma 3.2.7 of [8]. It is well known that orthogonal polynomials of high degree have asymptotic behavior (Plancherel-Rotach asymptotics). For the Hermite orthonormal functions ψ these formulas read as follows: (−1)m √ (−1)m √ ψ (x) = √ cos Nx + o(N −1/4), ψ (x) = √ sin Nx + o(N −1/4), (4.31) 2m N 1/4 π 2m+1 N 1/4 π

−1/2 as N → ∞ for any m such that |2m − N| 6 C. The approximation is uniform for |x| 6 CN . We can thus compute that √ √ √ √ √ 1 sin( Nx) cos( Ny) − sin( Ny) cos( Nx) sin N(x − y) K (x, y) ≈ = , (4.32) N π x − y π(x − y) i.e., the celebrated sine-kernel emerged [47, 104]. √ To rewrite this formula into a canonical form, recall that we have done a rescaling (4.13) λ = x/ N (2) where λ is the original variable. The two point function in the original variables was denoted by pN and (2) (2) (2) the rescaled variables by pbN . The relation between pN and pbN is determined by (2) (2) pN (λ1, λ2)dλ1dλ2 = pbN (x1, x2)dx1dx2, (4.33) and thus we have (2) (2)√ √  pN (λ1, λ2) = NpbN Nλ1, Nλ2 . (4.34) Now we introduce another rescaling of the eigenvalues that rescales the typical gap between them to order 1. We set aj 1 λj = ,%sc(0) = , (4.35) %sc(0)N π where %sc is the semicircle density, see (3.1). Using the expression (4.20) for correlation functions, we have, in terms of original variable, that

N N ! 1 (2) a1 a2  Ke11 Ke12 2 pN , = det N N , (4.36) [%sc(0)] %sc(0)N %sc(0)N Ke21 Ke22 where

N 1 1  a1 a2  sin x Ke12 := √ KN √ , √ *S(a1 − a2),S(x) := , (4.37) %sc(0) N − 1 ρsc(0) N ρsc(0) N πx where we used (4.32). Due to the rescaling, this calculation reveals the correlation functions around E = 0. The general formula for any fixed energy E in the bulk, i.e., |E| < 2, can be obtained similarly and it is given by

1 (n) α1 α2 αn  (n) n n pN E + ,E + ,...,E + * qGUE (α) := det S(αi − αj) i,j=1 (4.38) [%sc(E)] N%sc(E) N%sc(E) N%sc(E) as weak convergence of functions in the variables α = (α1, . . . , αn). Note that the limit is universal in a sense that it is independent of the energy E. Formula (4.38) holds for the GUE case. The corresponding expression for GOE is more involved [8, 105]

 0  (n) n S(x) S (x) x qGOE (α) := det K(αi − αj) i,j=1,K(x) := 1 R . (4.39) − 2 sgn(x) + 0 S(t)dt S(x)

21 Here the determinant is understood as the trace of the quaternion determinant after the canonical corre- spondance between a · 1 + b · i + c · j + d · k, a, b, c, d ∈ C, and 2 × 2 complex matrices given by 1 0 i 0   0 1 0 i 1 ↔ i ↔ j ↔ k ↔ . 0 1 0 −i −1 0 i 0 The main technical input is the refined asymptotic formulae (4.31) for orthogonal polynomials. In case of the classical orthogonal polynomials (appearing in the standard Gaussian Wigner and Wishart ensembles) they are usually obtained by a Laplace asymptotics from their integral representation. For a general potential V the corresponding analysis is quite involved and depends on the regularity properties of V . One successful approach was initiated by Fokas, Its and Kitaev [71] and by P. Deift and collaborators via the Riemann- Hilbert method, see [37] and references therein. An alternative method was presented in [98,101] using more direct methods from orthogonal polynomials.

4.2.1 Edge universality: the Airy kernel Near the spectral edges, i.e., for energy E = ±2, a different scaling has to be used. Recall the formula " # √ ψ (x)ψ (y) − ψ (y)ψ (x) K (x, y) = N N N−1 N N−1 N x − y

in the rescaled variables x, y. We will need the following identity of the derivatives of the Hermite functions x √ ψ0 (x) = − ψ (x) + Nψ (x). (4.40) N 2 N N−1 Thus we can rewrite " # ψ (x)ψ0 (y) − ψ (y)ψ0 (x) 1 K (x, y) = N N N N − ψ (x)ψ (y) . (4.41) N x − y 2 N N

Define a new rescaled function  √ u  Ψ (u) := N 1/12ψ 2 N + . (4.42) N N N 1/6 The Plancherel-Rotach edge asymptotics for Ψ asserts that

lim |ΨN (z) − Ai(z)| = 0 (4.43) N→∞

in any compact domain in C, where Ai(x) is the Airy function, i.e., 1 Z ∞ 1  Ai(x) = cos t3 + xt dt. π 0 3 It is well-known that the Airy function is the solution to the second order differential equation, y00 − xy = 0, with vanishing boundary condition at x = ∞. We now define the Airy kernel by Ai(u)Ai0(v) − Ai0(u)Ai(v) A(u, v) := . u − v Under the edge scaling (4.42), we have  √ u √ v  N −1/6K 2 N + , 2 N + → A(u, v). (4.44) N N 1/6 N 1/6 In these variables, we have, by (4.34), that  α α   √ α √ α  p(2) 2 + 1 , 2 + 2 = Np(2) 2 N + 1 , 2 N + 1 . N N 2/3 N 2/3 bN N 1/6 N 1/6

22 By using (4.20), we can continue with

2 (2) α1 α2  1 h  √ αi √ αj i pN 2 + , 2 + = N det KN 2 N + , 2 N + N 2/3 N 2/3 N(N − 1) N 1/6 N 1/6 i,j=1 √ √ 2 −2/3 h −1/6  αi αj i  N det N KN 2 N + , 2 N + , N 1/6 N 1/6 i,j=1 and similar formulas hold for any k-point correlation functions. Using the limiting statement (4.44), in terms of the original variables, we obtain

 α α α  k N k/3p(k) 2 + 1 , 2 + 2 ,..., 2 + k * det A(α , α ) (4.45) N N 2/3 N 2/3 N 2/3 i j i,j=1 in a weak sense. In particular, the last formula with k = 2 implies, for any smooth test function O with compact support, that Z X (2) α1 α2  O(N 2/3(λ − 2)),N 2/3(λ − 2)) = N(N − 1)N −4/3 dα dα O(α , α )p 2 + , 2 + E j k 1 2 1 2 N 2/3 2/3 2 N N j6=k R Z 2 → dα1dα2O(α1, α2) det A(αi, αj) i,j=1. (4.46) 2 R Similar statement holds at the lower spectral edge, E = −2.

23 5 Universality for generalized Wigner matrices 5.1 Different notions of universality The universality of eigenvalue statistics can be considered via several different notions of convergence which yield somewhat different concepts of universality. The local statistics can either be expressed in terms of local correlation functions rescaled around some energy E or one may ask for the gap statistics for a gap λj+1 − λj with a given (typically N-dependent) label j. These will be called fixed energy and fixed gap universality and these two concepts do not coincide. To see this, notice that since eigenvalues fluctuate on a scale much larger than the typical eigenvalue spacing, the label j of the eigenvalue λj closest to a fixed energy E is not a deterministic function of E. Furthermore, one may look for cumulative statistics of gaps averaged over a mesoscopic scale, i.e., both above concepts have a natural averaged version. We now define these four concepts precisely. The correlation functions and the gaps need to be rescaled by the limiting local density, %(E), to get a universal limit. In case of generalized Wigner matrices we have %(E) = %sc(E), but the definitions below hold in more general setup as well. We also remark that all results in this book hold for both real symmetric matrices or complex Hermitian matrices. For simplicity of notations, we formulate all concepts and later state all results in terms of real symmetric matrices. We recall the notation A, B := Z ∩ [A, B] for the set of integers between two real numbers A < B. J K n (i) Fixed energy universality (in the bulk): For any n > 1, F : R → R smooth and compactly supported function and for any κ > 0, we have, uniformly in E ∈ [−2 + κ, 2 − κ], Z   Z 1 (n) α (n) lim n dαF (α)pN E + = dαF (α)qGOE (α) , (5.1) N→∞ %(E) n N%(E) n R R (n) (n) where α = (α1, . . . , αn) . Here pN is the n-point function of the matrix ensemble and qGOE is the (n) limiting n-point function of the GOE defined in (4.39). To shorten the argument of pN , we used the N convention that E + α = (E + α1,...,E + αn) for any α = (α1, . . . , αn) ∈ C . −1+ξ n (ii) Averaged energy universality (in the bulk, on scale N ): For any n > 1, F : R → R smooth and compactly supported function and for some 0 < ξ < 1 and for any κ > 0, we have, uniformly in E ∈ [−2 + κ, 2 − κ], Z E+b Z   Z 1 dx (n) α (n) lim n dαF (α)pN x + dα = dαF (α)qGOE (α) , (5.2) N→∞ %(E) 2b n N%(E) n E−b R R −1+ξ where b = bN := N . (iii) Fixed gap universality (in the bulk): Fix any positive number 0 < α < 1 and an integer n. For any smooth compactly supported function G : Rn → R and for any k, m ∈ αN, (1 − α)N we have J K

µN  lim E G (N%(λk))(λk − λk+1),..., (N%(λk))(λk − λk+n) (5.3) N→∞

(GOE)  − E G (N%(λm))(λm − λm+1),..., (N%(λm))(λm − λm+n) = 0,

where µN denotes the law of the random matrix ensemble under consideration. (iv) Averaged gap universality (in the bulk, on scale N −1+ξ): With the same notations as in (iii) and ` = N ξ with 0 < ξ < 1, we have

1 k+` X µN  lim E G (N%(λj))(λj − λj+1),..., (N%(λj))(λj − λj+n) (5.4) N→∞ 2` + 1 j=k−`

(GOE)  − E G (N%(λm))(λm − λm+1),..., (N%(λm))(λm − λm+n) = 0.

24 Note that in the bulk ` = N ξ consecutive eigenvalues range over a scale N −1+ξ in the spectrum, hence the name.

Although we have formulated the universality notions in terms of large N limits, all limit statements in this book have effective error bounds of the form N −c for some c > 0. For the four notions of universality stated here, the fixed energy (fixed gap, resp.) universality obviously implies averaged energy (averaged gap, resp.) universality. However, the fixed gap universality and the fixed energy universality are not logical consequences of each other. On the other hand, under suitable conditions, the averaged energy universality and averaged gap universality are equivalent. In Section 14 we will prove that the latter implies the former, this is the direction that we actually use in the proof. The opposite direction goes along similar arguments and will be omitted. In this book, we will focus on establishing the averaged energy universality. From this one, the average gap universality follows by the equivalence just mentioned. We now state precisely our representative universality theorem that will be proven in this book. It asserts average energy universality (in the bulk) for generalized Wigner matrices, where the scale of the energy average is reduced to N −1+ξ for arbitrary ξ > 0. For convenience we assume that the normalized matrix entries √ ζij := Nhij (5.5) have a polynomial decay of arbitrary high degree, i.e., for all p ∈ N there is a constant µp such that

p E|ζij| 6 µp (5.6)

−1 for all N, i, and j. Recall from (2.6) that sij  N , so this condition is equivalent to (2.8). We make this assumption in order to streamline notation, but in fact, our results hold, with the same proof, provided (5.6) is valid for some large but fixed p. In fact, even p = 4 + ε is sufficient, see Section 18. On the other hand, if we strengthen the decay condition to uniform subexponential decay (2.7), then certain estimates concerning the local semicircle law (e.g., Theorem 6.7) become stronger, although we will not them in this book (see [70]).

Theorem 5.1. Let√H be an N × N generalized real symmetric Wigner matrix. Suppose that the rescaled matrix elements Nhij satisfy the decay condition (5.6). Then averaged energy universality holds in the sense of (5.1) on scale N −1+ξ for any 0 < ξ < 1.

The rest of this book is devoted to a proof of Theorem 5.1 and related questions on eigenvectors. The proof of this theorem will be based on the following three step strategy.

5.2 The three-step strategy

Step 1. Local semicircle law and delocalization of eigenvectors: It states that the density of eigenvalues is given by the semicircle law not only as a weak limit on macroscopic scales (3.1), but also in a high probability sense with an effective convergence speed and down to short scales containing only N ξ eigenvalues for all ξ > 0. This will imply the rigidity of eigenvalues, i.e., that the eigenvalues are near their classical locations in the sense to be made clear in Theorem 11.5. We also obtain precise estimates on the matrix elements of the Green function which in particular imply complete delocalization of eigenvectors.

Step 2. Universality for√ Gaussian divisible ensembles: The Gaussian divisible ensembles are matrices of −t/2 −t G the form Ht = e H0 + 1 − e H where t > 0 is a parameter, H0 is a (generalized) Wigner matrix and G H is an independent GOE matrix. The parametrization of Ht is chosen so that Ht can be obtained by an Ornstein-Uhlenbeck process starting from H0. More precisely, consider the following matrix Ornstein- Uhlenbeck process 1 1 dHt = √ dBt − Htdt (5.7) N 2

25 with initial data H , where B = {b }N is a symmetric N × N matrix such that its matrix elements 0 √ t ij,t i,j=1 bij,t for i <√ j and bii,t/ 2 are independent standard Brownian motions starting from zero. Then Ht and −t/2 −t G e H0 + 1 − e H have the same distribution. −τ The aim of Step 2 is to prove the bulk universality of Ht for t = N for the entire range of 0 < τ < 1. This is connected to the local ergodicity of the Dyson Brownian motion which we now define.

Definition 5.2. Given a real parameter β > 1, consider the following system of stochastic differential equations (SDE) √   2 λi 1 X 1 dλi = √ dBi + − +  dt , i ∈ 1,N , (5.8) βN 2 N λi − λj j6=i J K

where (Bi) is a collection of real-valued, independent standard Brownian motions. The solution of this SDE is called the Dyson Brownian Motion (DBM).

In a seminal paper [45], Dyson observed that the eigenvalue flow of the matrix OU process is exactly the DBM with β = 1, 2 corresponding to real symmetric or complex Hermitian ensembles. Furthermore, the invariant measure of the DBM is given by the Gaussian β-ensemble defined in (4.4). Dyson further conjectured that the time to “local equilibrium” for DBM is of order 1/N while the time to global equilibrium is of order one. It should be noted that there is no standard notion for the “local equilibrium”; we will instead take a practical point of view to interpret Dyson’s conjecture as that the local statistics of the DBM at any time t  N −1 satisfy the universality defined in earlier in this section. In other words, Dyson’s conjecture −τ is exactly that the local statistics of Ht are universal for t = N for any 0 < τ < 1. Step 3. Approximation by a Gaussian divisible ensemble: It is a simple density argument in the space of matrix ensembles which shows that for any probability distribution of the matrix elements there exists a Gaussian divisible distribution with a small Gaussian component, as in Step 2, such that the two associated Wigner ensembles have asymptotically identical local eigenvalue statistics. The general result to compare any two matrix ensembles with matching moments will be given in Theorem 16.1. Alternatively, to follow the evolution of the Green function under the OU flow, we can use the following continuity of matrix OU process: Step 3a. Continuity of eigenvalues under matrix OU process. In Theorem 15.2 will show that the changes of the local statistics in the bulk under the flow (5.7) up to time scales t  N −1/2 are negligible, see Lemma 15.4. This clearly can be used in combination with Step 2 to complete the proof of Theorem 5.1. The three step strategy outlined here is very general and it has been applied to many different models as we will explain in Section 18. It can also be extended to the edges of the spectrum, yielding the universality at the spectral edges. This will be reviewed in Section 17.

26 6 Local semicircle law for universal Wigner matrices 6.1 Setup We recall the definition of the universal Wigner matrices (Definition 2.1), in particular, the matrix elements may have different distributions but independence (up to symmetry) is always assumed. The fundamental data of this model is the N × N matrix of variances S = (sij), where

2 sij := E |hij| , and we assume that S is (doubly) stochastic: X sij = 1 (6.1) j

for all i. We will assume the polynomial decay analogous to (5.6)

p p/2 E hij 6 µp[sij] , (6.2)

where µp depends only on p and is uniform in i, j and N.  −1 We introduce the parameter M := maxi,j sij that expresses the maximal size of sij:

−1 sij 6 M (6.3)

for all i and j. We regard N as the fundamental parameter and M = MN as a function of N. In this section we do not assume lower bound on sij. Notice that for generalized Wigner matrices M is comparable with N and one may use N everywhere instead of M in all error bounds in this section. However, we wish to keep M as a separate parameter and assume only that

δ N 6 M 6 N (6.4) 1 for some fixed δ > 0. For standard Wigner matrices, hij are identically distributed, hence sij = N and M = N. Another motivating example where M may substantially differ from N is the random band matrices that play a key role interpolating between Wigner matrices and random Schr¨odinger operators (see Section 18.7). Example 6.1. Random band matrices are characterized by translation invariant variances of the form 1 |i − j|  s = f N , (6.5) ij W W where f is a smooth, symmetric probability density on R, W is a large parameter, called the band width, and |i−j|N denotes the periodic distance on the discrete torus T of length N. In this case M is comparable with W which is typically much smaller than N. The generalization in higher spatial dimensions is straightforward, d in this case the rows and columns of H are labelled by a discrete d dimensional torus TL of length L with N = Ld. Recall from (3.1) that the global eigenvalue density of H in the N → ∞ limit follows the celebrated Wigner semicircle law, 1 % (x) := p(4 − x2) , (6.6) sc 2π + and its Stieltjes transform with the spectral parameter z = E + iη is defined by Z % (x) m (z) := sc dx . (6.7) sc x − z R The spectral parameter z will sometimes be omitted from the notation. The two endpoints ±2 of the support of %sc are called the spectral edges.

27 It is well-known (see Section 3.3) that the Stieltjes transform msc is the unique solution of the quadratic equation 1 m(z) + + z = 0 (6.8) m(z) with Im m(z) > 0 for Im z > 0. Thus we have √ −z + z2 − 4 m (z) = , (6.9) sc 2 √ 2 where the square root is chosen so that Im msc(z) > 0 for Im z > 0. In particular, z − 4  z for z large so that it cancels the −z term in the above formula. Under this convention, we collect basic bounds on msc in the following lemma. Lemma 6.2. We have for all z = E + iη with η > 0 that

−1 |msc(z)| = |msc(z) + z| 6 1. (6.10) Furthermore, there is a constant c > 0 such that for E ∈ [−10, 10] and η ∈ (0, 10] we have

c 6 |msc(z)| 6 1 − cη , (6.11) 2 √ |1 − msc(z)|  κ + η , (6.12) as well as (√ κ + η if |E| 6 2 Im msc(z)  (6.13) √ η κ+η if |E| > 2 , where κ := |E| − 2 denotes the distance of E to the spectral edges. Proof. The proof is an elementary exercise using (6.9). We define the Green function or the resolvent of H through

G(z) := (H − z)−1 , and denote its entries by Gij(z). We recall from (3.9) that the Stieltjes transform of the empirical spectral measure 1 X %(dx) = % (dx) := δ(λ − x)dx N N α α

for the eigenvalues λ1 6 λ2 6 ... 6 λN of H is Z %N (dx) 1 1 X 1 m(z) = m (z) := = Tr G(z) = . (6.14) N x − z N N z − λ R α α We remark that every quantity related to the random matrix H, such as the eigenvalues, Green function, empirical and its Stieltjes transform, all depend on N, but this dependence will often be omitted in the notation for brevity. In some formulas, especially in statements of the main results we will put back the N dependence to stress its presence. Since 1 1 1 η Im = θη(E − λα), with θη(x) := 2 2 π z − λα π x + η −1 is an approximation to the identity (i.e., delta function) at the scale η = Im z, we have π Im mN (z) = %N ∗ θη(z), i.e., the imaginary part of mN (z) is the density of the eigenvalues “at the scale η”. Thus the convergence of the Stieltjes transform mN (z) to msc(z) as N → ∞ will show that the empirical local density of the eigenvalues around the energy E in a window of size η converges to the semicircle law %sc(E). Therefore the key task is to control mN (z) for small η.

28 6.2 Spectral information on S

We will show that the elements of the resolvent Gii(z) satisfy a system of self-consistent vector equation of the form N 1 X − ≈ z + s G (z) (6.15) G (z) ij jj ii j=1 with very high probability. This equation will be viewed as a small perturbation of the deterministic equation

N 1 X − = z + s m (z) (6.16) m (z) ij i i j=1

which, under the side condition that Im mi > 0, has a unique solution, namely mi(z) = msc(z) for every i (see (6.8)). Here the stochasticity condition (6.1) is essential. For the stability analysis of (6.16), the 2 invertibility of the operator 1 − msc(z)S plays a key role. Therefore an important parameter of the model is 1 Γ(z) := , Im z > 0. (6.17) 2 1 − msc(z)S ∞→∞

Note that S, being a , satisfies −1 6 S 6 1, and 1 is an eigenvalue with eigenvector e = N −1/2(1, 1,..., 1), Se = e. For convenience we assume that 1 is a simple eigenvalue of S (which holds if S is irreducible and aperiodic). Another important parameter is

1 Γ(z) := , (6.18) e 2 1 − msc(z)S ⊥ e ∞→∞

2 −1 i.e., the norm of (1 − mscS) restricted to the subspace orthogonal to the constants. Recalling that −1 6 S 6 1 and using the upper bound |msc| 6 1 from (6.11), we find that there is a constant c > 0 such that c 6 Γe 6 Γ . (6.19) 2 −1 Since |msc(z)| 6 1 − cη , we can expand (1 − mscS) into a geometric series. Using that kSk`∞→`∞ 6 1 from (6.1), we obtain the trivial upper bound

−1 Γ 6 Cη . (6.20) We remark that we also have the following easy lower bound on Γ: 1 Γ |1 − m2 |−1 . (6.21) > sc > 2

2 −1 This can be seen by applying the matrix (1 − mscS) to the constant vector and using the definition of Γ. 1 For standard Wigner matrices, sij = N , and for bounded energies |E| 6 C, we easily obtain that 1 1 √ Γ(z) = 2  , Γ(e z) = 1, (6.22) |1 − msc(z)| κE + η where κE := min{|E − 2|, |E + 2|} (6.23) is the distance of E = Re z from the spectral edges. To see this, we note that in this case S = Pe, the 2 −1 projection operator onto the direction e. Hence the operator [1 − msc(z)S] can be computed explicitly. The comparison relation in (6.22) follows from (6.12). In the general case, we have the same relations up to a constant factor:

29 Lemma 6.3. For generalized Wigner matrices (Definition 2.1) and for bounded range of energies, say for any |E| 6 10, we have c C C √ 2 6 Γ(z) 6 2  , c 6 Γ(e z) 6 C, (6.24) |1 − msc(z)| |1 − msc(z)| κE + η

where the constant C depends on Cinf and Csup in (2.6) and the constant c in the lower bound is universal. The proof will be given in Section 6.5.

6.3 Stochastic domination The following definition introduces a notion of a high-probability bound that is suited for our purposes. It first appeared in [55] and it relieves us from the burden to keep track of exceptional sets of small probability where some bound does not hold. Definition 6.4 (Stochastic domination). Let

(N) (N) (N) (N) X = X (u): N ∈ N, u ∈ U ,Y = Y (u): N ∈ N, u ∈ U be two families of nonnegative random variables, where U (N) is a possibly N-dependent parameter set. We say that X is stochastically dominated by Y , uniformly in u, if for all (small) ε > 0 and (large) D > 0 we have h (N) ε (N) i −D sup P X (u) > N Y (u) 6 N u∈U (N) for large enough N > N0(ε, D). Unless stated otherwise, throughout this paper the stochastic domination will always be uniform in all parameters apart from the parameter δ in (6.4) and the sequence of constants µp in (5.6); thus, N0(ε, D) also depends on δ and µp. If X is stochastically dominated by Y , uniformly in u, we use the notation X ≺ Y . Moreover, if for some complex family X we have |X| ≺ Y we also write X = O≺(Y ). The following proposition collects some basic properties of the stochastic domination, the proofs are left as an exercise. Proposition 6.5. The relation ≺ satisfies the following properties: i) ≺ is transitive: X ≺ Y and Y ≺ Z imply X ≺ Z;

ii) ≺ satisfies the familiar arithmetic rules of order relations, i.e., if X1 ≺ Y1 and X2 ≺ Y2 then X1 +X2 ≺ Y1 + Y2 and X1X2 ≺ Y1Y2; iii) Moreover, the following cancellation property holds

if X ≺ Y + N −εX for some ε > 0, then X ≺ Y ; (6.25)

−C C iv) Furthermore, if X ≺ Y , EY > N and |X| 6 N almost surely with some fixed exponent C, then for any ε > 0 and sufficiently large N > N0(ε) we have ε EX 6 N EY. (6.26)

Later in Lemma 10.1 the relation (6.26) will be extended to partial expectations. We now define appropriate subsets of the spectral parameter z. Definition 6.6 (Spectral domain). We call an N-dependent family

(N)  −1 D ≡ D ⊂ z : |E| 6 10 ,M 6 η 6 10

a spectral domain. (Recall that M ≡ MN depends on N.)

30 (N) (N) We always consider families X (u) = Xi (z) indexed by u = (z, i), where z takes on values in some spectral domain D, and i takes on values in some finite (possibly N-dependent or empty) index set. The stochastic domination X ≺ Y of such families will always be uniform in z and i, and we usually do not state this explicitly. Usually, which spectral domain D is meant will be clear from the context, in which case we shall not mention it explicitly. For example, using Chebyshev’s inequality and (6.2) one easily finds that

1/2 −1/2 hij ≺ (sij) ≺ M , (6.27)

1/2 uniformly in i and j, so that we may also write hij = O≺((sij) ). The definition of ≺ with the polynomial factors N −ε and N −D are tailored for the assumption (6.2). We remark that if the analogous subexponential decay (2.7) is assumed then a stronger form of stochastic domination can be introduced but we will not pursue this direction here.

6.4 Statement of the local semicircle law The local semicircle law is very sensitive to the closeness of η = Im z to the real axis; when η is too small, the resolvent and its trace becomes strongly fluctuating. So our results will hold only above a certain threshold for η. We now define the lower threshold for η that depends on the energy E ∈ [−10, 10] ( ) 1  M −γ M −2γ  ηE := min η : min , holds for all ξ η . (6.28) e 6 3 4 > Mξ Γ(e E + iξ) Γ(e E + iξ) Im msc(E + iξ)

Although this expression looks complicated, we shall see that it comes out naturally in the analysis of the self-consistent equations for the Green functions. Here γ > 0 is a parameter that can be chosen arbitrarily small; for all practical purposes the reader can neglect it. For generalized Wigner matrices, M  N, from (6.24) we have −1+2γ ηeE 6 CN , (6.29) i.e., we will get the local semicircle law on the smallest possible scale η  N −1, modulo an M γ correction with an arbitrary small exponent. We remark that if we assume subexponential decay (2.7) instead of the polynomial decay (6.2), then the small M γ correction can be replaced with a (log M)C factor. Finally we define our fundamental control parameter s Im m (z) 1 Π(z) := sc + . (6.30) Mη Mη

We can now state the main result of this section, which in this full generality first appeared in [55]. Previous results that have cumulatively led to this general formulation will be summarized at the end of the section. Theorem 6.7 (Local semicircle law [55]). Consider a universal Wigner matrix satisfying the polynomial decay condition (6.2) and (6.3). Then, uniformly in the energy |E| 6 10 and η ∈ [ηeE, 10], we have the bounds s Im msc(z) 1 max Gij(z) − δijmsc(z) ≺ Π(z) = + , z = E + iη, (6.31) i,j Mη Mη as well as 1 mN (z) − msc(z) ≺ . (6.32) Mη Moreover, outside of the spectrum we have the stronger estimate 1 1 √ mN (z) − m(z) ≺ + 2 (6.33) M(κE + η) (Mη) κE + η

31 √ γ uniformly in z ∈ {z : 2 |E| 10 , ηE η 10, Mη κE + η M } for any fixed γ > 0, where 6 6 e 6 6 > κE := |E| − 2 . −1+2γ For generalized Wigner matrix, the threshold ηeE can be chosen ηeE = N . We point out two remarkable features of these bounds. The error term for the resolvent entries behaves −1/2 essentially as (Mη) , with an improvement near the edges where Im msc vanishes. The error bound for the Stieltjes transform, i.e., for the average of the diagonal resolvent entries, is one order better, (Mη)−1, but without improvement near the edge. The resolvent matrix element Gij may be viewed as the scalar product hei,Geji where ei is the i-th coordinate vector. In fact, a more general version of (6.31), the isotropic local law also holds for generalized Wigner matrices:

Theorem 6.8 (Isotropic law [21]). For a generalized Wigner matrix with polynomial decay (5.6) and for any fixed unit vector v, w we have s Im msc(z) 1 hv,G(z)wi − msc(z)hv, wi ≺ + , (6.34) Nη Nη

−1 −1+ω −1 uniformly in the set {z = E + iη : |E| 6 ω ,N 6 η 6 ω } for any fixed ω > 0. Isotropic law was first proven in [88] for Wigner matrices under a vanishing third moment condition. The general case in the form above was given in [21]. We will not prove this result here since it is not needed for the proof of Theorem 5.1. We will first prove a weaker version of Theorem 6.7 in Section 7, the so called weak local semicircle law, where the error term is not optimal. After that, we will prove Theorem 6.7 in Section 8 using Γ instead of Γ.e This yields the same estimate as given in Theorem 6.7 but only on a smaller set of the spectral parameter for which the argument is somewhat simpler. The proof of Theorem 6.7 for the entire domain will only be sketched in Section 9 and we refer the reader to the original paper for the complete version. Section 7 is included mainly for pedagogical reasons to introduce the ideas of continuity argument and self-consistent equations. Section 8 demonstrates how to use the vector self-consistent equations in a simpler setup. In Section 9 we sketch the analysis of the same self-consistent equation but splitting it on space orthogonal to the constant vector and exploiting a spectral gap. Finally Section 10 is devoted to a key technical lemma, the fluctuation averaging lemma.

We close this section with a short summary of the main developments that have gradually led to the local semicircle law Theorem 6.7 in its general form. In Section 8.2 we will demonstrate that all proofs of the local semicircle law rely on some version of a self-consistent equation. At the beginning this was a scalar equation for mN used in various forms by many authors, see e.g. Pastur [109], Girko [76] and Bai [10]. The self-consistent vector equation for vi = Gii − msc (see Section 8.2.2) first appeared in [69]. This allowed one to deviate from the identical distributions for hij and opened up the route to estimates on individual 2 resolvent matrix elements. Finally, the self-consistent matrix equation for E|Gxy| first appeared in [54] and it yielded the diffusion profile for the resolvent. In this book we only need the vector equation. The local semicircle law has three important features that have been gradually achieved. We explain them for the case when M is comparable with N, the general case is only a technical extension. First, the local law in the bulk holds down to the scale expressed by the lower bound η  N −1. This is the smallest −1 possible scale to control m(z), since at scale η . N a few individual eigenvalues strongly influence its behavior and m(z) is not close to a deterministic quantity. This optimal scale in the bulk of the spectrum was first established for Wigner matrices in a series of papers [60–62] with the extension to generalized Wigner matrices in [69]. Second, the optimal speed of convergence in the entrywise bound (6.31) is N −1/2, −1 while in the bound m − msc (6.32) it is N . These optimal N-dependences were first achieved in [68] by introducing the fluctuation averaging mechanism. Third, near the spectral edge the stability of the self- consistent equation deteriorates, manifested by the behavior of Γ(z) in (6.24) when κE is small. This can be compensated by the fact that the density is small near the edge. After many attempts and weaker results

32 in [64, 68, 69], the optimal form of this compensation was eventually found in [70], basically separating the analysis of the self-consistent equation onto the space orthogonal to constants. Since Γe does not deteriorate near the edge, see (6.24), the estimate becomes optimal.

6.5 Appendix: Behaviour of Γ and Γe and the proof of Lemma 6.3

In this appendix we give basic bounds on the parameters Γ and Γ.e Readers interested only in the Wigner −1 case may skip this section entirely; if sij = N , then the explicit formulas in (6.22) already suffice. As it turns out, the behaviour of Γ and Γe is intimately linked with the spectrum of S, more precisely with its spectral gaps. Recall that the spectrum of S lies in [−1, 1], with 1 being a simple eigenvalue.

Definition 6.9. Let δ− be the distance from −1 to the spectrum of S, and δ+ the distance from 1 to the ⊥ spectrum of S restricted to e . In other words, δ± are the largest numbers satisfying

S > −1 + δ−,S e⊥ 6 1 − δ+ .

For generalized Wigner matrices the lower and upper spectral gaps satisfy δ± > a with a constant a := min{Cinf ,Csup}, see (2.6). This simple fact follows easily by splitting

S = (S − aee∗) + aee∗ and noticing that the first term is (1 − a) times a , hence its spectrum lies in [−1 + a, 1 − a]. 2 −1 2 −1 Proof of Lemma 6.3. The lower bound on Γ in (6.24) follows from (1 − mscS) e = (1 − msc) e 2 combined with (6.12). For the upper bound, we first notice that 1 − mscS is invertible since −1 6 S 6 1 and |msc| < 1, see (6.11). Since msc and its reciprocal are bounded (6.11), it is sufficient to bound the inverse −2 of msc − S. Since the spectrum of S lies in the set [−1 + δ−, 1 − δ+] ∪ {1} ⊂ [−1 + a, 1 − a] ∪ {1}, and −2 |msc | > 1, we easily get 1 1 C C C a . (6.35) 2 2 2 6 −2 2 2 6 −2 6 2 1 − mscS ` →` msc − S ` →` min{a, |msc − 1|} |1 − msc|

∞ ∞ 2 In order to find the ` → ` norm, we solve (1 − mscS)v = u directly using (6.10):

2 1/2 kvk∞ = ku + mscSvk∞ 6 kuk∞ + kSk`1→`∞ kvk1 6 kuk∞ + N kSk`1→`∞ kvk2 1/2 1 kuk∞ + N kSk`1→`∞ kuk2 6 2 2 2 1 − mscS ` →`  CaNkSk`1→`∞  6 1 + 2 kuk∞. |1 − msc|

P 1/2 P 21/2 1/2 Here we used (6.35) and that kvk1 = i |vi| 6 N i |vi| and kuk2 6 N kuk∞. Since for generalized Wigner matrices NkSk`1→`∞ = N · maxij sij 6 Csup from (2.6), this proves

1 C 2 6 2 , 1 − mscS `∞→`∞ |1 − msc|

where the constant depends on Cinf and Csup. This proves the upper bound on Γ in (6.24). Finally, we bound Γ;e the lower bound was already given in (6.19). For the upper bound we follow the argument above but we restrict S to e⊥. Since the spectrum of this restriction lies in [−1 + a, 1 − a], we 2 2 −1 immediately get the bound C/a for the ` -norm of (1 − mscS) e⊥ in the right hand side of (6.35). This can be lifted to the same estimate for the `∞ norm. This completes the proof of the lemma.

33 −1 This simple proof of Lemma 6.3 used both spectral gaps and that sij = O(N ). Lacking these infor- mation in the general case, the following proposition gives explicit bounds on Γ and Γe depending on the

spectral gaps δ± in the general case. We recall the notations z = E + iη, κ = κE := |E| − 2 and define

( η κ + √ if |E| 6 2 θ ≡ θ(z) := √ κ+η (6.36) κ + η if |E| > 2 ,

−1 P Proposition 6.10. For the matrix elements of S we assume 0 6 sij = sji 6 M and j sij = 1. Then  there is a universal constant C such that the following holds uniformly in the domain z = E + iη : |E| 6 −1 10,M 6 η 6 10 , and in particular in any spectral domain D. (i) We have the estimate

1 C log N C log N √ 6 Γ(z) 6 2 6 2 . (6.37) C κ + η 1±msc min{η + E , θ} 1 − max± 2

(ii) In the presence of a gap δ− we may improve the upper bound to C log N Γ(z) 6 2 . (6.38) min{δ− + η + E , θ}

(iii) For Γe we have the bounds

−1 C log N C 6 Γ(e z) 6 2 . (6.39) min{δ− + η + E , δ+ + θ}

2 −1 2 −1 Proof. The first bound of (6.37) follows from (1 − mscS) e = (1 − msc) e combined with (6.12). In order to prove the second bound of (6.37), we write 1 1 1 2 = 2 , 1 − m S 2 1+mscS sc 1 − 2 and observe that 2 2 1 + mscS 1 ± msc max =: q . (6.40) 6 ± 2 `2→`2 2 Therefore

n0−1 2 n √ ∞ 2 n 1 X 1 + mscS X 1 + mscS 2 6 + N 1 − mscS ∞ ∞ 2 ∞ ∞ 2 2 2 ` →` n=0 ` →` n=n0 ` →` √ qn0 n + N 6 0 1 − q C log N , 6 1 − q

C0 log N where in the last step we chose n0 = 1−q for large enough C0. Here we used that kSk`∞→`∞ 6 1 and (6.11) to estimate the summands in the first sum. This concludes the proof of the second bound of (6.37). The third bound of (6.37) follows from the elementary estimates

2 2   1 − msc 2 1 + msc 2 η 6 1 − c(η + E ) , 6 1 − c (Im msc) + 6 1 − cθ (6.41) 2 2 Im msc + η for some universal constant c > 0, where in the last step we used Lemma 6.2.

34 The estimate (6.38) follows similarly. Due to the gap δ− in the spectrum of S, we may replace the estimate (6.40) with

2  2  1 + mscS 2 1 + msc 6 max 1 − δ− − η − E , . (6.42) 2 `2→`2 2 Hence (6.38) follows using (6.41). The lower bound of (6.39) was proved in (6.19). The upper bound is proved similarly to (6.38), except that (6.42) is replaced with

2 (  2 ) 1 + mscS 2 1 + msc 6 max 1 − δ− − η − E , min 1 − δ+ , . 2 e⊥ `2→`2 2

This concludes the proof of (6.39).

35 7 Weak local semicircle law

Before we prove the local semicircle law in the strong form Theorem 6.7, for pedagogical reasons we first prove the following weaker version whose proof is easier. For simplicity, in this section we consider the Wigner case, i.e., sij = 1/N and M = N. In the bulk, this weaker estimate is still effective for all η down to the smallest scales N −1+ε, but the power of 1/N in the error estimate is not optimal (1/2 instead of 1 in (6.32)). Near the edge the bound is even weaker; the power of 1/N is reduced to 1/4, indicating that this proof is not sufficiently strong near the edge.

Theorem 7.1 (Weak local semicircle law). Let z = E + iη and κ := |E| − 2 . Let H be a Wigner matrix, let −1 1 G(z) = (H − z) be its resolvent and set mN (z) = N Tr G(z). We assume that the single entry distribution −1+γ satisfies the decay condition (5.6). Choose any γ > 0. Then for z ∈ Dγ := {E + iη : |E| 6 10,N 6 η 6 10} we have n 1 1 o |m (z) − m (z)| ≺ min √ , . (7.1) N sc Nηκ (Nη)1/4 For definiteness, we present the proof for the Hermitian case, but all formulas below carry over to the other symmetry classes with obvious modifications.

7.1 Proof of the weak local semicircle law, Theorem 7.1 The proof is divided into four steps.

7.1.1 Step 1. Schur complement formula. The first step to prove the weak local semicircle law is the to use the Schur complement formula, which we state in the following lemma. The proof is an elementary exercise.

Lemma 7.2 (Schur formula). Let A, B, C be n × n, m × n and m × m matrices. We define (m + n) × (m + n) matrix D as AB∗ D := (7.2) BC and n × n matrix Db as Db := A − B∗C−1B. (7.3)

Then Db is invertible if D is invertible and for any 1 6 i, j 6 n, we have

−1 −1 (D )ij = (Db )ij (7.4) for the corresponding matrix elements.

Recall that Gij = Gij(z) denotes the matrix element of the resolvent

 1  Gij = . H − z ij

(i) Let H be the matrix where all matrix elements of H = (hab) in the i-th column and row are set to be zero:

(i) H ab := hab · 1(a 6= i) · 1(b 6= i), a, b = 1, 2,...,N.

In other words, H(i) is the i-th H[i] of H augmented to an N × N matrix by adding a zero row and column. Recall that the minor H[i] is an (N − 1) × (N − 1) matrix with the i-th row and column removed:

[i] Hab := hab, a, b 6= i.

36 Denote the Green function of H(i) by G(i)(z) = (H(i) − z)−1 which is again an N × N matrix. Notice that

 −1 (−z) if a = b = i (i)  Gab = 0 if exactly one of a or b equals i (7.5)  [i] [i] −1 Gab =( H − z)ab if a 6= i, b 6= i. We warn the reader that in some earlier papers on the subject a different convention was used, where H(i) and G(i) denoted the (N − 1) × (N − 1) minors (H[i] with the current notation) and their Green functions. The current convention simplifies several formulas although the mathematical contents of both versions are identical. With similar conventions, we can define G(ij) etc. The superscript in parenthesis for resolvents always means “after setting the corresponding row and column of H to be zero” (in some terminology, this procedure is also described as ”zero out the corresponding row and column”), in particular, by independence of matrix elements, this means that the matrix G(ij), say, is independent of the i-th and j-th row and column of H. This helps to decouple dependencies in formulae. Let i t a = (h1i, h2i,..., 0, . . . hNi) (7.6) be the i-th column of H, after setting the i-th entry zero. Using Lemma 7.2 for n = 1, m = N − 1, and (7.5), (7.6) we have 1 G = , (7.7) ii i (i) i hii − z − a · G a where i (i) i X (i) X [i] a · G a = hikGkl hli = hikGklhli. (7.8) k,l6=i k,l6=i (We sometimes use u · v instead of u∗v or (u, v) for the usual Hermitian scalar product.) We now introduce some notations: Definition 7.3 (Partial expectation and independence). Let X ≡ X(H) be a . For i ∈ {1,...,N} define the operations Pi and Qi through (i) PiX := E(X|H ) ,QiX := X − PiX.

We call Pi partial expectation in the index i. Moreover, we say that X is independent of a set T ⊂ {1,...,N} if X = PiX for all i ∈ T. We can decompose ai · G(i)ai into its expectation and fluctuation i (i) i  i (i) i a · G a = Pi a · G a + Zi, where  i (i) i Zi := Qi a · G a . (7.9) Since G(i) is independent of ai, so we need to compute expectations and fluctuations of quadratic functions. The expectation is easy

X (i) X (i) 1 X (i) P ai · G(i)ai = P ai G ai = P h G h  = G , i i k kl l i ik kl li N kk k,l k,l6=i k6=i  ¯  1 where in the last step we used that different matrix elements are independent, i.e., Pi hikhil = N δkl. The summations always run over all indices from 1 to N, apart from those that are explicitly excluded. We define

(i) 1 1 X (i) m (z) := Tr G[i](z) = G (z), N N − 1 N − 1 kk k6=i [i] (i) where we used Gkk = Gkk for k 6= i from (7.5). Hence we have the identity 1 1 G = = . (7.10) ii  i (i) i (i) hii − z − Pi a · G a − Zi 1  hii − z − 1 − N mN (z) − Zi

37 7.1.2 Step 2. Interlacing of eigenvalues

(i) We now estimate the difference between msc and mN . The first step is the following well known lemma. We include a short proof for completeness; for simplicity we consider the randomized setup with a continuous distribution to avoid multiple eigenvalues. The general case easily follows from standard approximation arguments. Lemma 7.4 (Interlacing of eigenvalues). Let H be a symmetric or Hermitian N × N matrix with continuous distribution. Decompose H as follows h a∗ H = , (7.11) a B

∗ [1] where a = (h12, . . . , h1N ) and B = H is the (N − 1) × (N − 1) minor of H obtained by removing the first row and first column from H. Denote by µ1 6 µ2 6 ... 6 µN the eigenvalues of H and λ1 6 λ2 6 ... 6 λN−1 the eigenvalues of B. Then with probability one the eigenvalues of B are distinct and the eigenvalues of H and B are interlaced: µ1 < λ1 < µ2 < λ2 < µ3 < ...... µN−1 < λN−1 < µN . (7.12)

Proof. Since matrices with multiple eigenvalues form a lower dimensional submanifold within the set of all Hermitian matrices, the eigenvalues are distinct almost surely. Let µ be one of the eigenvalues of H and let t v = (v1, . . . , vN ) be a normalized eigenvector associated with µ. From the eigenvalue equation Hv = µv and from (7.11) we find that

hv1 + a · w = µv1, and av1 + Bw = µw (7.13)

t with w = (v2, . . . , vN ) . From these equations we obtain

v1 X ξα w = (µ − B)−1av and thus (µ − h)v = a · (µ − B)−1av = , (7.14) 1 1 1 N µ − λ α α using the spectral representation of B, where we set √ 2 ξα = | Na · uα| ,

with uα being the normalized eigenvector of B associated with the eigenvalue λα. From the continuity of the distribution it also that v1 6= 0 almost surely and thus we have

1 X ξα µ − h = , (7.15) N µ − λ α α where ξα’s are strictly positive almost surely (notice that a and uα are independent). In particular, this shows that µ 6= λα for any α. In the open interval µ ∈ (λα−1, λα) the function

1 X ξα Φ(µ) := N µ − λ α α is strictly decreasing from ∞ to −∞, therefore there is exactly one solution to the equation µ − h = Φ(µ). Similar argument shows that there is also exactly one solution below λ1 and above λN−1. This completes the proof.

We can now compare the normalized traces of G and G[i]:

Lemma 7.5. Under the conditions of Lemma 7.4, for any 1 6 i 6 N, we have  1  C m (z) − 1 − m(i)(z) , η = Im z > 0. (7.16) N N N 6 Nη

38 Proof. With the notations of Lemma 7.4, let 1 1 F (x) := #{λ x},F (i)(x) := #{µ x} N j 6 N − 1 j 6 denote the normalized counting functions of the eigenvalues. The interlacing property of the eigenvalues of H and H[i] (see (7.12)) in terms of these functions means that

(i) sup |NF (x) − (N − 1)F (x)| 6 1. x Then, after integrating by parts,

 1  Z dF (x)  1  Z dF (i)(x) m (z) − 1 − m(i)(z) = − 1 − N N N x − z N x − z 1 Z NF (x) − (N − 1)F (i)(x) = dx N (x − z)2 1 Z dx C , (7.17) 6N |x − z|2 6 Nη

1 which proves (7.16). Note that the (7.16) would also hold without the prefactor (1 − N ) since we have the (i) −1 trivial bound |m | 6 η . We can now rewrite (7.10) into the following form, after summing over i:

1 X 1  1  m (z) = , with Ω := h − Z + O . (7.18) N N −z − m (z) + Ω i ii i Nη i N i At this stage we can explain the main idea for the proof of the local semicircle law (7.1). Equation (7.18) is the key self-consistent equation for mN , the Stieltjes transform of the empirical eigenvalue density of H. Notice that if Ωi were zero, then we would have the equation 1 m = − (7.19) m + z complemented with the side condition that Im m > 0. This is exactly the defining equation (6.8) of the Stieltjes transform of the semicircle density. In the next step we will give bounds on hii and Zi, and then we will effectively control the stability of (7.19) against small perturbations. This will give the desired estimate (7.1) on |mN − msc|.

7.1.3 Step 3. Large deviation estimate of the fluctuations.

The quantity Zi defined in (7.9) will be viewed as a random variable in the probability space of the i-th column. We will need an upper bound on it in large deviation sense, i.e., with a very high probability. Since it is a quadratic function of the independent matrix elements of the i-th column, standard large deviation theorems do not directly apply. To focus on the main line of the proof and to keep technicalities at minimum, we postpone the discussion of full version of the quadratic large deviation bounds to Section 7.2. However, in order to get an idea of the size of Zi, we compute its second moment as follows: " # 2 X X  (i)  (i)  (i)  (i)  Pi|Zi| = Pi hikGkl hli − Pi hikGkl hli hik0 Gk0l0 hl0i − Pi hik0 Gk0l0 hl0i . (7.20) k,l6=i k0,l06=i

Since Eh = 0, the non-zero contributions to this sum come from index combinations when all h and h are paired. For pedagogical simplicity, assume that Eh2 = 0, this holds, for example, if the distribution of the real and imaginary parts are the same. Then h factors in the above expression have to be paired in such a

39 0 0 way that hik = hik0 and hil = hil0 , i.e., k = k , l = l . Note that pairing hik = hil would give zero because the expectation is subtracted. The result is

1 X (i) m4 − 1 X (i) |Z |2 = |G |2 + |G |2, (7.21) Pi i N 2 kl N 2 kk k,l6=i k6=i √ 4 where m4 = E| Nh| is the fourth moment of the single entry distribution. The first term can be computed

1 X (i) 1 X [i] 1 X 1 1 X [i] 1  1  (i) |G |2 = |G |2 = (|G[i]|2) = Im G = 1 − Im m , (7.22) N 2 kl N 2 kl N 2 kk Nη N kk Nη N N k,l6=i k,l6=i k6=i k6=i where |G|2 = GG∗. In the middle step we have used the Ward identity valid for the resolvent R(z) = (A−z)−1 of any Hermitian matrix A: 1 1 |R(z)|2 = = Im R(z), z = E + iη. (7.23) |A − E|2 + η2 η We can estimate the second term in (7.21) by the general fact that the resolvent R = (A − z)−1 of any Hermitian matrix A, we have X 2 X 1 |Rkk| 6 2 , (7.24) |ζα − z| k α where ζα are the eigenvalues of A. To see this, let uα be the normalized eigenvectors. Then by spectral theorem 2 X |uα(k)| R = , kk ζ − z α α and thus we have 2 2 X 2 X X |uα(k)| |uβ(k)| |Rkk| 6 |ζα − z| |ζβ − z| k k α,β

X 1 X 2 X 2 X 1 6 2 |uα(k)| |uβ(k)| = 2 , |ζα − z| |ζα − z| α k β α

where we have used the Schwarz inequality and that {uβ} is an orthonormal basis. Applying this bound to [i] the Green function of H with eigenvalues µα, we have

N−1 1 X (i) 2 1 X [i] 2 1 1 X η 1  1  (i) 2 |Gkk| = 2 |Gkk| 6 2 = 1 − Im mN . (7.25) N N Nη N |µα − z| Nη N k6=i k6=i α=1

(i) By (7.16), we can estimate mN by mN . Thus the estimates (7.22) and (7.25) confirm that the size of Zi is roughly 1 p |Z | √ Im m (7.26) i . Nη N in the second moment sense. In Section 7.2 we will prove that this inequality actually holds in large deviation sense, i.e., we have 1 p |Z | ≺ √ Im m . (7.27) i Nη N

The diagonal entry hii can be easily estimated. Since the single entry distribution has finite moments (5.6), we have  ε −1/2 −εp P |hii| > N N 6 CpN

for each i and for any ε > 0. Hence we can guarantee that all diagonal elements hii simultaneously satisfy −1/2 |hii| ≺ N .

40 7.1.4 Step 4. Initial estimate at the large scales To control the error terms in the self-consistent equation (7.18) we need two inputs. First, from now on we assume that (7.27) holds. Since Nη is large, this implies that Zi is small provided that Im mN is bounded. Second, we need to ensure that the denominator in the right hand side of (7.18) does not become too small. Since the main term in this denominator is z + mN (z), our task is to show that 1 ≺ 1. (7.28) |z + mN (z)|

Notice that both inputs are in terms of the yet uncontrolled quantity mN ; they would be trivially available if mN were replaced with msc (see Lemma 6.2). Since the smallness of mN − msc is our goal, to break this apparently circular argument we will use a bootstrap strategy. The convenient bootstrap parameter is η. We first establish the result for large η in this section which is called the initial estimate. Then, in the next section, step by step we reduce the value η by using the control from the previous scale to estimate −1 Im mN and |z + mN (z)| . This control will use the large deviation bounds on Zi which hold with very high probability. Hence at each step an exceptional event of very small probability will have to be excluded. This is the main reason why the bootstrap argument is done in small discrete steps, although in essence this is a continuity argument. We will explain this important argument in more details in the next section. Both the initial estimate and the continuity argument use the following simple idea. By expanding Ωi in −1/2 the denominator in (7.18) and using (7.27) and hii ≺ N , we have 1 1

mN (z) + ≺ 2 max |Ωi| z + mN (z) |z + mN (z)| i s 1 h Im mN −1/2 −1i ≺ 2 + N + (Nη) (7.29) |z + mN (z)| Nη provided that Ωi can be considered as a small perturbation of z + mN (z), i.e. 1 max |Ωi|  1, (7.30) |z + mN (z)| i so that the expansion is justified. We can now use the following elementary lemma.

Lemma 7.6. Fix z = E + iη, with |E| 6 20, 0 < η 6 10, and set κ := |E| − 2 . Suppose that m satisfies the inequality 1 m + δ (7.31) z + m 6 for some δ 6 1. Then n 1 o Cδ √ min |m − msc(z)|, m − 6 √ 6 C δ. (7.32) msc(z) κ + η + δ

Proof. For δ 6 1 and |z| 6 20, (7.31) implies that |m| 6 22. Write (7.31) as 1 m + =: ∆, |∆| δ, z + m 6

−1 and subtract this from the equation msc + (z + msc) = 0. After some simple algebra we get h 1 i (m − msc) m − = ∆(m + z), (7.33) msc i.e., 1 |m − msc| m − 6 Cδ, (7.34) msc

41 for some fixed constant C, where we have used√ (6.11). 2 0 0 We separate two cases. If |1 − msc| 6 C δ for some large C , then we write the above inequality as 2 1 − msc |m − msc| m − msc − 6 Cδ. (7.35) msc √ √ 0 0 We claim that this implies |m − msc| 6 2C δ/|√msc|. Indeed, if |m − msc| > 2C δ/|msc| were true, 0 then the second factor in (7.35) were at least C δ/|msc|, so the left hand side of (7.35) were at least 0 2 0 2(C /|msc|) δ. Since√ msc  1,√ see (6.11), we would get a contradiction if C is large enough. Thus we proved 0 |m − msc| 6 2C δ/|msc| 6 √C δ in this case, which is√ in agreement with the first inequality in (7.32) since 2 0 √ the condition |1 − msc| 6 C δ also implies κ√+ η . δ by (6.12).√ 2 0 √ In the second case we have |1 − msc| > C δ, i.e., κ + η & δ, so for (7.32) we need to prove that √ −1 √ 1 2 |m − msc| . δ/ κ + η or |m − msc | . δ/ κ + η. If |m − msc| 6 2 |1 − msc|/|msc|, then the second factor 1 2 √ √ in (7.35) were at least 2 |1 − msc|/|msc|  κ + η, so we would immediately get |m − msc| . δ/ κ + η. We 1 2 √ are left with the case |m − msc| > 2 |1 − msc|/|msc|, i.e., |m − msc| & κ + η. Rewrite (7.35) as

2 1 1 − msc 1 m − + m − 6 Cδ, (7.36) msc msc msc and repeat the previous argument with interchanging the role of m and 1/m . We conclude that either sc sc √ |m − m−1| 1 |1 − m2 |/|m |, in which case we immediately get |m − m−1| δ/ κ + η, or |m − m−1| √ sc 6 2 sc sc √ sc . sc & κ + η. In the latter case, combining it with |m − msc| & κ + η from the previous argument, we would −1 2 0 0 get |m − msc||m − msc | & κ + η. Since κ + η & |1 − msc| > C δ, this would contradict (7.34) if C is large enough. This completes the proof of the lemma. Applying Lemma 7.6 in (7.29), we see that

n 1 o 1 n 1/2 maxi |Ωi|o min |mN (z) − msc(z)|, mN (z) − ≺ min max |Ωi| , √ , (7.37) msc(z) |z + mN (z)| i κ i.e., a good bound on Ωi directly yields an estimate on mN − msc provided a lower bound on |z + mN (z)| is −1 given and if the possibility that |mN (z) − msc (z)| is small can be excluded. −1/16 Using this scheme, we now derive the initial estimate, which is (7.1) for any η > N . To check (7.30), −1 −15/32 −1/16 we start with the trivial bound Im mN 6 η . By (7.27), we have Zi ≺ N for η > N . Hence we have −15/16 −15/32 max |Ωi| max |Zi| + |hii| + CN ≺ N . (7.38) i 6 i −1/16 −1/16 Since Im(z + mN (z)) > η > N , (7.30) is satisfied for η > N . Thus (7.29) implies that 1 1 −15/32 −11/32 mN (z) + ≺ 2 O(N ) ≺ N (7.39) z + mN (z) |z + mN (z)|

for any η N −1/16. The precise exponents do not matter here, our goal is to find some η = N −c, such that > 0 the last equation holds with an error N −c for some c, c0 > 0. −1/16 Applying Lemma 7.6 we obtain, for any η > N , that either

−11/64 1 −11/64 |mN (z) − msc(z)| ≺ N , or mN (z) − ≺ N . (7.40) msc(z)

However, the second option is excluded since Im mN > 0 and thus

1 1 −1/16 mN (z) − > Im mN − Im > c Im msc(z) > cη > cN , (7.41) msc(z) msc(z) where we used (6.13). This proves that

−11/64 |mN (z) − msc(z)| ≺ N . (7.42)

42 Once we know that |mN (z) − msc(z)| is small, we can simply perturb the relation 1 = |msc(z)| 6 1 z + msc(z)

from Lemma 6.2 to obtain 1 |mN (z)| 6 C, and 6 C. (7.43) z + mN (z)

−1/2 Thus we can bound Ωi from (7.27) and hii ≺ N by 1 max |Ωi| ≺ √ , (7.44) i Nη and, instead of (7.39), we get the stronger bound

1 1 mN (z) + ≺ √ . z + mN (z) Nη

−1/2 Applying (7.32) once again, with δ = (Nη) , and using (7.41) to exclude the possibility that mN is close to 1/msc, we obtain the better bound  1 1  |m (z) − m (z)| ≺ min , √ , for any η N −1/16, (7.45) N sc (Nη)1/4 Nηκ >

−1/16 which is exactly (7.1) for η > N .

7.1.5 Step 5. Continuity argument and completion of the proof

−1/16 With (7.1) proven for any η > N , we now proceed to reduce the scale of η, while E, the real part of z, −4 −1 is kept fixed. To do this, choose another scale η1 = η − N slightly smaller than η (but still η1 > N ). Since X 1 |m0 (z)| N −1 η−2, (7.46) N 6 |λ − z|2 6 α α we have the deterministic relation

−2 |mN (z1) − mN (z)| 6 N , z = E + iη, z1 = E + iη1. (7.47)

−2 Hence (7.1) holds as well for η1 instead of η since N is smaller than all our error terms. But we cannot rely on this idea for the obvious reason that we would need to apply this procedure at least N 4 times and the many small errors would accumulate. We need to show that the estimate in (7.1) does not deteriorate even by the small amount N −2. Since the definition of ≺ includes additional factors N ε, it is not sufficiently sensitive to track such . For the continuity argument we will now abandon the formalism of stochastic domination and we go back to tracking exceptional sets precisely. In Section 8 we will setup a continuity scheme fully within the framework of stochastic domination, but here we complete the proof in a more elementary way. γ The idea is to show that for any small exponent ε ∈ (0, 10 ) there is a constant C1 large enough such that if n o  1 1  S(z) := min |m (z) − m (z)|, |m (z) − m−1(z)| C N ε min , √ (7.48) N sc N sc 6 1 (Nη)1/4 Nηκ in a set A in the probability space, then we not only have the bound

 1 1  S(z ) C N ε min , √ + 2N −2 (7.49) 1 6 1 (Nη)1/4 Nηκ

43 that trivially follows from (7.47), but we also have that

 1 1  S(z ) C N ε min , √ (7.50) 1 6 1 1/4 (Nη1) Nη1κ holds with the same C1 and ε in a set A1 ⊂ A with

−D P (A \ A1) 6 N for any D > 0. (7.51) Thus the deterioration of the estimate from (7.47) to (7.50) can be avoided at least with a very high −1+5ε probability. To see this, we note that in the regime η > N the bound (7.49) implies that in the set A we have | Im mN (z1)| 6 |mN (z1)| 6 |msc(z1)| + o(1) 6 2, (7.52) and 1 6 2. (7.53) |z1 + mN (z1)| −1 −1 −1 The last estimate is a perturbative consequence of the bounds |z + msc(z)| 6 1 and |z + msc (z)| 6 C, where we used either

|(z1 + mN (z1)) − (z + msc(z))| 6 |mN (z1) − mN (z)| + |z1 − z| + |mN (z) − msc(z)| = o(1) (7.54) or −1 −1 |(z1 + mN (z1)) − (z + msc (z))| 6 |mN (z1) − mN (z)| + |z1 − z| + |mN (z) − msc (z)| = o(1);

one of which follows from (7.47) and (7.48). Let A1 be the intersection of three sets: A, the set on which − 1 +ε |hii| 6 N 2 holds and the set that (7.27) holds at the scale η1, i.e., at z = E + iη1. Hence, together with (7.53), the condition (7.30) holds in A1. Now we can use (7.37) together with (7.52) and (7.53) to prove that (7.50) holds if C1 is chosen large enough. At each step, we lose a set of probability N −D due to checking (7.27) at that scale. The estimate from the previous scale is used only to check (7.52), (7.53) and (7.30). Finally, the constant C1 in the final bound of S(z1) comes from the constant in (7.32) which is uniform in η. Therefore C1 does not deteriorate when passing to the next scale. The only price to pay is the loss of an exceptional set of probability N −D. Since the number of steps are N 4, while D can be arbitrary large, this loss is affordable. Since the exponent ε > 0 was arbitrary, this proves n o n 1 1 o S(z) = min |m (z) − m (z)|, |m (z) − m−1(z)| ≺ min √ , (7.55) N sc N sc Nηκ (Nη)1/4

−4 for any fixed z ∈ Dγ . Using this relation for a discrete net Db γ := Dγ ∩ N (Z + iZ) and a union bound, we obtain that (7.55) holds simultaneously for all z ∈ Db γ . Using the Lipschitz continuity of S(z) (with a 2 Lipschitz constant at most N ), we get (7.55) simultaneously for all z ∈ Dγ . Finally, we need to show that the estimate holds not only for the minimum, but for |mN (z) − msc(z)|. We have already seen this for any −1/16 η > N in (7.45). Fixing E = Re z, we consider mN (E +iη), msc(E +iη) and S(E +iη) as functions of η. Since these are continuous functions, by reducing η we obtain that mN (E + iη) remains close to msc(E + iη) as long as the right hand side (7.55) is larger than the difference

−1 √ msc(E + iη) − msc (E + iη)  κE + η. √ Since the right hand side of (7.55) increases as η decreases, while the separation bound κ + η decreases, √ E once η is so small that the separation bound κE + η becomes smaller than the right hand side of (7.55), −1 the difference between msc and msc remains irrelevant for any smaller η. Thus (7.1) holds on the entire Dγ . This proves the weak local semicircle law, i.e., Theorem 7.1 except the large deviation estimate (7.27) which we will prove in the next subsection.

44 7.2 Large deviation estimates

Finally, in order to estimate large sums of independent random variables as in (7.8) and later in (8.2), we will need a large deviation estimate for linear and quadratic functionals of independent random variables. The case of the linear functionals is standard. Quadratic functionals were considered in [62,69]; the current formulation is taken from [52].

(N) (N) Theorem 7.7 (Large deviation bounds). Let Xi and Yi be independent families of random variables (N) (N) (N) and aij and bi be deterministic; here N ∈ N and i, j = 1,...,N. Suppose that all entries Xi and (N) Yi are independent and satisfy

2 p 1/p EX = 0 , E|X| = 1 , kXkp := (E|X| ) 6 µp (7.56) for all p ∈ N and some constants µp. Then we have the bounds

 1/2 X X 2 biXi ≺ |bi| , (7.57) i i  1/2 X X 2 aijXiYj ≺ |aij| , (7.58) i,j i,j  1/2 X X 2 aijXiXj ≺ |aij| . (7.59) i6=j i6=j

Our proof in fact generalizes trivially to arbitrary multilinear estimates for quantities of the form P∗ a (u)X (u) ··· X (u), where the star indicates that the summation indices are constrained i1,...,ik i1...ik i1 ik to be distinct. For our purposes the most important inequality is the last one (7.59). It confirms the intuition from P Section 7.1.3 that for sums of the form i6=j aijXiXj the second moment calculation gives the correct order of magnitude and it can be improved to a high probability statement in large deviation sense. Indeed, the second moment calculation gives

2 X X X X  2  X 2 0 0 ¯ 0 ¯ 0 E aijXiXj = aija¯i j XiXjXi Xj = |aij| + aija¯ji 6 2 |aij| , i6=j i6=j i06=j0 i6=j i6=j since only the pairings i = i0, j = j0 or i = j0, j = i0 give nonzero contribution. To prepare the proof of Theorem 7.7, we first recall the following version of the Marcinkiewicz-Zygmund inequality.

Lemma 7.8. Let X1,...,XN be a family of independent random variables each satisfying (7.56) and suppose that the family (bi) is deterministic. Then

 1/2 X 1/2 X 2 biXi (Cp) µp |bi| . (7.60) 6 i p i

2 P 2 Proof. The proof is a simple application of Jensen’s inequality. Writing B := j|bi| , we get, by the

45 classical Marcinkiewicz-Zygmund inequality [126] in the first line, that

p  1/2 p X p/2 X 2 2 biXi (Cp) |bi| |Xi| 6 i p i p  2 p/2 X |bi| = (Cp)p/2Bp |X |2 E B2 i i  2  X |bi| (Cp)p/2Bp |X |p 6 E B2 i i p/2 p p 6 (Cp) B µp .

Next, we prove the following intermediate result.

Lemma 7.9. Let X1,...,XN ,Y1,...,YN be independent random variables each satisfying (7.56), and suppose that the family (aij) is deterministic. Then for all p > 2 we have  1/2 X 2 X 2 aijXiYj Cp µ |aij| . 6 p i,j p i,j Proof. Write X X X aijXiYj = bjYj , bj := aijXi . i,j j i

Note that (bj) and (Yj) are independent families. By conditioning on the family (bj), we therefore get from Lemma 7.8 and the triangle inequality that 1/2  1/2 X 1/2 X 2 1/2 X 2 bjYj (Cp) µp |bj| (Cp) µp kbjk . 6 6 p j p j p/2 j Using Lemma 7.8 again, we have  1/2 1/2 X 2 kbjkp 6 (Cp) µp |aij| . i This concludes the proof.

Lemma 7.10. Let X1,...,XN be independent random variables each satisfying (7.56), and suppose that the family (aij) is deterministic. Then we have  1/2 X 2 X 2 aijXiXj Cp µ |aij| . 6 p i6=j p i6=j

Proof. The proof relies on the identity (valid for i 6= j) 1 X 1 = 1(i ∈ I)1(j ∈ J) , (7.61) ZN ItJ=NN

N−2 where the sum ranges over all partitions of NN = {1,...,N} into two sets I and J, and ZN := 2 is independent of i and j. Moreover, we have X 1 = 2N − 2 , (7.62)

ItJ=NN

46 where the sum ranges over nonempty subsets I and J. Now we may estimate

 1/2 X 1 X X X 1 X 2 X 2 aijXiXj 6 aijXiXj 6 Cp µp |aij| , p ZN p ZN i6=j ItJ=NN i∈I j∈J ItJ=NN i6=j

where we used that, for any partition I t J = NN , the families (Xi)i∈I and (Xj)j∈J are independent, and hence the Lemma 7.9 is applicable. The claim now follows from (7.62). As remarked above, the proof of Lemma 7.10 may be easily extended to multilinear expressions of the form P∗ a X ··· X . i1,...,ik i1...ik i1 ik We may now complete the proof of Theorem 7.7. Proof of Theorem 7.7. The proof is a simple application of Chebyshev’s inequality. Part (i) follows from Lemma 7.8, part (ii) from Lemma 7.10, and part (iii) from Lemma 7.9. We give the details for part (iii). For ε > 0 and D > 0 we have " # "  1/2 # X ε X ε X 2 ε/2 P aijXiXj N Ψ P aijXiXj N Ψ , |aij| N Ψ > 6 > 6 i6=j i6=j i6=j " 1/2 # X 2 ε/2 + P |aij| > N Ψ i6=j "  1/2# X ε/2 X 2 −D−1 P aijXiXj N |aij| + N 6 > i6=j i6=j Cpµ2 p p + N −D−1 6 N ε/2

P 21/2 for arbitrary D. In the second step we used the definition of i6=j |aij| ≺ Ψ with parameters ε/2 and D + 1. In the last step we used Lemma 7.10 by conditioning on (aij). Given ε and D, there is a large enough p such that the first term on the last line is bounded by N −D−1. Since ε and D were arbitrary, the proof is complete. The claimed uniformity in u in the case that aij and Xi depend on an index u also follows from the above estimate.

47 8 Proof of the local semicircle law

In this section we start the proof of the local semicircle law, Theorem 6.7. This section can be read indepen- dently of the previous Section 7, so some basic definitions and facts are repeated for convenience. For those readers who may wish to compare this argument with the proof the weak law, Theorem 7.1, we mention that the basic strategy is similar except that we use an additional mechanism that we call fluctuation averaging. We point out that the first estimate in (7.29) was not optimal; here we estimated the average of Ωi by its maximum. Since Ωi’s are almost centred random variables with weak correlation, a more precise estimate of their average leads to a considerable improvement. We now recall that the basic steps to prove the weak law were

1) self consistent equation for mN ;

(i) 2) interlacing of eigenvalues to compare mN and mN ;

3) quadratic large deviation estimate for the error term Zi; 4) initial estimate for large η;

5) extend the estimate to small η by the continuity argument.

Exploiting the fluctuation averaging mechanism requires a control on the individual matrix elements of −1 the resolvent G instead of just its normalized trace, mN = N Tr G. Therefore, instead of considering the scalar equation for mN , we will consider the vector self-consistent equation for the diagonal elements Gii of the Green function and investigate the stability of this equation. There is no direct analogue of the interlacing property for Gii, thus we introduce new resolvent decoupling identities to compare the resolvents of the original matrix and its minors. We will still use the quadratic large deviation estimate to bound the error term. We also use a continuity argument similar to the one given in the weak law to derive a crude estimate on Gii (formulated in terms of a certain dichotomy), but instead of making many small steps and keeping track of the exceptional sets, we follow a genuinely continuous approach within the framework of the stochastic domination. Finally, we use the fluctuation averaging lemma (Lemma 8.9) to boost the error estimate by one order and an iteration argument to prove the local semicircle law, Theorem 6.7. We now start the rigorous proof. We will largely follow the presentation in [55].

8.1 Tools In this subsection we collect some basic definitions and facts. First we repeat the Definition 7.3 of the partial expectation:

Definition 8.1 (Partial expectation and independence). Let X ≡ X(H) be a random variable. For i ∈ {1,...,N} define the operations Pi and Qi through

(i) PiX := E(X|H ) ,QiX := X − PiX.

We call Pi partial expectation in the index i. Moreover, we say that X is independent of a set T ⊂ {1,...,N} if X = PiX for all i ∈ T. Next, we define matrices with certain columns and rows zeroed out.

Definition 8.2. For T ⊂ {1,...,N} we set H(T) to be the N × N matrix defined by

( ) (H T )ij := 1(i∈ / T)1(j∈ / T)hij , i, j = 1, 2,...,N.

Moreover, we define the resolvent of H(T) and its normalized trace through 1 G(T)(z) := (H(T) − z)−1 , m(T)(z) := Tr G(T)(z). ij ij N

48 We also set the notation ( ) XT X := .

i i : i/∈T These definitions are the natural generalizations of H(i) and G(i) introduced in Section 7.1.1. In particular, notice that H(T) is the matrix obtained by setting all rows and columns in T to be zero. This is different from considering the minors by removing the columns and rows in T. Similarly, G(T) is still an N ×N matrix (T) −1 (T) ({i}) (i) with Gii = −z for i ∈ T and Gij = 0 if i ∈ T and j 6∈ T. We will denote G simply by G and similarly for a few more indices. This is consistent with the notations we used in the earlier chapters. The following resolvent decoupling identities form the backbone of all of our calculations. The idea behind them is that a resolvent matrix element Gij depends strongly on the i-th and j-th columns of H, but weakly on all other columns. The first identity determines how to make a resolvent matrix element Gij independent of an additional index k 6= i, j. The second identity expresses the dependence of a resolvent matrix element Gij on the matrix elements in the i-th or in the j-th column of H. We added a third identity that relates sums of off-diagonal resolvent entries with a diagonal one. The proofs are elementary.

Lemma 8.3 (Resolvent decoupling identities). For any real or complex Hermitian matrix H and T ⊂ {1,...,N} the following identities hold.

i) First resolvent decoupling identity [69]: If i, j, k∈ / T and i, j 6= k then G(T)G(T) (T) (Tk) ik kj Gij = Gij + . (8.1) (T) Gkk

ii) Second resolvent decoupling identity [52]: If i, j∈ / T satisfy i 6= j then

(Ti) (Tj) (T) (T) X (Ti) (T) X (Tj) Gij = −Gii hikGkj = −Gjj Gik hkj , (8.2) k k where the superscript in the summation means omission, e.g., the summation in the first sum runs over all k∈ / T ∪ {i}. iii) [Ward identity] For any T ⊂ {1,...,N} we have

X ( ) 1 ( ) |G T |2 = Im G T . (8.3) ij η ii j

Proof. We will prove this lemma only for T= 0, the general case is a straightforward modification. We first consider the (8.2). Recall the resolvent expansion stating that for any two matrices A and B, 1 1 1 1 1 1 1 = − B = − B , (8.4) A + B A A + B A A A A + B provided that all the matrix inverses exist. To obtain the first formula in (8.2), we use the first resolvent identity (8.4) at the (ij)-th matrix element (i) (i) (i) −1 with A = H − z and B = H − H . Since Gij = (A )ij = 0 if i 6= j, we immediately have

X (i) X (i) X (i) Gij = − GiihikGkj − GikhkiGij = −Gii hikGkj , i 6= j. (8.5) k k k6=i The second identity in (8.2) follows in the same way by using the second identity in (8.4). To prove the identity (8.1), we let A = H(k) − z and B = H − H(k). Then from the first formula in (8.4) we have

(k) X (k) (k) GikGkj Gij = Gij − Gik hk`G`j = Gij + , (8.6) Gkk `6=k

49 and in the second step we used (8.2). Finally, the Ward identity (8.3), already used in (7.23), is well known. It follows from the spectral decomposition for H with eigenvalues λα and eigenvectors uα, namely

2 2 X X X |uα(i)| 1 X |uα(i)| 1 |G |2 = G G∗ = [|G|2] = = Im = Im G . ij ij ji ii |λ − z|2 η λ − z η ii j j α α α α

8.2 Self-consistent equations on two levels Using the notation from the previous section, the Schur formula (Lemma 7.2) can be written as

(Ti) 1 X (Ti) = hii − z − hikG hli , (8.7) (T) kl Gii k,l

where i∈ / T ⊂ {1,...,N}. We can take T = ∅ to have 1 G = . (8.8) ii P(i) (i) hii − z − k,l hikGkl hli The partial expectation with respect to the index i in the denominator can be computed explicitly since (i) hikhli is independent of Gkl and its expectation is nonzero only if k = l. We thus get

(i) (i) (i) (i) h X (i) i X (i) X X GikGki X X GikGki Pi hikGkl hli = sikGkk = sikGkk − sik = sikGkk − sik , Gii Gii k,l k k k k k

where in the second step we used (8.1). We will compare (8.8) with the defining equation of msc: 1 msc = , (8.9) −z − msc so we introduce the notation for the difference

vi := Gii − msc .

Recalling (6.1), we get the following system of self-consistent equations for vi 1 vi = P  − msc , (8.10) −z − msc − k sikvk − Υi where

(i) X GikGki h X (i) i Υi := Ai + hii − Zi ,Ai := sik ,Zi := Qi hikGkl hli . (8.11) Gii k k,l

All these quantities depend on z, but we omit it in the notation. We will show that Υ is a small error term. This is clear about hii by (6.27). The term Ai will be small since off-diagonal resolvent entries are small. Finally, Zi will be small by a large deviation estimate (7.59) from Theorem 7.7. Before we present more details, we heuristically show the power of this new system of self-consistent equations (8.10) and compare it with the single self-consistent equation used in Section 7 and, in fact, used in all previous literature on the resolvent method for Wigner matrices.

50 8.2.1 A scalar self-consistent equation Introduce the notation 1 X [a] = a (8.12) N i i N for the average of a vector (ai)i=1. Consider the standard Wigner case, sij = 1/N. Then X 1 X s v = v = [v] = m − m . ik k N k N sc k k

Neglecting Υi in (8.10) and taking the average of this relation for each i = 1, 2,...,N, we get 1 [v] ≈ − msc . (8.13) −z − msc − [v]

Recall the (8.9), the defining equation of msc: 1 msc = . −z − msc Using Lemma 7.6, which asserts that this equation is stable under small perturbations, at least away from the spectral edges z = ±2, we conclude [v] ≈ 0 from (8.13). This means that mN ≈ msc and hence for the empirical density %N ≈ %sc, i.e., we obtained the Wigner’s original semicircle law. This is exactly the idea we used in Section 7 except that the interlacing argument is replaced by Lemma 8.3 this time.

8.2.2 A vector self-consistent equation 1 If we are interested in individual resolvent matrix elements Gii instead of their average, N Tr G, then the scalar equation (8.13) discussed in the previous section is not sufficient. We have to consider (8.10) as a system of equations for the components of the vector v = (v1, . . . , vN ). In order to analyse it, we will linearize this system of equations. There are two possible linearizations depending on how we expand the error terms. P  Linearization I. If we know that k sikvk − Υi is small, we can expand the denominator in (8.10) to have

2 X  3 X 2 X 3 vi = msc sikvk − Υi + msc sikvk − Υi + O sikvk − Υi , (8.14) k k k

4 where we have used the defining equation (8.9) for msc, and used that |msc| 6 1 in the error term. After rearranging (8.14) we get

2 2 3 X 2 X 3 [(1 − mscS)v]i = Ei := −mscΥi + msc sikvk − Υi + O sikvk − Υi , (8.15) k k

Linearization II. If we know that vi is small, then it is useful to rewrite (8.10) into the following form: X 1 1 − sikvk + Υi = − , (8.16) msc + vi msc k

where we have also used the defining equation of msc. If |vi| = o(1), then we expand vi to the second order 2 and multiply both sides by msc to obtain

2 2 1 2 3 [(1 − mscS)v]i = Ei := −mscΥi + vi + O(|vi| ) , (8.17) msc

2 where we have estimated |msc| > c in the last term (see Lemma 6.2).

51 2 Notice that the definitions of Ei in (8.15) and (8.18) are different, although their leading behavior −mscΥi 2 is the same. In both cases we can continue the analysis by inverting the operator (1 − mscS) to obtain 1 1 v = E, hence kvk kEk = ΓkEk , (8.18) 2 ∞ 6 2 ∞ ∞ 1 − mscS 1 − mscS ∞→∞

and this relation shows how the quantity Γ, defined in (6.17), emerges. If the error term is indeed small and Γ is bounded, then we obtain that kvk∞ = max |Gii − msc| is small. While the expansion logic behind equations (8.14) and (8.17) is the same and the resulting formulae are very similar, the structure of the main proof depends on which linearization of the self-consistent equation is used. In both cases we need to derive an a-priori bound to ensure that the expansion is valid. Intuitively, the P smallness of k sikvk − Υi seems easier than that of vi since both terms are averaged quantities and extra averaging typically helps. But these are random objects and every estimate comes with an exceptional set in the probability space where it does not hold. It turns out that on the technical level it is worth minimizing the bookkeeping of these events and this reason favors the second version of the linearization which operates with controlling a single quantity, maxi |vi|. In this book, therefore, we will follow the linearization (8.17). We remark that the other option was used in [69, 70] which required first proving the weak semicircle law, Theorem 7.1, to provide the necessary a priori bound. The linearization (8.17) circumvents this step and the a priori bound on vi will be proved directly.

8.3 Proof of the local semicircle law without using the spectral gap

In this section we prove a restricted version of Theorem 6.7, namely we replace threshold ηeE with a larger threshold ηE defined as ( ) 1  M −γ M −2γ  : : ηE = min η 6 min 3 , 4 holds for all ξ > η . (8.19) Mξ Γ(E + iξ) Γ(E + iξ) Im msc(E + iξ)

This definition is exactly the same as (6.28), but Γe is replaced with the larger quantity Γ, in other words we do not make use of the spectral gap in S. This will pedagogically simplify the presentation but it will prove the estimates in Theorem 6.7 only for the η > ηE regime. In Section 9 we will give the proof for the entire η > ηeE regime. We recall Lemma 6.3 showing that there is no difference between Γ and Γe for generalized Wigner matrices away from the edges (both are of order 1), so readers interested in the local semicircle law only in the bulk should be content with the simpler proof. Near the spectral edges, however, there is a substantial difference. Note that even in the Wigner case (see (6.22)), ηE is much larger near the spectral edges than the optimal threshold ηeE. For generalized Wigner matrices, while ηeE  1/N, the threshold ηE 3/2 is determined by the relation NηE(κE + ηE)  1, where κE = |E| − 2 is the distance of E from the spectral edges. We stress, however, that the proof given below does not use any model-specific upper bound on Γ, such as (6.22) or (6.24), only the trivial lower and upper bounds, (6.19) and (6.20), are needed. The actual size of Γ enters only implicitly by determining the threshold ηE. This makes the argument applicable to a wide class of problems beyond generalized Wigner matrices, including band matrices, see [55].

Definition 8.4. We call a deterministic nonnegative function Ψ ≡ Ψ(N)(z) an admissible control parameter if we have −1/2 −c cM 6 Ψ 6 M (8.20) for some constant c > 0 and large enough N. Moreover, after fixing a γ > 0, we call any (possibly N- dependent) subset (N) n −1 o D = D ⊂ z : |E| 6 10,M 6 η 6 10 a spectral domain.

52 In this section we will mostly use the spectral domain n o S := z : |E| 6 10, η ∈ [ηE, 10] , (8.21)

where we note that 1 η M −1+γ (8.22) E > 8 using the lower bound Γ > c, from (6.19), in the definition (8.19). Define the random control parameters

Λo := max|Gij| , Λd := max|Gii − msc| , Λ := max(Λo, Λd) . (8.23) i6=j i where the letters d and o refer to diagonal and off-diagonal elements. In the typical regime we will work, all these quantities are small. The key quantity is Λ and we will develop an iterative argument to control it. We first derive an estimate of Λo + |Υi| in terms of Λ. This will be possible only in the event when Λ −c is already small, so we will need to introduce an indicator function φ = 1(Λ 6 M ) with some small c. More generally, we will consider any indicator function φ so that φΛ ≺ M −c. Notice that this is a somewhat −c weaker concept than φΛ 6 M (even if the exponent c is slightly adjusted), but it turns out to be more flexible since algebraic manipulations involving ≺ (Proposition 6.5) can be directly used.

8.3.1 Large deviation bounds on Λo and Υi Lemma 8.5. The following statements hold for any spectral domain D. Let φ be the indicator function of some (possibly z-dependent) event. If φΛ ≺ M −c for some c > 0 holds uniformly in z ∈ D, then s Im m + Λ φΛ + |Z | + |Υ | ≺ sc (8.24) o i i Mη uniformly in z ∈ D. Moreover, for any fixed (N-independent) η > 0 we have

−1/2 Λo + |Zi| + |Υi| ≺ M (8.25) uniformly in z ∈ {w ∈ D : Im w = η}. In other words, (8.24) means that s Im m + Λ Λ + |Z | + |Υ | ≺ sc , (8.26) o i i Mη on the event whereΛ ≺ M −c has been apriori established. −c Proof of Lemma 8.5. We first observe that φΛ ≺ M  1 and the positive lower bound |msc(z)| > c implies that φ ≺ 1. (8.27) |Gii| A simple iteration of the expansion formula (8.1) concludes that

(T) −c (T) φ φ|Gij | ≺ M , for i 6= j, φ|Gii | ≺ 1, ≺ 1 (8.28) (T) |Gii | for any subset T of fixed cardinality. We begin with the first statement in Lemma 8.5. First we estimate Zi, which we split as

(i) (i)

X 2  (i) X (i) φ|Zi| 6 φ |hik| − sik Gkk + φ hikGkl hli . (8.29) k k6=l

53 (i) N We estimate each term using Theorem 7.7 by conditioning on G and using the fact that the family (hik)k=1 is independent of G(i). By (7.57) the first term of (8.29) is stochastically dominated by

(i) 1/2 h X 2 (i) 2i −1/2 φ sik|Gkk| ≺ M , k

where (8.28), (6.3) and (6.1) were used. For the second term of (8.29) we apply (7.59) from Theorem 7.7 1/2 (i) 1/2 −1/2 with akl = sik Gkl sli and Xk = sik hik. We find

(i) (i) (i) X (i) 2 φ X (i) 2 φ X (i) Im msc + Λ φ sik G sli sik G = sik Im G ≺ , (8.30) kl 6 M kl Mη kk Mη k,l k,l k where in the last step we used (8.1) and the estimate 1/Gii ≺ 1. Thus we get s Im m + Λ φ|Z | ≺ sc , (8.31) i Mη where we absorbed the bound M −1/2 on the first term of (8.29) into the right-hand side of (8.31). Here we only needed to use Im msc(z) > cη as follows from an explicit estimate, see (6.13). Next, we estimate Λo. We can iterate (8.2) once to get, for i 6= j,

(i) (ij) ! X (i) (i) X (ij) Gij = −Gii hikGkj = −GiiGjj hij − hikGkl hlj . (8.32) k k,l

−1/2 The term hij is trivially O≺(M ). In order to estimate the other term, we invoke (7.58) from Theorem 1/2 (ij) 1/2 −1/2 −1/2 7.7 with akl = sik Gkl slj , Xk = sik hik, and Yl = slj hlj. As in (8.30), we find

(i) X (ij) 2 Im msc + Λ φ sik G slj ≺ , kl Mη k,l

and thus s Im m + Λ φΛ ≺ sc , (8.33) o Mη

−1/2 where we again absorbed the term hij ≺ M into the right-hand side. In order to estimate Ai and hii in the definition of Υi, we use (8.28) to get s s Im m Im m + Λ φ[|A | + |h |] ≺ φΛ2 + M −1/2 φΛ + C sc ≺ sc , i ii o 6 o Mη Mη

where the second step follows from Im msc > cη. Collecting (8.31), (8.33), this completes the proof of (8.24). (i) (ij) The proof of (8.25) is almost identical to that of (8.24). The quantities Gkk and Gkk are estimated by the trivial deterministic bound η−1 = O(1). We omit the details.

8.3.2 Initial bound on Λ In this subsection, we prove an initial bound asserting that Λ ≺ M −1/2 holds for z with a large imaginary part η. This bound is rather easy to get after we have proved in (8.25) that the error term Υi in the self-consistent equation (8.16) is small.

54 Lemma 8.6. We have Λ ≺ M −1/2 uniformly in z ∈ [−10, 10] + 2i.

Proof. We shall make use of the trivial bounds

(T) 1 1 1 1 G = , |msc| = , (8.34) ij 6 η 2 6 η 2 where the last inequality follows from the fact that msc is the Stieltjes transform of a probability measure. From (8.25) we get −1/2 Λo + |Zi| ≺ M . (8.35) Moreover, we use (8.1) and (8.32) to estimate

(i) (ij) G G X ij ji −1 X (i) X (ij) −1/2 |Ai| 6 sij 6 M + sij GjiGjj hij − hikGkl hlj ≺ M , Gii j j k,l

where the last step follows by using (7.58), exactly as the estimate of the right-hand side of (8.32) in the −1/2 proof of Lemma 8.5. We conclude that |Υi| ≺ M . Next, we write (8.16) as m P s v − Υ  v = sc k ik k i . (8.36) i −1 P msc − k sikvk + Υi −1 Using |msc | > 2 and |vk| 6 1 as follows from (8.34), we find

−1 X −1/2 m + sikvk − Υi 1 + O≺(M ) . sc > k

Using |msc| 6 1/2 and taking the maximum over all i in (8.36), we therefore conclude that

Λ + O (M −1/2) Λ Λ d ≺ = d + O (M −1/2) , (8.37) d 6 −1/2 ≺ 2 + O≺(M ) 2

from which the claim follows together with the estimate on Λo from (8.35).

8.3.3 A rough bound on Λ and a continuity argument The next step is to get a rough bound to Λ by a continuity argument.

Proposition 8.7. We have Λ ≺ M −γ/3Γ−1 uniformly in the domain S defined in (8.21).

Proof. The core of the proof is a continuity argument. The first task is to establish a gap in the range of Λ by establishing a dichotomy. Roughly speaking, the following lemma asserts that, for all z ∈ S, with high −γ/2 −1 −γ/4 −1 probability either Λ 6 M Γ or Λ > M Γ , i.e., there is a gap or forbidden region in the range of Λ with very high probability.

Lemma 8.8. We have the bound

−γ/4 −1 −γ/2 −1 1 Λ 6 M Γ Λ ≺ M Γ uniformly in S.

55 Proof. Set −γ/4 −1 φ := 1 Λ 6 M Γ . −γ/4 −1 −γ/4 Then by definition we have φΛ 6 M Γ 6 CM , where in the last step we have used that Γ is q Im msc+Λ bounded below (6.19). Hence we may invoke (8.24) to estimate Λo and Υi by Mη . In order to estimate Λd, we use (8.18) to get

 s  2 Im msc + Λ φΛd = φ max|vi| ≺ Γ Λ + . (8.38) i Mη

Recalling (6.19) and (8.24), we therefore get s  Im m + Λ φΛ ≺ φΓ Λ2 + sc . (8.39) Mη

Next, by definition of φ we may estimate

2 −γ/2 −1 φΓΛ 6 M Γ . Moreover, by definition of S, we have

1 M −γ M −2γ  6 min 3 , 4 . (8.40) Mη Γ Γ Im msc Together with the definition of φ, we have s s s Im m + Λ Im m Γ−1 φΓ sc Γ sc + Γ M −γ Γ−1 + M −γ/2Γ−1 2M −γ/2Γ−1 . Mη 6 Mη Mη 6 6

(Notice that the middle inequality is the crucial place where the definition of ηE and the restriction η > ηE are used.) Plugging this into (8.39) yields φΛ ≺ M −γ/2Γ−1, which is the claim. If we knew that Λ is excluded from the interval [M −γ/2Γ−1,M −γ/4Γ−1], then we could immediately finish the proof of Proposition 8.7. We could argue that Λ = Λ(E + iη) is continuous in η = Im z and hence cannot jump from one side of the gap to the other; moreover, for η = 2 it is below the gap by Lemma 8.6, so Λ is below the gap for all z ∈ S with high probability. For a pictorial illustration of this argument, see Figure 1. However, Lemma 8.8 guarantees a gap in the range of Λ only with a very high probability for each fixed z. We need to use a fine discrete grid in the space of z to upgrade this statement to all z with high probability. Then the continuity argument described in the previous paragraph will be valid with high probability. In the next step we explain the details. The continuity argument Fix D > 10. Lemma 8.8 implies that for each z ∈ S the probability that Λ falls into the gap (or the forbidden region) is very small, i.e., we have

 −γ/3 −1 −γ/4 −1 −D P M Γ(z) 6 Λ(z) 6 M Γ(z) 6 N (8.41)

for N > N0, where N0 ≡ N0(γ, D) does not depend on z. All argument below is valid for N > N0. 10 Next, take a lattice ∆ ⊂ S such that |∆| 6 N and for each z ∈ S there exists a w ∈ ∆ such that −4 |z − w| 6 N . Then (8.41) combined with a union bounds gives

 −γ/3 −1 −γ/4 −1 −D+10 P ∃w ∈ ∆ : M Γ(w) 6 Λ(w) 6 M Γ(w) 6 N . (8.42)

56 Figure 1: The (η, Λ)-plane for a fixed E with the graph of η → Λ(E + iη). The shaded region is forbidden with high probability by Lemma 8.8. The initial estimate, Lemma 8.6, is marked with a black dot. The graph of Λ = Λ(E + iη) is continuous and lies beneath the shaded region. Note that this method does not control Λ(E + iη) in the regime η 6 ηE.

From the definitions of Λ(z), Γ(z), and S (recall (6.19)), we immediately find that Λ and Γ are Lipschitz continuous on S, with Lipschitz constant at most M 2. Hence (8.42) implies that

 −γ/3 −1 −1 −γ/4 −1 −D+10 P ∃z ∈ S : 2M Γ(z) 6 Λ(z) 6 2 M Γ(z) 6 N , i.e., a slightly smaller gap is present simultaneously for all z ∈ S and not only for the discrete lattice ∆. −D+10 We conclude that there is an event Ξ satisfying P(Ξ) > 1 − N such that, for each z ∈ S, either −γ/3 −1 −1 −γ/4 −1 1(Ξ)Λ(z) 6 2M Γ(z) or 1(Ξ)Λ(z) > 2 M Γ(z) . Since Λ is continuous and S is by definition connected, we conclude that either

−γ/3 −1 ∀z ∈ S : 1(Ξ)Λ(z) 6 2M Γ(z) (8.43) or −1 −γ/4 −1 ∀z ∈ S : 1(Ξ)Λ(z) > 2 M Γ(z) . (8.44) Here the bounds (8.43) and (8.44) each hold surely, i.e., for every realization of Λ(z). It remains to show that (8.44) is impossible. In order to do so, it suffices to show that there exists a z ∈ S such that Λ(z) < 2−1M −γ/4Γ(z)−1 with probability greater than 1/2. But this holds for any z −1 with Im z = 2, as follows from Lemma 8.6 and the bound Γ 6 Cη (6.20). This concludes the proof of Proposition 8.7.

8.3.4 Fluctuation averaging lemma Recall the definition of the small control parameter Π from (6.30) and the definition of the average [a] of a N vector (ai)i=1 from (8.12). We have shown in Lemma 8.5 that |Υi| ≺ Π, but in fact the average [Υ] is one order better; it is Π2. This is due to the fluctuation averaging phenomenon which we state as the following lemma. We will explain its proof and related earlier results in Section 10. We shall perform the averaging with respect to a family of complex weights T = (tik) satisfying

−1 X 0 6 |tik| 6 M , |tik| 6 1 . (8.45) k

57 −1 Typical example weights are tik = sik and tik = N . Note that in both of these cases T commutes with S. Lemma 8.9 (Fluctuation averaging). Fix a spectral domain D and a deterministic control parameter Ψ satisfying (8.20). Let the weights T = (tik) satisfy (8.45). (i) If Λ ≺ Ψ, then we have X 2 tikQkGkk = O≺(Ψ ) . (8.46) k

−c (ii) If Λd ≺ M for some c > 0 and Λo ≺ Ψo (with Ψo also satisfying (8.20)), then we have

X 1 2 tikQk = O≺(Ψo) . (8.47) Gkk k

(iii) Assume that T commutes with S and Λ ≺ Ψ. Then we have

X 2 tikvk = O≺(ΓΨ ) . (8.48) k If, additionally, we have X tik = 1, for all i, (8.49) k then X 2 tik(vk − [v]) = O≺(ΓΨe ) , for all i, (8.50) k

where we defined vi := Gii − m. The estimates (8.46)–(8.48), and (8.50) are uniform in the index i.

Notice that the last statement (8.50) involves the parameter Γe instead of Γ indicating that the spectral gap of S is used in its proof. We now prove a simple corollary of the bound (8.47).

−c Corollary 8.10. Suppose that Λd ≺ M for some c > 0 and Λo ≺ Ψo for some deterministic control P parameter Ψo satisfying (8.20). Suppose that the weights T = (tik) satisfy (8.45). Then we have k takΥk = 2 O≺(Ψo).

Proof. The claim easily follows from the Schur complement formula (8.7) written in the form 1 Υi = Ai + Qi . Gii P 2 We may therefore estimate k takΥk using the trivial bound |Ai| ≺ Ψo as well as the fluctuation averaging bound from (8.47).

8.3.5 The final iteration scheme First note that Proposition 8.7 guarantees that φ ≡ 1 may be chosen in Lemma 8.5, since the condition −c φΛ ≺ M is satisfied. Therefore Lemma 8.5 asserts that Λo is stochastically dominated by s s Im m + Λ Im m M ε sc M −εΛ + sc + , Mη 6 Mη Mη

where we have used the Schwarz inequality. The next lemma, the main estimate behind the proof of Theorem 6.7, extends this estimate to bound Λd with the same quantity. Thus, roughly speaking, we can estimate Λ by M −εΛ plus a deterministic error term. This gives a recursive relation on the upper bound for Λ which will be the basic step of our iteration scheme.

58 Proposition 8.11. Let Ψ be a control parameter satisfying

−1/2 −γ/3 −1 cM 6 Ψ 6 M Γ (8.51) and fix ε ∈ (0, γ/3). Then on the domain S we have the implication

Λ ≺ Ψ =⇒ Λ ≺ F (Ψ) , (8.52) where we defined s Im m M ε F (Ψ) := M −εΨ + sc + . Mη Mη

Proof. Suppose that Λ ≺ Ψ for some deterministic control parameter Ψ satisfying (8.51). We invoke Lemma 8.5 with φ = 1 (recall the bound (6.19)) to get s s Im m + Λ Im m + Ψ Λ + |Z | + |Υ | ≺ sc ≺ sc . (8.53) o i i Mη Mη

Next, we estimate Λd. Define the z-dependent indicator function

−γ/4 ψ := 1(Λ 6 M ) . By (8.51), (6.19), and the assumption Λ ≺ Ψ, we have 1 − ψ ≺ 0. On the event {ψ = 1}, (8.17) is rigorous and we get the bound

X 2 ψ|vi| Cψ sikvk − Υi + CψΛ . 6 k P Using the fluctuation averaging estimate (8.48) to bound k sikvk and (8.53) to bound Υi, we find s Im m + Ψ ψ|v | ≺ ΓΨ2 + sc , (8.54) i Mη

2 2 where we used the assumption Λ ≺ Ψ and the lower bound from (6.19) so that CψΛ 6 ΓΨ . Since the set {ψ = 0} has very small probability (using our notation, this is expressed by 1 − ψ ≺ 0), we conclude s Im m + Ψ Λ ≺ ΓΨ2 + sc , (8.55) d Mη which, combined with (8.53), yields s Im m + Ψ Λ ≺ ΓΨ2 + sc . (8.56) Mη

−γ/3 −1 Using Schwarz inequality and the assumption Ψ 6 M Γ we conclude the proof. Finally, we complete the proof of Theorem 6.7. It is easy to check that, on the domain S, if Ψ satisfies (8.51) then so does F (Ψ). In fact this step is the origin of the somewhat complicated definition of ηE. We may therefore iterate (8.52). This yields a bound on Λ that is essentially the fixed point of the map Ψ 7→ F (Ψ), which is given by Π, defined in (6.30), (up to the factor M ε). More precisely, the iteration is −γ/3 −1 started with Ψ0 := M Γ ; the initial hypothesis Λ ≺ Ψ0 is provided by Proposition 8.7. For k > 1 we −1 set Ψk+1 := F (Ψk). Hence from (8.52) we conclude that Λ ≺ Ψk for all k. Choosing k := dε e yields s Im m M ε Λ ≺ sc + . Mη Mη

59 Since ε was arbitrary, we have proved that s Im m (z) 1 Λ ≺ Π = sc + , (8.57) Mη Mη which is (6.31) What remains is to prove (6.32). On the set {ψ = 1}, we once again use (8.17) to get   2 X 2 ψmsc − sikvk + Υi = −ψvi + O(ψΛ ) . (8.58) k Averaging in (8.58) yields 2  2 ψmsc −[v] + [Υ] = −ψ[v] + O(ψΛ ) . (8.59)

By (8.57) and (8.53) with Ψ = Π, we have Λ + |Υi| ≺ Π. Moreover, by Corollary 8.10 (with the choice of 1 2 tik = N ) we have |[Υ]| ≺ Π . Thus we get

2 2 ψ[v] = mscψ[v] + O≺(Π ) .

2 2 Since 1 − ψ ≺ 0, we conclude that [v] = msc[v] + O≺(Π ). Therefore

2     Π Im msc 1 2 Γ 2 C |[v]| ≺ 2 6 2 + 2 6 C + 6 . (8.60) |1 − msc| |1 − msc| |1 − msc|Mη Mη Mη Mη Mη

2 Here in the third step we used the elementary explicit bound Im msc 6 C|1 − msc| from (6.12)–(6.13) and 2 −1 the bound Γ > |1 − msc| from (6.21). Since |mN − msc| = |[v]|, this concludes the proof of (6.32). The proof of (6.33) is exactly the same, just in the third inequality of (8.60) we may use the stronger bound √ Im msc  η/ κ + η from (6.13) in the regime |E| > 2. This completes the proof of Theorem 6.7 in the entire regime S, i.e., for η > ηE.

8.3.6 Summary of the proof: a recapitulation In order to highlight the main ideas again, we summarize the key steps leading to the proof of Theorem 6.7 restricted to the regime η > ηE. As a guiding principle we stress that the main control parameter in the proof is Λ = maxij{|Gii − msc|, |Gij|}. The main goal is to get a closed, self-improving inequality for Λ, at least with a very high probability. We needed the following ingredients:

1) A self consistent system of equations for the diagonal elements of the resolvent, written as 1 1 −(Sv)i + Υi = − , where vi = Gii − msc. (8.61) msc + vi msc

(i) (i) Since Υi contains the resolvent G , we also need perturbation formulas connecting G and G to make these equations self-consistent.

2) A large deviation bound on the off-diagonal elements and on the fluctuation, i.e., on Λo + |Zi|, as s Im m + Λ Λ + |Z | + |Υ | ≺ sc ; (8.62) o i i Mη

−1 3) An initial estimate on Λ for large η, where the a priori bounds |Gij| 6 η 6 C are effective; 4) A dichotomy, showing that there is a forbidden region for Λ:

−γ/4 −1 −γ/2 −1 1 Λ 6 M Γ Λ ≺ M Γ ;

60 5) A crude bound on Λ down to small η obtained from the dichotomy and the initial estimate via a continuity argument; 6) Application of the fluctuation averaging lemma, i.e., Lemma 8.9, to estimate Sv in the self-consistent 2 equation. Thus Sv becomes of order Λ , i.e., one order higher than the trivial bound |vi| 6 Λ. This “boost” exploits the cancellations in averages of weakly correlated centered random variables and it constitutes the crucial improvement over Section 7. Thus we can use (8.17) to have

2 vi . Υi + O(Λ ). (8.63)

7) Combination of the large deviation bound (8.62) for Υi with (8.63) provides an estimate on Λd = maxi |vi| in terms of Λ that is better than the trivial bound Λd 6 Λ. Together with the large deviation bound on Λo in (8.62), we have a closed inequality for Λ: s Im m M ε Λ M −εΛ + sc + . 6 Mη Mη

The error term M −εΛ can be absorbed into the left side and we have proved the estimate on Λ asserted in Theorem 6.7. In practice, the last inequality was formulated in terms of a control parameter and we used an iteration scheme.

8) Averaging the self-consistent equation (8.17) in order to obtain a stronger estimate on the average quantity [v]. Fluctuation averaging is used again, this time to the average of Υ, so that it becomes one order higher. This concludes the proof of Theorem 6.7 for η > ηE.

61 9 Sketch of the proof of the local semicircle law using the spectral gap

In Section 8.3 we proved the local semicircle law, Theorem 6.7, uniformly for η > ηE instead of the larger regime η > ηeE. Recall that for generalized Wigner matrices the two thresholds ηE and ηeE are determined 3/2 by the relation NηE(κE + ηE)  1 and ηeE  1/N. Hence these two thresholds coincide in the bulk but substantially differ near the edge. In this section we sketch the proof of Theorem 6.7 for any η > ηeE, i.e., for the optimal domain for η. For the complete proof we refer to Section 6 of [55]. This section can be read independently of Section 8.3. We point out that the difference between these two thresholds stems from the difference between Γ and Γ,e see (6.28) and (8.19). The bound Γ on the norm of (1−m2S)−1 entered the proof when the self-consistent equation (8.17) were solved. The key idea in this section is that we can use Γe to solve the self-consistent equation (8.17) separately on the subspace of constants (the span of the vector e) and on its orthogonal complement e⊥.

Step 1. We bound the control parameter Λ = Λ(z) from (8.23) in terms of Θ := |mN − msc|. This is ˜ the content of the following lemma. From now on, we assume that c 6 Γ 6 C, which is valid for generalized Wigner matrices, see (6.24). This will simplify several estimates in the argument that follows; for the general case, we refer to [55]. Lemma 9.1. Define the z-dependent indicator function

−γ/4 φ := 1(Λ 6 M ) (9.1) and the random control parameter s Im m + Θ 1 q(Θ) := sc + , Θ = |m − m |. (9.2) Mη Mη N sc

Then we have φΛ ≺ Θ + q(Θ) . (9.3) Proof. For the whole proof we work on the event {φ = 1}, i.e., every quantity is multiplied by φ. We consistently drop these factors φ from our notation in order to avoid cluttered expressions. In particular, we −γ/4 set Λ 6 M throughout the proof. We begin by estimating Λo and Λd in terms of Θ. Recalling (6.19), we find that φ satisfies the hypotheses of Lemma 8.5, from which we get s Im m + Λ Λ + |Υ | ≺ r(Λ) , r(Λ) := sc . (9.4) o i Mη

In order to estimate Λd, from (8.17), we have, on the event {φ = 1}, that

2 X 2  vi − msc sikvk = O≺ Λ + r(Λ) ; (9.5) k

−1 P here we used the bound (9.4) on |Υi|. Next, we subtract the average N i from each side to get

2 X 2  (vi − [v]) − msc sik(vk − [v]) = O≺ Λ + r(Λ) , k P where we used k sik = 1. Note that the average over i of the left-hand side vanishes, so that the average of the right-hand side also vanishes. Hence the right-hand side is perpendicular to e. Inverting the operator 2 ⊥ 1 − mscS on the subspace e and using Γe 6 C therefore yields 2 vi − [v] ≺ Λ + r(Λ) . (9.6)

62 Combining with the bound Λo ≺ r(Λ) from (9.4) and recalling that Θ = |[v]|, we therefore get

Λ ≺ Θ + Λ2 + r(Λ) . (9.7)

2 −γ/4 By definition of φ we have Λ 6 M Λ, so that the second term on the right-hand side of (9.7) may be absorbed into the left-hand side: Λ ≺ Θ + r(Λ) , (9.8) where we used the cancellation property (6.25). Using (9.8) and the Cauchy-Schwarz inequality, we get s s s s Im m + Λ Im m + Θ r(Λ) Im m + Θ 1 r(Λ) sc ≺ sc + sc + M −εr(Λ) + M ε 6 Mη Mη Mη 6 Mη Mη

for any ε > 0. Since ε > 0 can be arbitrarily small, we conclude that

r(Λ) ≺ q(Θ) . (9.9)

Clearly, (9.3) follows from (9.8) and (9.9).

Step 2. Having estimated Λ in terms of Θ, we can derive a closed relation involving only Θ. More precisely, we will show that the following self-consistent equation holds:

 2 −1 2 2 −γ/4 2 φ (1 − msc)[v] − msc [v] = φ O≺ q(Θ) + M Θ . (9.10)

Considering the right hand side as a small error, we will view (9.10) as a small perturbation of a quadratic equation for [v]. Recalling that Θ = |[v]|, the error term can be determined self-consistently. Thus, up to the accuracy of the error terms, we can solve (9.10) for [v]. We now give a heuristic proof for (9.10). There are two main issues which we ignore in this sketch. First, we work only on the event {φ = 1}, i.e., we assume that an a-priori bound Λ  1 has already been proved uniformly for all η > ηeE. It will require a separate argument to show that the complement event {φ = 0} is negligible and in some sense this constitutes the essential part to extend the proof of Theorem 6.7 from η > ηE to the larger regime η > ηeE. Second, we will neglect certain subtleties of the ≺ relation. While most arithmetics involving ≺ works in the same way as the usual inequality relation 6, the cancellation rule (6.25) requires the coefficient of X on the right hand side to be small. We will disregard this requirement below and we will treat ≺ as the usual 6. Since we work on the event {φ = 1}, it is understood that every quantity below is multiplied by φ, but we will ignore this fact in the formulas. Recall from (8.17) that

2 X 2 −1 2 3 vi − msc sikvk + mscΥi = msc vi + O(Λ ) . (9.11) k

In order to take the average over i and get a closed equation for [v], we write, using (9.6),

2 2 2  2 2 vi = [v] + vi − [v] = [v] + 2[v](vi − [v]) + O≺ Λ + r(Λ) .

Plugging this back into (9.11) and taking the average over i gives

2 2 −1 2  3 2 (1 − msc)[v] + msc[Υ] = msc [v] + O≺ Λ + r(Λ) .

1 By Corollary 8.10 with tik = N , we can estimate

2 2 2 [Υ] ≺ Λo ≺ r(Λ) ≺ q(Θ) , (9.12)

63 where we have used (8.24) and (9.9). Hence we have

2 −1 2  3 2  3 2 (1 − msc)[v] − msc [v] = O≺ Λ + r(Λ) 6 O≺ Θ + q(Θ) ,

where we have used (9.3). This shows (9.10).

Step 3. Finally, we need to solve the approximately quadratic relation (9.10) for [v], keeping in mind thatΘ= |[v]|. This will give the main estimates (6.32) and (6.33) in Theorem 6.7. The entry-wise version of the local semicircle law, (6.31), will then directly follow from (9.3) on the event {φ = 1}. Recall the definition of q from (9.2) so that

Im m + Θ 1 q(Θ)2  sc + , (9.13) Mη (Mη)2

and recall from Lemma 6.2 that (√ 2 √ κ + η if |E| 6 2 c |msc(z)| 1 − cη , |1 − m (z)|  κ + η, Im msc(z)  (9.14) 6 6 sc √ η κ+η if |E| > 2 .

In particular, the coefficient of Θ2 = [v]2 in the left hand side of (9.10) is a non vanishing constant, in the right hand side it is at most M −γ/4  1, so the latter can be absorbed into the former. We can thus rewrite (9.10) as [v]2 − B[v] = D, (9.15) where

2 −1 Im msc 1 1 − m + O (Mη) + 2 Im m 1 B = sc  1 − m2 + O(Mη)−1, |D| Mη (Mη)  sc + . −1 −γ/4 sc 6 −1 −γ/4 2 msc + O(M ) msc + O(M ) Mη (Mη) This quadratic equation has two solutions √ B ± B2 + 4D [v] = , (9.16) 2 the correct one is selected by a continuity argument. Fix an energy |E| 6 2, consider [v] = [v(E + iη)] as a function of η and look at the two solution branches (9.16) as continuous functions of η. It is easy to check that for η  1 large we have |B|  1 and |D|  1, so the two solution are well −1 separated, one of them is close to 0, the other one close to B. The a priori bound |[v]| 6 η  1 guarantees that the correct branch is the one near 0. A more careful analysis of (9.16) with the given coefficient functions B(η) and D(η) shows that on this branch the following inequality holds:

−1 Θ = |[v]| . (Mη) . (9.17) To see this more precisely, we distinguish two cases by reducing η from a very large value to 0. In the √ √ √ 2 first regime, where Mη κ + η  1, we have |B|  κ + η and |D| . κ + η/Mη  |B| , so the two −1 branches remain separated and the relevant branch satisfies Θ . |D/B| . (Mη) . As we decrease η and √ p −1 enter the regime Mη κ + η . 1 both solutions satisfy Θ . |B| + |D|. In this regime |B| . (Mη) and −2 |D| . (Mη) , so we also have (9.17). In both cases we conclude the final bound (6.32). The proof of (6.33) for the regime |E| > 2 is similar. The improvement comes from the fact that in this regime (9.14) gives a better bound for Im msc(z).

64 10 Fluctuation averaging mechanism 10.1 Intuition behind the fluctuation averaging Before starting the proof of Lemma 8.9 we explain why it is so important in our context. We also give a heuristic second moment calculation to show the essential mechanism. The leading error in the self-consistent equation (8.58) for vi is Υi. Among the three summands of Υi, see (8.11), typically Zi is the largest, thus we typically have

vi ≺ Zi (10.1) in the regime where Γ is bounded. The large deviation bounds in Theorem 7.7 show that Zi ≺ Λo and a simple second moment calculation shows that this bound is essentially optimal. On the other hand, (8.3) shows that the typical size of the off-diagonal resolvent matrix elements is at least of order (Nη)−1/2, thus the estimate 1 Z ≺ Λ ≺ √ i o Nη is essentially optimal in the generalized Wigner case (M  N). Together with (10.1) this shows that the natural bound for Λ is (Nη)−1/2, which is also reflected in the bound (6.31). 2 −1 However, the bound (6.32) for the average, mN − m = [v], is of order Λ  (Nη) , i.e., it is one order better than the bound for vi. For the purpose of estimating [v], it is the average [Υ] of the leading errors Υi that matters, see (8.59). Since Zi, the leading term in Υi, is a fluctuating quantity with zero expectation, the improvement comes from the fact that fluctuations cancel out in the average. The basic example of this phenomenon is the central limit theorem; the average of N independent centered random variables is −1/2 typically of order N , much smaller than the size of each individual variable. In our case, however, Zi are not independent. In fact, their correlations do not decay, at least in the Wigner case where the column of H with an index other than i plays the same role in the function Zi. Thus standard results on central limit theorems for weakly correlated random variables do not apply here and a new argument is needed. We first give an idea how the fluctuation averaging mechanism works by a second moment calculation. Second moment calculation. First we claim that

1 Qk ≺ Ψo. (10.2) Gkk −1 Indeed, from the Schur complement formula (8.7) we get |Qk(Gkk) | 6 |hkk| + |Zk|. The first term is −1/2 estimated by |hkk| ≺ M 6 Ψo. The second term is estimated exactly as in (8.29) and (8.30), giving (T) |Zk| ≺ Ψo. In fact, the same bound holds if Gkk is replaced with Gkk as long as the cardinality of T is bounded, see (10.10) later. −1 Abbreviate Xk := Qk(Gkk) and compute the variance 2 X X X 2 X E tikXk = tiktilEXkXl = tikEXkXk + tiktilEXkXl . (10.3)

k k,l k k6=l

Using the bounds (8.45) on tik and (10.2), we find that the first term on the right-hand side of (10.3) is −1 2 4 O≺(M Ψo) = O≺(Ψo), where we used that Ψo is admissible, recalling (8.20). Let us therefore focus on the second term of (10.3). Using the fact that k 6= l, we apply the first resolvent decoupling formula (8.1) to Xk and Xl to get  1   1   1 G G   1 G G  X X = Q Q = Q − kl lk Q − lk kl . (10.4) E k l E k l E k (l) (l) l (k) (k) Gkk Gll Gkk GkkGkkGll Gll GllGll Gkk Notice that we used (8.1) in the form

1 1 G(T)G(T) = − kl lk for any k 6= l, k, l 6∈ , (10.5) (T) (Tl) (T) (Tl) (T) T Gkk Gkk Gkk Gkk Gll

65 with T = ∅. We multiply out the parentheses on the right-hand side of (10.4). The crucial observation is that if the random variable Y is independent of i (see Definition 7.3) then not only QiY = 0 but also EQi(X)Y = EQi(XY ) = 0 holds for any X. Hence out of the four terms obtained from the right-hand side of (10.4), the only nonvanishing one is

 G G   G G  Q kl lk Q lk kl ≺ Ψ4 , E k (l) l (k) o GkkGkkGll GllGll Gkk where we used that the denominators are harmless, see (8.27)–(8.28). Together with (8.45), this concludes P 2 4 P 2 the proof of E k tikXk ≺ Ψo, which means that k tikXk is bounded by Ψo in second moment sense.

10.2 Proof of Lemma 8.9 Here we follow the presentation of the proof from [55] and we will comment on previous results in Sec- tion 10.3.1. We start with a simple lemma which summarizes the key properties of ≺ when combined with partial expectation.

−C Lemma 10.1. Suppose that the deterministic control parameter Ψ satisfies Ψ > N , and that for all p there p Cp is a constant Cp such that the nonnegative random variable X satisfies EX 6 N . Suppose moreover that X ≺ Ψ. Then for any fixed n ∈ N we have

n n EX ≺ Ψ . (10.6)

n ε n (Note that this estimate involves deterministic quantities only, i.e., it means that EX 6 N Ψ for any ε > 0 if N > N0(n, ε).) Conversely, if (10.6) holds for any n ∈ N, then X ≺ Ψ. Moreover, we have

n n n n PiX ≺ Ψ ,QiX ≺ Ψ (10.7) uniformly in i. If X = X(u) and Ψ = Ψ(u) depend on some parameter u and the above assumptions are uniform in u, then so are the conclusions. This lemma is a generalization of (6.26), but notice that the majoring quantity Ψ has to be deterministic. For general random variables, X ≺ Y may not imply the analogous relation E[X|F] ≺ E[Y |F] for conditional expectations with respect to a σ-algebra F. Proof of Lemma 10.1. To prove (10.6), it is enough to consider the case n = 1; the case of larger n follows immediately from the case n = 1, using the basic property of the stochastic domination that X ≺ Ψ implies Xn ≺ Ψn. For the first claim, pick ε > 0. Then √ ε ε ε p ε C /2−D/2 EX = EX · 1(X 6 N Ψ) + EX · 1(X > N Ψ) 6 N Ψ + EX2 P(X > N εΨ) 6 N Ψ + N 2 , for arbitrary D > 0. The first claim therefore follows by choosing D large enough. For the converse statement, we use Chebyshev inequality: for any ε > 0 and D > 0, Xn X N εΨ) E N ε−εn N −D, P > 6 N εnΨn 6 6 by choosing n large enough. Finally, the claim (10.7) follows from Chebyshev’s inequality, using a high-moment estimate combined with Jensen’s inequality for partial expectation. We omit the details, which are similar to those of the first claim. We remark that it is tempting to generalize the converse of (10.6) and try to verify stochastic domination n n X ≺ Y between two random variables by proving that EX 6 EY . This implication in general is wrong, but it is correct if Y is deterministic. This subtlety is the reason why many of the following theorems about

66 two random variables A and B are formulated in such a way that “if A satisfies A ≺ Ψ for some deterministic Ψ then, B ≺ Ψ”, instead of directly saying the more natural (but sometimes wrong) assertion that A ≺ B. We shall apply Lemma 10.1 to the resolvent entries of G. In order to verify its assumptions, we record the following bounds.

Lemma 10.2. Suppose that Λ ≺ Ψ and Λo ≺ Ψo for some deterministic control parameters Ψ and Ψo both satisfying (8.20). Fix p ∈ N. Then for any i 6= j and T ⊂ {1,...,N} satisfying |T| 6 p and i, j∈ / T we have

(T) 1 Gij = O≺(Ψo) , = O≺(1) . (10.8) (T) Gii

(T) Moreover, we have the rough bounds Gij 6 M and n 1 N ε (10.9) E (T) 6 Gii

for any ε > 0 and N > N0(n, ε). Proof. The bounds (10.8) follow easily by a repeated application of (8.1), the assumption Λ ≺ M −c, and (T) −1 the lower bound in (6.11). The deterministic bound Gij 6 M follows immediately from η > M by definition of a spectral domain. (T) In order to prove (10.9), we use the Schur complement formula (8.7) applied to 1/Gii , where the (T) expectation is estimated using (5.6) and Gij 6 M. (Recall (6.4).) This gives p 1 ≺ N Cp E (T) Gii

(T) for all p ∈ N. Since 1/Gii ≺ 1, (10.9) therefore follows from (10.6).

We now start the proof of Lemma 8.9. The main statement is (8.47), the other bounds will be relatively simple consequences. Finally, we present an alternative argument for (8.47) that organizes the stopping rule in the expansion somewhat differently.

Proof of Lemma 8.9. Part I: Proof of (8.47). First we claim that, for any fixed p ∈ N, we have

1 Qk ≺ Ψo (10.10) (T) Gkk

uniformly for T ⊂ {1,..., N}, |T| 6 p, and k∈ / T. To simplify notation, for the proof we set T = ∅; the proof −1 for nonempty T is the same. From the Schur complement formula (8.7) we get |Qk(Gkk) | 6 |hkk| + |Zk|. −1/2 The first term is estimated by |hkk| ≺ M 6 Ψo. The second term is estimated exactly as in (8.29) and (8.30): (k) !1/2 X (k) 2 |Zk| ≺ skx Gxy syk ≺ Ψo , x6=y

(k) where in the last step we used that Gxy ≺ Ψo as follows from (10.8), and the bound 1/|Gkk| ≺ 1 (recall −c that Λ ≺ Ψ 6 M ). This concludes the proof of (10.10). −1 P Abbreviate Xk := Qk(Gkk) . We shall estimate k tikXk in probability by estimating its p-th moment 2p by Ψo , from which the claim (8.47) will easily follow using Chebyshev’s inequality. After this preparation, the rest of the proof for (8.47) can be divided into four steps.

67 Step 1: Coincidence structure in the expansion of the Lp norm. Fix some even integer p and write p X X E tikXk = tik1 ··· tikp/2 tikp/2+1 ··· tikp EXk1 ··· Xkp/2 Xkp/2+1 ··· Xkp .

k k1,...,kp

Next, we regroup the terms in the sum over k := (k1, . . . , kp) according to the coincidence structure in k as follows: given a sequence of indices k, define the partition P(k) of {1, . . . , p} by the equivalence relation r ∼ s if and only if kr = ks. Denote the set of all partitions of {1, . . . , p} by Pp. Then we write p X X X E tikXk = tik1 ··· tikp/2 tikp/2+1 ··· tikp 1(P(k) = P )V (k) , (10.11)

k P ∈Pp k

where we defined

V (k) := EXk1 ··· Xkp/2 Xkp/2+1 ··· Xkp . Given a partition P , for any r ∈ {1, . . . , p}, we denote by [r] the block of r in P , i.e., the set of all indices in the same block of the partition as r. Let L ≡ L(P ) := {r : [r] = {r}} ⊂ {1, . . . , p} be the set of lone labels. We denote by kL := (kr)r∈L the summation indices associated with lone labels.

Step 2: Resolution of dependence in weakly dependent random variables. The resolvent entry Gkk depends strongly on the randomness in the k-column of H, but only weakly on the randomness in the other columns.

We conclude that if r is a lone label then all factors Xks with s 6= r in V (k) depend weakly on the randomness

in the kr-th column of H (if r is not a lone label then this statement holds only for “all factors Xks with ks 6= kr”). Thus, the idea is to make all resolvent entries inside the expectation of V (k) as independent of the indices kL as possible (see Definition 7.3), using the first decoupling resolvent identity (8.1): for x, y, u∈ / T and x, y 6= u (x can be equal to y), (T) (T) Gxu Guy (T) (Tu) Gxy = Gxy + (10.12) (T) Guu and for x, u∈ / T and x 6= u 1 1 G(T)G(T) = − xu ux . (10.13) (T) (Tu) (T) (Tu) (T) Gxx Gxx Gxx Gxx Guu

(T) Definition 10.3. A resolvent entry Gxy with x, y∈ / T is maximally expanded with respect to a set B ⊂ {1,...N} if B ⊂ T ∪ {x, y}.

(T) Given the set kL of lone indices, we say that a resolvent entry Gxy is maximally expanded if it is maximally expanded with respect to the set B = kL. The motivation behind this definition is that using (8.1) we cannot add upper indices from the set kL to a maximally expanded resolvent entry. We shall apply (8.1) to all resolvent entries in V (k). In this manner we generate a sum of monomials consisting of off-diagonal resolvent entries and inverses of diagonal resolvent entries. We can now repeatedly apply (8.1) to each factor until either they are all maximally expanded or a sufficiently large number of off-diagonal resolvent entries has been generated. The cap on the number of off-diagonal entries is introduced to ensure that this procedure terminates after a finite number of steps. In order to define the precise algorithm, let A denote the set of monomials in the off-diagonal entries (T) (T) Gxy , with T ⊂ kL, x 6= y, and x, y ∈ k \ T, as well as the inverse diagonal entries 1/Gxx , with T ⊂ kL and x ∈ k \ T. Starting from V (k), the algorithm will recursively generate sums of monomials in A. Let d(A) denote the number of off-diagonal entries in A ∈ A. For A ∈ A we shall define w0(A), w1(A) ∈ A satisfying  A = w0(A) + w1(A) , d(w0(A)) = d(A) , d(w1(A)) > max 2, d(A) + 1 . (10.14) The idea behind this splitting is to use (8.1) on one entry of A; the first term on the right-hand side of (8.1) gives rise to w0(A) and the second to w1(A). The precise definition of the algorithm applied to A ∈ A is as follows.

68 (1) If all factors of A are maximally expanded or d(A) > p + 1 then stop the expansion of A. (2) Otherwise choose some (arbitrary) factor of A that is not maximally expanded. If this entry is off- (T) (T) diagonal, Gxy , we use (10.12) to decompose Gxy into a sum of two terms with u the smallest element (T) (T) in kL \ (T ∪ {x, y}). If the chosen entry is diagonal, 1/Gxx , we use (10.13) to decompose Gxx into two terms with u the smallest element in kL \ (T ∪ {x}). The choice of u to be the smallest element is not (T) important, we just chose it for definiteness of the algorithm. From the splitting of the factor Gxy or (T) (T) (T) 1/Gxx in the monomial A, we obtain a splitting A = w0(A) + w1(A) (i.e., we replace Gxy or 1/Gxx by the right-hand sides of (10.12) or (10.13)).

It is clear that (10.14) holds with the algorithm just defined. In fact, in most cases the last inequality in  (10.14) is an equality, d(w1(A)) = max 2, d(A) + 1 . The only exception is when (10.12) is used for x = y, then two new off-diagonal entries are added, i.e., d(w1(A)) = d(A) + 2. Notice also that this algorithm contains some arbitrariness in the choice of the factor of A to be expanded but any choice will work for the proof. r We now apply this algorithm recursively to each entry A := 1/Gkr kr in the definition of V (k). More r r r r r precisely, we start with A and define A0 := w0(A ) and A1 := w1(A ). In the second step of the algorithm we define four monomials

r r r r r r r r A00 := w0(A0) ,A01 := w0(A1) ,A10 := w1(A0) ,A11 := w1(A1) , and so on, at each iteration performing the steps (1) and (2) on each new monomial independently of the others. Note that the lower indices are binary sequences that describe the recursive application of the operations w0 and w1. In this manner we generate a binary tree whose vertices are given by finite binary r r strings σ. The associated monomials satisfy Aσi := wi(Aσ) for i = 0, 1, where σi denotes the binary string obtained by appending i to the right end of σ. See Figure 2 for an illustration of the tree. Notice that any monomial that is obtained along these recursion is of the form

(T1) (T2) Gx1y1 Gx2y2 ... 0 0 , (10.15) (T1) (T2) Gz1z1 Gz2z2 ... where all factors in the numerators are off-diagonal entries (xi 6= yi) and all factors in the denominators are diagonal entries. For later usage, we shall refer to the sum

0 0 |T1| + |T2| + ... + |T1| + |T2| + ... (10.16) as the total number of upper indices in the monomial (10.15). (The absolute value denotes the cardinality of the set.) We stop the recursion of a tree vertex whenever the associated monomial satisfies the stopping rule of step (1). In other words, the set of leaves of the tree is the set of binary strings σ such that either all factors r r of Aσ are maximally expanded or d(Aσ) > p + 1. We list a few obvious facts of this algorithm:

r (i) d(Aσ) 6 p + 1 for any vertex σ of the tree (by the stopping rule in (1)).

(ii) the number of ones in any σ is at most p (since each application of w1 increases d(·) by at least one).

r (iii) the number of resolvent entries (in the numerator and denominator) in Aσ is bounded by 4p + 1 (since each application of w1 increases the number of resolvent entries by at most four, and the application of w0 does not change this number).

r (iv) The maximal total number of upper indices in Aσ for any tree vertex σ is (4p + 1)p (since |Ti| 6 p and 0 |Ti| 6 p in the formula (10.16)).

(v) σ contains at most (4p + 1)p zeros (since each application of w0 increases the total number of upper indices by one).

69 Figure 2: The binary tree generated by applying the algorithm (1)–(2) to a monomial Ar. Each vertex of the r tree is indexed by a binary string σ, and encodes a monomial Aσ. An arrow towards the left represents the r action of w0 and an arrow towards the right the action of w1. The monomial A11 satisfies the assumptions of step (1), and hence its expansion is stopped, so that the tree vertex 11 has no children.

(vi) The maximal length of the string σ (i.e., the depth of the tree) is at most (4p + 1)p + p = 4p2 + 2p.

2 p Pp 4p2+2p (vii) The number of leaves of a tree is bounded by (Cp ) (since this number is bounded by k=0 k 6 2 p (Cp ) where k is the number of ones in the string encoding the leaf ). Denoting by Lr the set of leaves r 2 p of the binary tree generated from A , we have |Lr| 6 (Cp ) . In particular, the algorithm terminates after at most (Cp2)p steps since each step generates a new leaf.

By definition of the tree and w0 and w1, we have the following resolution of dependence decomposition

X r Xkr = Qkr Aσ , (10.17)

σ∈Lr

r where each monomial Aσ for σ ∈ Lr either consists entirely of maximally expanded resolvent entries or r satisfies d(Aσ) = p + 1. (This is an immediate consequence of the stopping rule in (1)). Using (10.11) and (10.17) we have the representation X X V (k) = ··· Q A1  ··· Q Ap  . (10.18) E k1 σ1 kp σp σ1∈L1 σp∈Lp

Step 3. Bounding the individual terms in the expansion (10.18). We now claim that any nonzero term on the right-hand side of (10.18) satisfies

Q A1  ··· Q Ap  = O Ψp+|L| . (10.19) E k1 σ1 kp σp ≺ o Proof of (10.19). Before embarking on the proof, we explain its idea. First notice that for any string σ we have k b(σ)+1 Aσ = O≺ Ψo , (10.20) where b(σ) is the number ones in the string σ. Indeed, if b(σ) = 0 then this follows from (10.10); if b(σ) > 1 this follows from the last statement in (10.14) which guarantees that every one in the string σ increases the k exponent of Ψo by at least one (we also use (10.8)). In particular, each Aσ is bounded by at least Ψo. k  If we used only the trivial bound Aσ = O≺ Ψo for each factor in (10.19), i.e., we did not exploit the gain from the w1(A) type terms in the expansion, then the naive size of the left-hand side of (10.19) would p only be Ψo. The key observation behind (10.19) is that each lone label s ∈ L yields one extra factor Ψo to the estimate. This is because the expectation in (10.18) would vanish if all other factors Q Ar , r 6= s, kr σr

70 were independent of ks. The expansion of the binary tree makes this dependence explicit by exhibiting ks as a lower index. But this requires performing an operation w1 with the choice u = ks in (10.12) or (10.13). However, w1 increases the number of off-diagonal element by at least one. In other words, every index associated with a lone label must have a “partner” index in a different resolvent entry which arose by application of w1. Such a partner index may only be obtained through the creation of at least one off-diagonal resolvent entry. The actual proof below shows that this effect applies cumulatively for all lone labels. In order to give the rigorous proof of (10.19), we consider two cases. Consider first the case where for some r = 1, . . . , p the monomial Ar on the left-hand side of (10.19) is not maximally expanded. Then σr d(Ar ) = p + 1, so that (10.8) yields Ar ≺ Ψp+1. Therefore the observation that As ≺ Ψ for all s 6= r, σr σr o σs o 2p together with (10.7) implies that the left-hand side of (10.19) is O≺ Ψo . Since |L| 6 p, (10.19) follows. Consider now the case where Ar on the left-hand side of (10.19) is maximally expanded for all r = σr 1, . . . , p. The key observation is the following claim about the left-hand side of (10.19) with a nonzero expectation. (∗) For each s ∈ L there exists r := τ(s) ∈ {1, . . . , p}\{s} such that the monomial Ar contains a resolvent σr entry with lower index ks. To prove (∗), suppose by contradiction that there exists an s ∈ L such that for all r ∈ {1, . . . , p}\{s} the lower index k does not appear in the monomial Ar . To simplify notation, we assume that s = 1. Then, s σr for all r = 2, . . . , p, since Ar is maximally expanded, we find that Ar is independent of k . Therefore, we σr σr 1 have   Q A1 Q A2  ··· Q Ap  = Q A1 Q A2  ··· Q Ap  = 0 , E k1 σ1 k2 σ2 kp σp E k1 σ1 k2 σ2 kp σp where in the last step we used that EQi(X)Y = EQi(XY ) = 0 if Y is independent of i. This concludes the proof of (∗). The statement (∗) can be reformulated as asserting that, after expansion, every lone label s has a r “partner” label r = τ(s), such that the index ks appears also as a lower index in the expansion of A (note that there may be several such partner labels r, we can choose τ(s) to be any one of them ). P For r ∈ {1, . . . , p} we define `(r) := s∈L 1(τ(s) = r), the number of times that the label r was chosen as a partner to some lone label s. We now claim that Ar = O Ψ1+`(r) . (10.21) σr ≺ o −1 To prove (10.21), fix r ∈ {1, . . . , p}. By definition, for each s ∈ τ ({r}) the index ks appears as a lower index in the monomial Ar . Since s ∈ L is by definition a lone label and s 6= r, we know that k does not σr s r appear as an index in A = 1/Gkr kr , i.e., kr 6= ks. By definition of the monomials associated with the tree −1 vertex σr, it follows that b(σr), the number of ones in σr, is at least τ ({r}) = `(r) since each application of w1 adds precisely one new lower index while application of w0 leaves the lower indices unchanged. Note that in this step it is crucial that s ∈ τ −1({r}) was a lone label. Recalling (10.20), we therefore get (10.21). Using (10.21) and Lemma 10.1 we find

p 1  p  Y 1+`(r) p+|L| Qk A ··· Qk Aσ ≺ Ψ = Ψ . 1 σ1 p p o o r=1 This concludes the proof of (10.19). Summing over the binary trees in (10.18) and using Lemma 10.1, we get from (10.19)

p+|L| V (k) = O≺ Ψo . (10.22) Step 4. Summing over the expansion. We now return to the sum (10.11). We perform the summation by −1 first fixing P ∈ Pp, with associated lone labels L = L(P ). Using |tij| 6 M , (8.45), we have

X −p X 1(P(k) = P ) tik1 ··· tik tik ··· tikp M 1(P(k) = P ) . p/2 p/2+1 6 k k

71 Now the number of k with |L| lone labels can be bounded easily by M |L|+(p−|L|)/2 since each block of P that is not contained in L consists of at least two labels. Thus we can bound the last displayed equation by

X −p |L|+(p−|L|)/2 −1/2 p−|L| 1(P(k) = P ) tik1 ··· tik tik ··· tikp M M = (M ) . p/2 p/2+1 6 k

From (10.11) and (10.22) we get

p X X −1/2 p−|L(P )| p+|L(P )| 2p E tikXk ≺ (M ) Ψo CpΨo , 6 k P ∈Pp

where in the last step we used the lower bound from (8.20) and estimated the summation over Pp with a 2 p constant Cp (which is bounded by (Cp ) ). Summarizing, we have proved that p X 2p E tikXk ≺ Ψo (10.23)

k

for any p ∈ 2N. We conclude the proof of (8.47) with a simple application of Chebyshev’s inequality. Fix ε > 0 and D > 0. Using (10.23) and Chebyshev’s inequality we find   X ε 2 −εp P tikXk > N Ψo NN 6 k

−1 for large enough N > N0(ε, p). Choosing p > ε (1 + D), we conclude the proof of (8.47). Remark 10.4. The first decoupling formula (8.1) is the only identity about the entries of G that is needed in the proof of Lemma 8.9. In particular, the second decoupling formula (8.2) is never used, and the actual entries of H never appear in the argument. Part II. Proof of (8.46). The bound (8.46) may be proved by following the proof of (8.47) verbatim; the only modification is the bound (T) (T)  QkGkk = Qk Gkk − m ≺ Ψ ,

which replaces (10.10). Here we again use the same upper bound Ψo = Ψ for Λ and Λo.

Part III. Proof of (8.48) and (8.50). In order to prove (8.48), we note that the last term in (8.17) is of order 2 O≺(Ψ ). Hence we have

X 2 X 2 X 2 2 X 2 wa := taivi = msc taisikvk − msc taiΥi + O≺(Ψ ) = msc saitikvk + O≺(Ψ ) , i i,k i i,k

where in the last step we used Corollary 8.10 (recall that the proof of this corollary used only (8.47) which P 2 was already proved in Part I) to bound i taiΥi by O≺(Ψ ) and that the matrices T and S commute by N assumption. Introducing the vector w = (wa)a=1 we therefore have the equation

2 2 w = mscSw + O≺(Ψ ) , (10.24)

where the error term is in the sense of the `∞-norm (uniform in the components of the vector w). Inverting 2 the matrix 1 − mscS and recalling the definition (6.17) yields (8.48). The proof of (8.50) is similar, except that we have to treat the subspace e⊥ separately. Using (8.49) we write X X X 1 t (v − [v]) = t v − v , ai i ai i N i i i i

72 and apply the above argument to each term separately. This yields

X X X X 1 X t (v − [v]) = m2 t s v − m2 s v + O (Ψ2) ai i sc ai ik k sc N ik k ≺ i i k i k 2 X 2 = msc saitik(vk − [v]) + O≺(Ψ ) , i,k

where we used (6.1) in the second step. Note that the error term on the right-hand side is perpendicular to e when regarded as a vector indexed by a, since all other terms in the equation are. Hence we may invert the matrix (1 − m2S) on the subspace e⊥, as above, to get (8.50). This completes the proof of Lemma 8.9.

10.3 Alternative proof of (8.47) of Lemma 8.9 We conclude this section with an alternative proof of the key statement (8.47) of Lemma 8.9. While the underlying argument remains similar, the following proof makes use of an additional decomposition of the space of random variables, which avoids the use of the stopping rule from Step (1) in the above proof of Lemma 8.9. This decomposition may be regarded as an abstract reformulation of the stopping rule.

−1 Alternative proof of (8.47) of Lemma 8.9. As before, we set Xk := Qk(Gkk) . For simplicity of presen- −1 tation, we set tik = N . The decomposition is defined using the operations Pi and Qi, introduced in Definition 7.3. It is immediate that Pi and Qi are projections, that Pi + Qi = 1, and that all of these Q projections commute with each other. For a set A ⊂ {1,...,N} we use the notations PA := Pi and Q i∈A QA := i∈A Qi.

Let p be even and introduce the shorthand Xeks := Xks for s 6 p/2 and Xeks := Xks for s > p/2. Then we get p p p p ! 1 X 1 X Y 1 X Y Y E Xk = E Xeks = E (Pkr + Qkr )Xeks . N N p N p k k1,...,kp s=1 k1,...,kp s=1 r=1

Introducing the notations k = (k1, . . . , kp) and [k] = {k1, . . . , kp}, we therefore get by multiplying out the parentheses p p 1 X 1 X X Y  c E Xk = E PA QAs Xeks . (10.25) N N p s k k A1,...,Ap⊂[k] s=1

Next, by definition of X , we have that X = Q X , which implies that P c X = 0 if k ∈/ A . eks eks ks eks As eks s s Hence may restrict the summation to As satisfying

ks ∈ As (10.26) for all s. Moreover, we claim that the right-hand side of (10.25) vanishes unless [ ks ∈ Aq (10.27) q6=s

T c for all s. Indeed, suppose that ks ∈ q6=s Aq for some s, say s = 1. In this case, for each s = 2, . . . , p, the factor P c Q X is independent of k (see Definition 7.3). Thus we get As As eks 1

p p Y   Y  P c Q X = P c Q Q X P c Q X E As As eks E A1 A1 k1 ek1 As As eks s=1 s=2  p   Y  = Q P c Q X P c Q X = 0 , E k1 A1 A1 ek1 As As eks s=2

73 where in the last step we used that EQi(X) = 0 for any i and random variable X. We conclude that the summation on the right-hand side of (10.25) is restricted to indices satisfying (10.26) and (10.27). Under these two conditions we have

p X |As| > 2 |[k]| , (10.28) s=1 since each index ks must belong to at least two different sets Aq: to As (by (10.26)) as well as to some Aq with q 6= s (by (10.27)). Next, we claim that for k ∈ A we have

|A| |QAXk| ≺ Ψo . (10.29)

−1 (Note that if we were doing the case Xk = QkGkk instead of Xk = Qk(Gkk) , then (10.29) would have to |A| be weakened to |QAXk| ≺ Ψ , in accordance with (8.46). Indeed, in that case and for A = {k}, we only have the bound |QkGkk| ≺ Ψ and not |QkGkk| ≺ Ψo.) Before proving (10.29), we show how it may be used to complete the proof. Using (10.25), (10.29), and Lemma 10.1, we find

p p 1 X 1 X 2|[k]| X 2u 1 X E Xk ≺ Cp Ψo = Cp Ψo 1(|[k]| = u) N N p N p k k u=1 k p X 2u u−p −1/2 2p 2p 6 Cp Ψo N 6 Cp(Ψo + N ) 6 CpΨo , u=1

where in the first step we estimated the summation over the sets A1,...,Ap by a combinatorial factor Cp n m n+m depending on p, in the forth step we used the elementary inequality a b 6 (a + b) for positive a, b, and in the last step we used (8.20) and the bound M 6 N. Thus we have proved (10.23), from which the claim follows exactly as in the first proof of (8.47). What remains is the proof of (10.29). The case |A| = 1 (corresponding to A = {k}) follows from (10.10), exactly as in the first proof of (8.47). To simplify notation, for the case |A| > 2 we assume that k = 1 and A = {1, 2, . . . , t} with t > 2. It suffices to prove that

1 t Qt ··· Q2 ≺ Ψo . (10.30) G11

We start by writing, using the first decoupling formula (8.1),

1 1 G G G G Q = Q + Q 12 21 = Q 12 21 , 2 2 (2) 2 (2) 2 (2) G11 G11 G11G11 G22 G11G11 G22

(2) where the first term vanishes since G11 is independent of 2 (see Definition 7.3). We now consider 1 G G Q Q = Q Q 12 21 , 3 2 2 3 (2) G11 G11G11 G22 and apply again the first decoupling formula (8.1) with k = 3 to each resolvent entry on the right-hand side, and multiply everything out. The result is a sum of fractions of entries of G, whereby all entries in the numerator are diagonal and all entries in the denominator are diagonal. The leading order term vanishes,

G(3)G(3) Q Q 12 21 = 0 , 2 3 (3) (23) (3) G11 G11 G22

74 so that the surviving terms have at least three (off-diagonal) resolvent entries in the numerator. We may now continue in this manner; at each step the number of (off-diagonal) resolvent entries in the numerator increases by at least one. G12G21 More formally, we obtain a sequence A2,A3,...,At, where A2 := Q2 (2) and Ai is obtained by G11G11 G22 applying (8.1) with k = i to each entry of QiAi−1, and keeping only the nonvanishing terms. The following properties are easy to check by induction.

(i) Ai = QiAi−1.

(ii) Ai consists of the projection Q2 ··· Qi applied to a sum of fractions such that all entries in the numerator are off-diagonal and all entries in the denominator are diagonal.

(iii) The number of (off-diagonal) entries in the numerator of each term of Ai is at least i.

i By Lemma 10.1 combined with (ii) and (iii) we conclude that |Ai| ≺ Ψo. From (i) we therefore get

1 t Qt ··· Q2 = At = O≺(Ψo) . G11 This is (10.30). Hence the proof is complete.

10.3.1 History of the fluctuation averaging Lemma 8.9 is a version of the fluctuation averaging mechanism, taken from [55], that is the most useful for our purpose in proving Theorem 6.7. We now briefly comment on its history. The first version of the fluctuation averaging mechanism appeared in [68] for the Wigner case, where −1 P 2 −1 [Z] = N k Zk was bounded by Λo (recall Zi from (8.11)). Since Qk[Gkk] is essentially Zk, see (8.7), this corresponds to the first bound in (8.46). A different proof (with a better bound on the constants) was given in [70]. A conceptually streamlined version of the original proof was extended to sparse matrices [56] and to sample covariance matrices [113]. Finally, an extensive analysis in [52] treated the fluctuation averaging of general polynomials of resolvent entries and identified the order of cancellations depending on the algebraic structure of the polynomial. Moreover, in [52] an additional cancellation effect was found for the quantity 2 Qi|Gij| . All proofs of the fluctuation averaging lemmas rely on computing expectations of high moments of the averages and carefully estimating various terms of different combinatorial structure. In [52] we have developed a Feynman diagrammatic representation for bookkeeping the terms, but this is necessary only for the case of general polynomials. For the special cases stated in Lemma 8.9, the proof presented here is relatively simple.

75 11 Eigenvalue location: the rigidity phenomenon

The local semicircle law in Theorem 6.7 was proven for universal Wigner matrices, characterized by the upper bound sij 6 1/M on the variances of hij. This result implies rigidity estimates on the location of the eigenvalues with a precision depending on M; e.g. in the bulk spectrum the eigenvalues can typically be located with a precision slightly above 1/M. For the sake of a simplicity, from now on we restrict our presentation to the special case of generalized Wigner matrices, characterized by Cinf /N 6 sij 6 Csup/N with two positive constants Cinf and Csup, see Definition 2.1. So from now on the parameter M will be replaced with N. For rigidity results in the general case, see Theorem 7.6 in [55]. In this section we will show that eigenvalues for generalized Wigner matrices are quite rigid; they may fluctuate only on a scale slightly above 1/N. This is a manifestation that the eigenvalues are strongly correlated; the typical fluctuation scale of N independent, say Poisson, points in a finite interval would be N −1/2.

11.1 Extreme eigenvalues We first state a corollary to Theorem 6.7 on the extreme eigenvalues. Corollary 11.1. For any generalized Wigner matrix, the largest eigenvalue of H is bounded by 2 + N −2/3+ε for any ε > 0 in the sense that for any ε > 0 and any D > 1 we have

 −2/3+ε −D P max |λα| 2 + N N (11.1) α=1,...,N > 6

for any N > N0(ε, D). −2/3 −2/3+ε Proof. Set η = N and choose an energy E = 2 + κ outside of the spectrum with some κ > N  N ε/2η. From (6.33) with z = E + iη we have

N ε/2 1 | Im m (z) − Im m (z)|  (11.2) N sc 6 Nκ Nη with very high probability. On the other hand, if there is an eigenvalue λ with |λ − E| 6 η then we have 1 N ε/2 Im m (z)  Im m (z) + . (11.3) N > 2Nη sc Nκ Here the first inequality comes from 1 X η 1 η Im m (z) = , N N |λ − E|2 + η2 > N |λ − E|2 + η2 α α √ while the second follows from Im msc(z)  η/ κ from (6.13). The equations (11.2) and (11.3) however contradict each other, showing that with very high probability −2/3 there is no eigenvalue with |λ − E| 6 N . We can repeat this argument for a grid of energies E = Ek = 2 + N −2/3+ε + kN −2/3 for k = 0, 1,...,CN 2/3+D for any D finite. We can then use the union bound for the exceptional sets of small probabilities to exclude the existence of eigenvalues between N −2/3+ε and N D. P D Finally, for a large enough D, it is trivial to prove that the bound kHk 6 ij |hij| 6 N holds with very D high probability. This excludes eigenvalues |λ| > N and proves the corollary.

11.2 Stieltjes transform and regularized counting function

For any signed measure ρe on the real line, define its Stieltjes transform by Z ρ(λ) S(z) = e dλ, z ∈ \ . (11.4) λ − z C R R

76 The Helffer-Sj¨ostrandformula (that was originally used to develop an alternative functional calculus for R self-adjoint operators, see e.g., [36]) yields an expression for f(λ)ρe(λ)dλ for a large class of function f(λ) R 1 on the real line in terms of the Stieltjes transform of ρe. More precisely, let f ∈ C (R) with compact support and let χ(y) be a smooth cutoff function with support in [−1, 1], with χ(y) = 1 for |y| 6 1/2 and with bounded derivatives. Define fe(x + iy) := (f(x) + iyf 0(x))χ(y) to be an almost-analytic extension of f. With the standard convention z = x + iy and ∂z¯ = [∂x + i∂y]/2, we have Z 1 ∂z¯fe(x + iy) f(λ) = dxdy. (11.5) π 2 λ − x − iy R −1 To see this identity, we use ∂z¯(λ − z) = 0 to write Z Z 1 ∂z¯fe(x + iy) 1 h fe(x + iy) i dxdy = ∂z¯ dzd¯z. (11.6) π 2 λ − x − iy 2πi 2 λ − x − iy R R We can rewrite the last term as 1 Z h fe(x + iy) i d dz , (11.7) 2πi 2 λ − x − iy R where the operator d is the differential in the sense of one form. From the Green theorem and the compact support of fe, we can integrate by parts to a small circle Cε of radius ε around λ. The contribution of the two dimensional integration inside the circle will vanish in the limit ε → 0. Hence the last term is equal to

1 Z fe(x + iy) lim dz = f(λ) (11.8) ε→0 2πi Cε λ − x − iy and this proves the (11.5). Computing the ∂z¯ in (11.5) explicitly, we have

1 Z iyf 00(x)χ(y) + i(f(x) + iyf 0(x))χ0(y) f(λ) = dxdy. (11.9) π 2 λ − x − iy R

We remark that the same formula holds if we simply defined fe(x + iy) := f(x)χ(y). The addition of the 0 term iyf (x)χ(y) is to make ∂z¯fe(x + iy) = O(y) for |y| small which will be needed in the following estimates. In fact, the concept of almost-analytic function can be extended to arbitrary order n. We could have defined n fe such that ∂z¯fe(x + iy) = O(|y| ), but we will not need this result as it would not improve our estimate. The following lemma in a slightly less general forms appeared in [59, 68, 69]. It shows how to translate estimates on the Stieltjes transform S(z) in the regime Im z > ηe to the regularized distribution function of %e. Given two real numbers, E1 < E2, we wish to estimate %e([E1,E2]), the %e-measure of the interval [E1,E2]. Since the integral of the sharp indicator function cannot be directly controlled, we need to smooth it out; the function f2 − f1 in the lemma below plays this role. The scale of the smoothing, denoted by ηb below, must be larger than ηe. Another feature of this lemma is that the condition on S(z) = S(E + iη) is local; only E-values in a ηe-neighborhood of the endpoints E1,E2 are needed.

Lemma 11.2. Let %e be a signed measure on R and let S be its Stieltjes transform. Fix two energies E1 < E2. Suppose that for some 0 < ηe 6 1/2 and 0 < U1 6 U2 we have U |S(x + iy)| 1 for any x ∈ [E ,E + η] ∪ [E ,E + η] and η y 1, (11.10) 6 y 1 1 e 2 2 e e 6 6 U | Im S(x + iy)| 1 for any x ∈ [E ,E + η] and 1/2 y 1, (11.11) 6 y 1 2 e 6 6 U | Im S(x + iy)| 2 for any x ∈ [E ,E + η] ∪ [E ,E + η] and 0 < y < η. (11.12) 6 y 1 1 e 2 2 e e

77 For η η, define two functions f = f : → such that f (x) = 1 for x ∈ (−∞,E ], f (x) vanishes for b > e j Ej ,ηb R R j j j 0 −1 00 −2 x ∈ [Ej + η,b ∞); moreover |fj(x)| 6 C ηb and |fj (x)| 6 C ηb for some constant C. Then for some other constant C > 0 independent of U1, U2 and ηe, we have Z h η2 i (f − f )(λ)%(λ)dλ Ck%k U | log η| + U e , (11.13) 2 1 e 6 e TV 1 e 2 2 ηb where k · kTV denotes the total variation norm of signed measures. In addition, if we assume that ρe has compact support, then we also have

Z h η2 i f (λ)%(λ)dλ C k%k U | log η| + U e , (11.14) 2 e 6 e TV 1 e 2 2 ηb where C depends on the size of the support of %e. Proof. In this proof, we will consider only the case ηe = ηb and U1 = U2. We will denote these common values by η and U. We may also assume k%ekTV = 1. These simplifications only streamline some notation, the general case is proven exactly in the same way. Let f = f2 − f1 that is smooth and compactly supported. From the representation (11.9) and since f is real, we have Z ∞ Z ∞ ZZ 00 f(λ)ρe(λ)dλ = Re f(λ)ρe(λ)dλ 6 C yf (x)χ(y) Im S(x + iy)dxdy −∞ −∞ ZZ   + C |f(x)||χ0(y)| |Im S(x + iy)| + |y||f 0(x)||χ0(y)| |Re S(x + iy)| dxdy, (11.15)

for some universal C > 0, and where χ is a smooth cutoff function with support in [−1, 1], with χ(y) = 1 for 0 −1 0 |y| 6 1/2 and with bounded derivatives. Recall that f is O(η ) on two intervals of size O(η) each and χ is supported in 1/2 6 |y| 6 1. By (11.10), we can bound ZZ 0 0 |y||f (x)||χ (y)| |Re S(x + iy)| dxdy 6 CU (11.16)

(note that due to S¯(x + iy) = S(x − iy), the bounds analogous to (11.10)–(11.11) also hold for negative y’s). Using that f is bounded with compact support, we have, by (11.11), ZZ 0 |f(x)||χ (y)| |Im S(x + iy)| dxdy 6 CU. (11.17)

For the first term on the right hand side of (11.15), we split the integral into two regimes depending on 0 < y < η or η < y < 1. Note that by symmetry we only need to consider positive y. From (11.12), the integral on the first integration regime is easily bounded by ZZ 00 yf (x)χ(y) Im S(x + iy)dxdy 0

For the second integration regime, y ∈ [η, 1], we can integrate by parts in x, then use ∂x Im S(x + iy) = −∂y Re S(x + iy) and then integrate by parts in y to get: ZZ ZZ 00 0 yf (x)χ(y) Im S(x + iy)dxdy = − f (x)∂y(yχ(y)) Re S(x + iy)dxdy (11.18) y>η y>η Z − f 0(x)ηχ(η) Re S(x + iη)dx. (11.19)

78 0 0 −1 Recall that f (x) = 0 unless |x − Ej| < η for some j = 1, 2 and in this regime we have f = O(η ). By (11.10), the last integral is easily bounded by O(U). For the first integral, we use ∂y(yχ(y)) = O(1) and S(x + iy) = O(U/y) to have ZZ  Z 1  0 dy ∂y(yχ(y))f (x)S(x + iy)dxdy 6 O U = O (U| log η|) , (11.20) y>η η y which is (11.13). To prove (11.14), we simply choose E1 to be any energy on the left side of the support of ρ and notice that e Z Z

f2(λ)%(λ)dλ = (f2 − f1)(λ)%(λ)dλ CU| log η|. (11.21) e e 6 e This completes the proof of Lemma 11.2.

11.3 Convergence speed of the empirical distribution function Define the empirical distribution function (or, in physics terminology, integrated density of states)

1 Z E nN (E) := |{α : λα 6 E}| = %N (x) dx , N −∞

where %N is the empirical eigenvalue distribution

N 1 X % (x) = δ(x − λ ) . (11.22) N N α α=1 Similarly, we define the distribution function of the semicircle density Z E nsc(E) := %sc(x)dx. −∞ We introduce the differences

∆ ∆ % := %N − %sc , m := mN − msc .

1 R −1 and recall mN (z) = N Tr G(z) = %N (x)(x − z) dx is the Stieltjes transform of the empirical density. The following lemma, proven below, shows how local semicircle law implies a convergence rate for the empirical distribution function.

−1+ε Lemma 11.3. Suppose that for some |E| 6 10 and N 6 ηe  1 with some ε > 0 we know that 1 |m (x + iη) − m (x + iη)| ≺ (11.23) N sc Nη

holds for any |x − E| 6 ηe and any η ∈ [η,e 10]. Then we have

nN (E) − nsc(E) ≺ ηe . (11.24)

−1+ε For generalized Wigner matrices, the condition (11.23) holds with the choice ηe = N uniformly in |E| 6 10 (see (6.29) and (6.32)). Thus we have proved the following corollary: Corollary 11.4. For generalized Wigner matrices, we have

−1+ε sup nN (E) − nsc(E) 6 N (11.25) |E|610

−D with probability at least 1 − N for any D, ε > 0, if N > N0(ε, D) is sufficiently large.

79 Proof of Lemma 11.3. Since both the semicircle measure and the empirical measure (by Corollary 11.1) have a compact support within [−3, 3] with a very high probability, (11.24) clearly holds for E 6 −3 and ∆ ε for E > 3. For an energy |E| 6 3, we will apply Lemma 11.2 with the choice ρe = ρ , U1 = U2 = N ηe and E2 = E. The estimate (11.14) will immediately imply (11.24), provided the other assumptions on Lemma 11.2 are satisfied. For this purpose, we need the bounds (11.10), (11.11) and (11.12) on the Stieltjes transform with the ε ε choice U1 = U2 = N ηe. We denote this common value by U := N ηe. The assumption (11.23) implies (11.10) −1+ε (with very high probability) if we choose U > N . From Lemma 6.2 we find √ |Im msc(x + iy)| 6 C κx + y , (11.26)

recalling the definition κx = min{|x − 2|, |x + 2|} from (6.23). By spectral decomposition of H, it is easy to see that the function y 7→ y Im mN (x + iy) is monotone increasing. Thus we get, using (11.26) and (11.23), that p 1  p 1 y Im mN (x + iy) 6 ηeIm mN (x + iηe) ≺ ηe κx + ηe + ≺ ηe κx + ηe + , (11.27) Nηe N ∆ for y 6 ηe and |x| 6 10. Using the notation m := mN − msc and recalling (11.26), we therefore get p 1 |y Im m∆(x + iy)| ≺ η κ + η + Cη (11.28) e x e N 6 e 1 for y 6 ηe and |x| 6 10; here we have used the assumption ηe > N in the last step. Therefore, with the choice ε U = ηNe , the bounds (11.10), (11.11) and (11.12) indeed hold. Recall from Lemma 11.2 that f denotes a smooth indicator function of [−∞,E] on scale η, namely E,ηe e f (x) = 1 for x ∈ [−∞,E] and f (x) = 0 for x E + η and having derivative bounded by η−1. Using E,ηe E,ηe > e e (11.14) from Lemma 11.2, we conclude that Z ∆ fE,η(λ)(λ) % (λ) dλ ≺ η, (11.29) e e Hence we have

Z Z E+ηe Z E+ηe n (E) − n (E) − f (λ) %∆(λ) dλ = − f (λ) % (λ) dλ + f (λ) % (λ) dλ Cη . N sc E,ηe E,ηe N E,ηe sc 6 e E E

Notice that we only used the positivity of %N in the first term and the boundedness of f and %sc in the second. Conversely, we have Z n (E) − n (E)− f (λ) %∆(λ) dλ N sc E−η,e ηe Z E Z E = (1 − f )(λ) % (λ) dλ − (1 − f )(λ) % (λ) dλ −Cη . E−η,e ηe N E−η,e ηe sc > e E−ηe E−ηe Hence we have proved that Z Z −Cη + f (λ) %∆(λ) dλ n (E) − n (E) f (λ) %∆(λ) dλ + Cη . e E−η,e ηe 6 N sc 6 E,ηe e

The right hand side is directly bounded by (11.29), the left hand side is estimated in the same way, using (11.29) for f instead of f . This proves Lemma 11.3. E−η,e ηe E,ηe

11.4 Rigidity of eigenvalues We now state and prove the rigidity theorem concerning the eigenvalue locations for generalized Wigner matrices. Recall the definition of %N from (11.22). Hence we have

j Z λj = %N (x)dx. (11.30) N −∞

80 We define the classical location of the j-th eigenvalue by the equation j Z γj = %sc(x)dx, (11.31) N −∞

in other words, γj is the j-th N-quantile of the semicircle distribution. Theorem 11.5 (Rigidity of eigenvalues). For any generalized Wigner matrix ensemble, for any constant ε > 0 and any constant D > 1 we have

( −1/3 ) εh i −2/3 −D P ∃j : |λj − γj| > N min j, N − j + 1 N 6 N (11.32) for any sufficiently large N > N0(ε, D). Proof. We consider only the case j 6 N/2; the other case is similar. By (11.30) and (11.31) we have

Z λj Z γj Z λj Z γj 0 = ρN (x)dx − ρsc(x)dx = (ρN (x) − ρsc(x))dx − %sc(x)dx, (11.33) −∞ −∞ −∞ λj i.e., Z γj Z λj

%sc(x)dx 6 (ρN (x) − ρsc(x))dx = nN (λj) − nsc(λj) , λj −∞ and thus by the uniform bound (11.25) Z γj 1

%sc(x)dx ≺ . (11.34) λj N √ Since %sc(x)  κx for x ∈ [−2, 1], we have 3/2 0 1/3 nsc(x)  κx ,%sc(x) = nsc(x)  nsc (x), x ∈ [−2, 1] . (11.35)

This also implies for j 6 N/2 that  j 2/3  j 1/3 γ + 2  ,% (γ )  . (11.36) j N sc j N

If we knew that %sc(γj) and %sc(λj) were comparable, then the monotonicity of %sc and the mean-value theorem (11.34) would immediately imply

−1 |λj − γj|ρsc(γj) ≺ N . (11.37) Combining this with the second asymptotics form (11.36), we would have

−1 −1 −1 −1/3 |λj − γj| ≺ N %sc(γj) 6 CN (j/N) . (11.38) ε/2 −1+ε/2 For the complete proof, we first consider the indices j > j0 := N . Since nsc(γj) > N and nsc(γj) = nN (λj), we have −ε/4 nsc(λj) − nsc(γj) 6 N nsc(γj)

with very high probability by (11.25). This shows that nsc(λj)  nsc(γj), but then %sc(λj)  %sc(γj) by (11.35), as we presumed above. ε/2 Finally, for indices j 6 j0 = N we use monotonicity and rigidity (11.38) for the index j0: −2/3+ε/2 −2/3+ε λj 6 λj0 6 γj0 + N 6 −2 + N

with very high probability. In the last step we used (11.36). For the lower bound on λj we refer to (11.1). −2/3+ε −2/3+ε This shows that |λj + 2| 6 N with very high probability and we also have |γj + 2| 6 N , so −2/3+ε |λj − γj| 6 2N . This proves the rigidity estimate (11.32) for all indices j.

81 12 Universality for matrices with Gaussian convolutions 12.1 Dyson Brownian motion Consider the following matrix valued stochastic differential equation 1 1 dH(t)= √ dB(t) − H(t)dt, t > 0 (12.1) N 2

with initial data H0, where B(t) is defined somewhat differently in the two symmetry classes, namely:

(s) (s) (i) in the real symmetric case (indicated by superscript) B is an N × N matrix such that bij for i < j (s) √ (s) (s) and bii / 2 are independent standard Brownian motions, and bij = bji . √ √ (h) (h) (h) (ii) in the complex Hermitian case B is an N × N matrix such that 2 Re(bij ), 2 Im(bij ) for i < j (h) (h) (h) ∗ and bii are independent standard Brownian motions, and bji = (bij ) . We will drop the s or h superscript, the following formulas hold for both cases. In coordinates, we have

dbij(t) 1 dhij(t) = √ − hij(t)dt, (12.2) N 2

where the entries bij(t) have variance t in the Hermitian case and in the symmetric case the off-diagonal entries (bij(t)) have variance t, while the diagonal entries have variance 2t. The equation (12.1) defines a matrix valued Ornstein-Uhlenbeck (OU) process. It plays a distinguished role in the theory of random matrices that was discovered by Dyson. Depending on the typographical convenience, the time dependence will sometimes be indicated as an argument, H(t), and sometimes as an index, Ht, we keep both notations in parallel, i.e., H(t) = Ht. In particular, unlike in some PDE literature, subscripts do not mean derivatives. Next, we are concerned about the evolution of the eigenvalues λ(t) = (λ1(t), λ2(t), . . . , λN (t)) of Ht along the OU flow. We label the eigenvalues in an increasing order, i.e., we assume that λ(t) ∈ ΣN , where we define the open simplex N ΣN := {λ ∈ R : λ1 < λ2 < . . . < λN }. (12.3) A theorem below will guarantee that the eigenvalues are simple and they are continuous functions of t, hence the labelling is consistently preserved along the evolution. In principle, the eigenvalues {λj(t)} and eigenvectors {uj(t)} of Ht are strongly related and one would expect a coupled system of stochastic differential equations for them. It was Dyson’s fundamental observation [45] that the eigenvalues themselves satisfy an autonomous system of stochastic differential equations (SDE) that do not involve the eigenvectors. This SDE is given in the following definition.

Definition 12.1. Given a real parameter β > 1, consider the following system of SDE

√  (i)  2 λi 1 X 1 dλi = √ dBi + − +  dt , i ∈ 1,N , (12.4) βN 2 N λ − λ j i j J K

where (Bi) is a collection of real-valued, independent standard Brownian motions. The solution of this SDE is called the Dyson Brownian Motion (DBM) with parameter β.

Notice that we defined the DBM for any β > 1 and not only for the classical values β = 1, 2 that will correspond to an OU matrix flow. The following theorem summarizes the main properties of the DBM. For the proof, see Lemma 4.3.3 and Proposition 4.3.5 of [8]. We remark that the authors in [8] considered (12.4) 0 2 without the drift term −λi/2. However, any drift term of the form −V (λi) with V ∈ C (R) could be added without further complications.

82 Theorem 12.2. Let β > 1 and suppose that the initial data satisfy λ(0) ∈ ΣN . Then there exists a unique (strong) solution to (12.4) in the space of continuous functions (λ(t))t>0 ∈ C(R+, ΣN ). Furthermore, for any t > 0 we have λ(t) ∈ ΣN and λ(t) depends continuously on λ(0). In particular, if λ(0) ∈ ΣN , i.e., the

multiplicity of the initial points is one, then (λ(t))t>0 ∈ C(R+, ΣN ), i.e., this property is preserved for all times along the evolution. We point out that the solution is considered in the strong sense. This means that despite the singularity in (12.4), the DBM admits a solution for almost all realization of the Brownian motions Bi. For the precise definition we recall the concept of strong solution to a (scalar) SDE of the form

dXt = a(Xt)dBt + b(Xt)dt, X0 = ξ, t > 0, (12.5)

with respect to a fixed realization of a Brownian motion Bt, where a and b are given coefficient functions.

The strong solution is a process (Xt)t>0 on a filtrated probability space (Ω, Ft) with continuous sample paths that satisfies the following properties

• Xt is adapted to the filtration Ft = σ(Gt ∪ N ), where

Gt = σ(Bs, s 6 t, X0), N := {N ⊂ Ω, ∃G ⊂ G∞ with N ⊂ G, P(G) = 0},

i.e., the filtration of Xt is the same as that of Bt after completing it with events of zero measure;

• X0 = ξ almost surely; R t  2 2 • For any time t, we have 0 |b(Xs)| + |a(Xs)| ds < ∞ almost surely; • The integral version of (12.5), i.e., Z t Z t Xt = X0 + a(Xs)dBs + b(Xs)ds 0 0 holds almost surely for any t > 0.

For simplicity we only presented the definition for the scalar case, Xt ∈ R; the extension to the vector valued N case, Xt ∈ R , is straightforward. The relation between the OU matrix flow (12.1) and the DBM (12.4) is given by the following theorem essentially due to Dyson.

Theorem 12.3. [45] Let Ht solve the matrix valued SDE (12.1) in a strong sense. Then its eigenvalue process satisfies (12.4), where the parameter β = 1 if the matrix B is real symmetric and β = 2 if B is complex Hermitian. 1 We remark that the Ornstein-Uhlenbeck drift term − 2 Htdt in (12.1) is not essential. One may replace it 0 2 with −V (H)dt with any V ∈ C (R) potential, then the same theorem holds but the −λi/2 term has to be 0 replaced with −V (λi) in (12.4). In particular, the drift term can be removed; for example the monograph [8] defines DBM without the drift term. While there is no fundamental difference between these formulations, (12.1) is technically slightly more convenient since it preserves not only the zero expectation value but also the 1/N variance of the matrix elements if the variance of the initial data H0 is 1/N. −1 If technical complications related to the singularities (λi − λj) in (12.4) could be neglected, the proof of Theorem 12.3 would be a relatively straightforward Ito calculus combined with standard perturbation formulas for eigenvalues and eigenvectors of self-adjoint matrices. We will present this calculation in the next section. A rigorous proof is slightly more cumbersome since a priori the L2 integrability of the singularity is not guaranteed. The actual proof first regularizes the singular term on a very short scale. Then a stopping time argument shows that the regularization can be removed since the dynamics has an effective level repulsion mechanism that keeps neighboring points separated. We will not present this argument here since it is not particularly instructive for this book. The full details can be found in Section 4.3.1 of [8] for 1 the case when the term − 2 Htdt is not present in (12.1). The necessary modifications for (12.1) (or for other 0 drift term −V (λi)) are straightforward.

83 12.2 Derivation of Dyson Brownian motion and perturbation theory

2 Let uα, α = 1, 2,...,N, be the ` -normalized eigenvectors of H = (hij) with (real) eigenvalues λα. For simplicity we assume that all eigenvalues are distinct. Fix an index pair (i, j) and abbreviate the partial derivative as ∂f f˙ := . ∂hij ∗ Differentiating Huα = λαuα and uαuβ = δα,β yields ˙ H˙ uα + Hu˙ α = λαuα + λαu˙ α, (12.6)

as well as ∗ ∗ ∗ u˙ αuβ + uαu˙ β = 0, u˙ αuα = 0.

Taking the inner product with uα on both side of (12.6), we get

˙ ∗ ˙ λα = uαHuα. (12.7)

∗ Moreover, multiplying (12.6) by uβ yields, for β 6= α,

∗ ˙ ∗ ∗ uβHuα + uβHu˙ α = λαuβu˙ α,

i.e., ∗ ˙ ∗ ∗ uβHuα + λβuβu˙ α = λαuβu˙ α,

where we have used that H is self-adjoint and λβ is real in the last equation. Hence,

∗ ˙ X ∗ X uβHuα u˙ α = (uβu˙ α)uβ = uβ. (12.8) λα − λβ β6=α β6=α

For simplicity of notations, from now on, we consider only the real symmetric case. The complex Her- mitian case can be treated similarly. In the real symmetric case, (12.7) reads as

∂λα = uα(i)uα(j)[2 − δij], (12.9) ∂hij

where uα(1), uα(2), . . . , uα(N) denote the coordinates of the vector uα. We may assume that the eigenvectors are real. From (12.8) we get

∂uα(k) X uβ(i)uα(j) + uβ(j)uα(i)[1 − δij] = uβ(k). (12.10) ∂hij λα − λβ β6=α

Combining these last two formulas allows us to compute the second partial derivatives, i.e., for any fixed indices i, j, k, ` we have

2 ∂ λα h∂uα(i) ∂uα(k)i =[2 − δik] uα(k) + uα(i) ∂h`j∂hik ∂h`j ∂h`j X 1 h  =[2 − δik] uβ(j)uα(`) + uβ(`)uα(j)[1 − δj`] uβ(i)uα(k) λα − λβ β6=α (12.11)  i + uβ(`)uα(j) + uβ(j)uα(`)[1 − δj`] uβ(k)uα(i) X 1 h  i =[2 − δik] uβ(j)uα(`) + uβ(`)uα(j)[1 − δj`] uβ(i)uα(k) + uβ(k)uα(i) . λα − λβ β6=α

84 By (12.2), (12.9), (12.11) and using Ito’s formula (neglecting the issue of singularity), we have

2 X ∂λα 1 X X ∂ λα dλα = dhik + (dhik)(dh`j) ∂hik 2 ∂hik∂h`j i6k i6k j6` X hdbik hik i = u (i)u (k) √ − dt α α 2 i,k N (12.12) 1 X X 1 h 2 2 2 2i + |uβ(i)| |uα(k)| + |uα(i)| |uβ(k)| dt 2N λα − λβ i,k β6=α

1 X λα 1 X 1 =√ uα(i)uα(k)dbik − dt + dt. 2 N λα − λβ N i,k β6=α

1 In the second line we used bik = bki for i 6= k and that (dhik)(dh`j) = N δi`δkj[1 + δik]dt. Finally, in the last line we have used the equation Huα = λαuα to get

X X X 2 uα(i)uα(k)hik = uα(i)(Huα)(i) = λα |uα(i)| = λα, i,k i i

which gives the −(λi/2)dt term in (12.4). In the first term on the right hand side of (12.12) we define a new real Gaussian process X Beα := uα(i)uα(k)bik. i,k

Clearly, EdBeα = 0 and its covariance satisfies X X X EdBeαdBeα0 = E uα(i)uα(k)dbikuα0 (`)uα0 (j)db`j = 2 uα(i)uα0 (i)uα(k)uα0 (k)dt = δαα0 2dt. i,k `,j i,k

where√ there are two pairings, (i, `) = (k, j) and (i, `) = (j, k), that contributed to the contractions. Thus N N Beα = 2Bα, where (Bα)α=1 is the standard (real) Brownian motion in R and this gives the martingale term in (12.4) with β = 1. A similar formula can be derived for the Hermitian case, where the parameter becomes β = 2; we omit the details. In the next section we put the Dyson Brownian motion in a more general context which is closer to the interpretation of GUE as an invariant ensemble in the spirit of Section 4. It turns out that the measure on the eigenvalues, explicitly given in (4.3), generates a DBM in a canonical way such that this measure will be invariant under the dynamics. This holds for any values β > 1, i.e., even if there is no underlying matrix ensemble.

12.3 Strong local ergodicity of the Dyson Brownian motion One key property of DBM is that it leaves the Gaussian Wigner ensembles invariant. Moreover, the Gaussian measure is the only equilibrium of DBM and the DBM dynamics converges to this equilibrium from any initial condition. In this section we will quantify these ideas. We start with introducing the concept of the Dirichlet form and the generator associated with any N probability measure µ = µN on R . For definiteness, the reader may keep the invariant β-ensemble (4.3) in mind, i.e., N e−βNHN (λ) 1 X 1 X µ (dλ) = dλ, H (λ) := V (λ ) − log |λ − λ |. (12.13) N Z N 2 i N j i i=1 i

85 We define the Dirichlet form associated with the measure µ on RN , which is just the homogeneous H1 norm, i.e.,

N Z 1 X 2 1 2 D (f) := (∂ f) dµ = k∇fk 2 , (∂ ≡ ∂ ) . (12.14) µ βN i βN L (µ) i λi i=1

We remark that in our earlier papers the Dirichlet form was defined with a 1/(2N) prefactor instead of 1/(βN). The current convention is suitable to consider DBM for general β. The symmetric operator associated with the Dirichlet form is called generator and denoted by L = Lµ. It satisfies Z Dµ(f) = hf, (−L)fiL2(µ) = − fLfdµ.

Notice that we follow the probabilistic convention to define the generator as a negative operator, L 6 0. 1 Formally, we have L = βN ∆ − (∇H) · ∇, i.e.,

N N (i) X 1 X  1 1 X 1  L = ∂2 + − V 0(λ ) + ∂ . (12.15) βN i 2 i N λ − λ i i=1 i=1 j i j

1 2 In the special Gaussian case, when the confining potential is V (λ) = 2 λ , the corresponding measure N restricted to ΣN ⊂ R is denoted by µG. Notice that for β = 1, 2 the measure µG coincides with the GOE, GUE measures, respectively (see also (4.12)). The generator (12.15) reads as

N N (i) X 1 X  1 1 X 1  L = ∂2 + − λ + ∂ . (12.16) G βN i 2 i N λ − λ i i=1 i=1 j i j

Now we consider the Dyson Brownian motion (12.4), i.e., dynamics of the eigenvalues λ = (λ1, λ2, . . . , λN ) ∈ ΣN of Ht that evolves by (12.1). We write the distribution of λ of Ht at time t as ft(λ)µG(dλ). Comparing the SDE (12.4) with LG, we notice that Kolmorogov’s forward equation for the evolution of the density ft takes the form

∂tft = LGft , (t > 0) . (12.17)

We remark that the rigorous definition of L as a self-adjoint operator on L2(dµ) through the Friedrichs extension and the existence of the corresponding dynamics (12.17) restricted to the simplex ΣN for any β > 1 will be discussed in the separate Section 12.4. In particular, by spectral theorem the operator (−L)1/2etL is 2 1 bounded for t > 0, thus we see that for any initial condition f0 ∈ L (dµ) we have ft ∈ H (dµ) for any t > 0. The restriction β > 1 is essential to define the dynamics on the simplex ΣN without specifying additional boundary conditions; the same restriction was also necessary in Theorem 12.2 to define the strong solution to the DBM as a stochastic differential equation. Indeed, if β < 1, the particles cross each other in the SDE formulation. In this section we thus assume β > 1. Nevertheless, some ideas and results presented here are still applicable to any β > 0 with a regularization; we will comment on this in Section 13.7. By writing the distribution of the eigenvalues as ftµG, our formulation of the problem has already taken into account Dyson’s observation that the invariant measure for DBM is µG since as we will see, at t → ∞ the solution ft converges to f∞ = 1. A natural question regarding the DBM is how fast the dynamics reaches equilibrium. Dyson had already posed this question in 1962: Dyson’s conjecture [45]: The global equilibrium of DBM is reached in time of order one and the local equilibrium (in the bulk) is reached in time of order 1/N. Dyson further remarked, “The picture of the gas coming into equilibrium in two well-separated stages, with microscopic and macroscopic time scales, is suggested with the help of physical intuition. A rigorous proof that this picture is accurate would require a much deeper mathematical analysis.”

86 We will prove that Dyson’s conjecture is correct if the initial data of the DBM is the eigenvalues of a Wigner ensemble, which was Dyson’s original interest. Our result in fact is valid for DBM with much more general initial data that we now survey. Briefly, it will turn out that the global equilibrium is indeed reached within a time of order one, but local equilibrium is achieved much faster if an a priori estimate on the location of the eigenvalues (also called points) is satisfied. Recalling the definition of the classical locations γj from (11.31), the a priori bound (originally referred to Assumption III in [63, 64]) is formulated as follows

A priori Estimate: There exists a ξ > 0 such that average rigidity on scale N −1+ξ holds, i.e., N 1 Z X Q = Q := sup (λ − γ )2f (λ)µ (dλ) CN −2+2ξ (12.18) ξ N j j t G 6 06t6N j=1 with a constant C uniformly in N. (We assumed (12.18) with a lower bound t > 0 in the supremum, but in −1+2ξ fact the proof shows that t > N would be sufficient.) Notice that (12.18) requires that the rigidity of the eigenvalues on scale N −1+ξ holds, at least in an average sense. We recall that Theorem 11.5 guarantees that for Wigner eigenvalues rigidity holds on scale N −1+ξ for any ξ > 0 (in the bulk), so (12.18) holds for any ξ > 0 in this case. However, our theory on the local ergodicity of the DBM applies to other models as well, so we wish to keep our formulation general and allowing situations where (12.18) holds only for larger ξ. The main result on the local ergodicity of Dyson Brownian motion (12.17) states that if the a priori estimate (12.18) is satisfied then the local correlation functions of the measure ftµG are the same as the −1+2ξ corresponding ones for the Gaussian measure, µG = f∞µG, provided that t is larger than N . The n-point correlation functions of the (symmetrized) probability measure ν are defined, similarly to (4.20), by Z (n) pν,N (x1, x2, . . . , xn) := ν(x)dxn+1 ... dxN , x = (x1, x2, . . . , xN ). (12.19) N−n R In particular, when ν = ftdµG, we denote the correlation functions by p(n) (x , x , . . . , x ) = p(n) (x , x , . . . , x ). (12.20) t,N 1 2 n ftµG,N 1 2 n We also use p(n) for p(n) . In general, if H is an N × N Wigner matrix, then we will also use p(n) for the G,N µG,N H,N correlation functions of the eigenvalues of H. Due to the convention that one can view the locations of eigenvalues as the coordinates of particles (or points), we have used x, instead of λ, in the last equation. From now on, we will use both conventions de- pending on which viewpoint we wish to emphasize. Notice that the probability distribution of the eigenvalues at the time t, ftµG, is the same as that of the Gaussian divisible matrix: −t/2 −t 1/2 G Ht = e H0 + (1 − e ) H , (12.21) G where H0 is the initial Wigner matrix and H is an independent standard GUE (or GOE) matrix. This establishes the universality of the Gaussian divisible ensembles. The precise statement is the following theorem (notice that [64] uses somewhat different notation). 1 Theorem 12.4. [64, Theorem 2.1] Suppose that for some exponent ξ ∈ (0, 2 ), the average rigidity (12.18) −1+ξ holds for the solution ft of the forward equation (12.17) on scale N . Additionally, suppose that in the bulk the rigidity holds on scale N −1+ξ even without averaging, i.e., for any κ > 0 −1+ξ sup |λj − γj| ≺ N (12.22) κN6j6(1−κ)N −1+2ξ holds for any t ∈ [N ,N] if N > N0(ξ, κ) is large enough. Let E ∈ (−2, 2) and b = bN > 0 such that [E − b, E + b] ⊂ (−2, 2). Then for any integer n > 1 and for any compactly supported smooth test function O : Rn → R, we have, for any t ∈ [N −1+2ξ,N], Z E+b dE0 Z  α  hN −1+ξ r 1 i (n) (n)  0 ε dα O(α) pt,N − pG,N E + 6 N + kOkC1 (12.23) 2b n N b bNt E−b R for any N sufficiently large, N > N0(n, ξ, κ).

87 The upper limit t 6 N for the time interval in (12.18) and within the theorem is unimportant, one could ε ε have replaced it with N for any ε > 0. Beyond t > N the measure ftµG is already exponentially close to equilibrium µG in entropy sense, so all correlation functions are also close. Also notice that for simplicity of the notation we replaced α by α in the argument of the correlation functions (compare (12.23) with N%sc(E) N (5.2)). This can be easily achieved by a change of variables and a redefinition of the test function O. This convention will be followed in the rest of the book. −1+ξ 1 In other words, this theorem states that if we have rigidity on scale N for some ξ ∈ (0, 2 ), then the DBM has average energy universality in the bulk for any time t  N −1+2ξ on scale b  max{N −1+ξ, (Nt)−1}. For generalized Wigner matrices we know that rigidity holds on the smallest possible scale, so we have the following corollary:

Corollary 12.5. Consider the matrix OU process with initial matrix ensemble H0 being a generalized Wigner ensemble and let |E| < 2. Then for any ε > 0, for any t ∈ [N −1+2ε,N ε] and for any b (Nt)−1 with > b < |E| − 2 we have

Z E+b dE0 Z  α  N 3ε (n) (n)  0 dα O(α) pt,N − pG,N E + 6 √ kOkC1 , (12.24) 2b n N E−b R bNt

for any N N (n, ε), where p(n) is the correlation function of H , or equivalently, that of e−t/2H + √ > 0 t,N t 0 1 − e−tHG. √ −t/2 −t G Proof. Since H0 is a generalized Wigner ensemble, so is e H0 + 1 − e H for all t. Hence the optimal rigidity estimate (11.32) holds for all times. Therefore, (12.18) and (12.22) hold with any ξ > 0. Choose ξ = ε/2 and apply Theorem 12.4. The right hand side of (12.23) is bounded by

ε/2 3ε εhN 1 i N N + √ 6 √ , Nb bNt bNt

ε where we have used that t 6 N and bNt > 1. The estimate (12.24) means that averaged energy universality holds in a window of size slightly bigger than tN. A large energy window of size b = N −δ, for some small δ > 0, can already be achieved after a very −1+2δ short times, i.e., for any t > N (by choosing ε = δ/8). To achieve universality on very short energy −1+δ −δ/2 scales, b = N , we will need relatively large times, t > N (by choosing ε = δ/16). In both cases the right hand side (12.24) is bounded by CN −ε. In the sense of universality of large energy windows of size b = N −δ, our result essentially establishes the Dyson’s conjecture that the time to local equilibrium is N −1. A better error estimate can also be achieved if both b and t are large, e.g., with the choice b = N −δ and −δ −1/2+δ+ε t > N , we get a convergence of order N for any ε > 0. Going back to Theorem 12.4, we also remark that there is a considerable room in this argument even if optimal rigidity is not available (this is the case for random band matrices, see [55]). In fact, Theorem 12.4 −δ provides universality, albeit only for relatively large times t > N and with a large energy averaging, −δ 1+δ b = N , even if (12.18) and (12.22) hold only with ξ = 2 . This restriction comes from the requirement −1+2ξ −δ that t > N should be implied by t > N . This exponent ξ slightly above 1/2 corresponds to a rigidity control on a scale just below N −1/2 in the bulk. Clearly, N −1/2 is a critical threshold; no local universality can be concluded with this argument unless a rigidity control on scale below N −1/2 is established a priori.

12.4 Existence and restriction of the dynamics This technical section was taken from Appendix A of [64]; we include it here for the reader’s convenience. As in Section 12.3, we consider the Euclidean space RN with the normalized measure µ = exp(−βNH)/Z. The Hamiltonian H, of the form (12.13), is symmetric with respect to the permutation of the variables x = N (x1, . . . , xN ), thus the measure can be restricted to the subset ΣN ⊂ R defined in (12.3). In this appendix 1 1 we outline how to define the dynamics (12.17) with its generator, formally given by L = 2N ∆ − 2 (∇H)∇,

88 Q β on ΣN . The condition β > 1 and the specific factors i 0, on L (Theorem 1.4.1 [74]) and it can be extended to a contraction 1 N semigroup to L (R , dµ), kTtfk1 6 kfk1 (Section 1.5 [74]). The generator L of the semigroup, is defined via the Friedrichs extension (Theorem 1.3.1 [74]) and it is a positive self-adjoint operator on its natural domain ∞ 1 1 D(L) with C0 being the core. The generator is given by L = 2N ∆ − 2 (∇H)∇ on its domain (Corollary 2 2 1.3.1 [74]). By the spectral theorem, Tt maps L into D(L), thus with the notation ft = Ttf for some f ∈ L , it holds that ∂tft = Lft, t > 0, and lim kft − fk2 = 0. (12.25) t→0+ 2 1 Moreover, by approximating f by L functions and using that Tt is contraction in L (Section 1.5 in [74]), the differential equation holds even if the initial condition f is only in L1. In this case the convergence + 1 ∞ ft → f, as t → 0 , holds only in L . We remark that Tt is also a contraction on L , by duality. Now we restrict the dynamics to Σ = ΣN equipped with the probability measure µΣ(dx) := N!µ(dx) · 1(x ∈ Σ). Here µN is from (12.13) and the factor N! restores the normalization after restriction to Σ. N (Σ) Repeating the general construction with R replaced by ΣN , we obtain the corresponding generator L (Σ) 1 ∞ and the semigroup Tt on the space H (Σ, dµΣ) that is defined as the closure of C0 (Σ) with respect to 2 2 the norm k · k+ = E(·, ·) + k · k2 on Σ. To establish the relation between L and L(Σ), we first define the symmetrized version of Σ

N n o Σe := R \ x : ∃i 6= j with xi = xj .

∞ 1 N Denote X := C0 (Σ).e The key information is that X is dense in H (R , dµ) which is equivalent to the ∞ N density of X in C0 (R , dµ). We will check this property below. Then the general argument above directly N applies if R is replaced by ΣeN and it shows that the generator L is the same (with the same domain) if we ∞ N start from X instead of C0 (R , dµ) as a core. Note that both L and L(Σ) are local operators and L is symmetric with respect to the permutation of the variables. For any function f defined on Σ, we define its symmetric extension onto Σe by fe. Clearly, (Σ) ∞ Lfe = L^f for any f ∈ C0 (Σ). Since the generator is uniquely determined by its action on its core, and the 1 (Σ) generator uniquely determines the dynamics, we see that for any f ∈ L (Σ, dµ), one can determine Tt f by computing Ttfe and restricting it to Σ. In other words, the dynamics (12.17) is well defined when restricted to Σ = ΣN . ∞ N ∞ N Finally, we have to prove the density of X in C0 (R , dµ), i.e., to show that if f ∈ C0 (R ), then there ∞ exists a sequence fn ∈ C0 (Σ)e such that E(f − fn, f − fn) → 0. The structure of Σe is complicated since in + addition to the one codimensional coalescence hyperplanes xi = xj (and xi = 0 in case of Σ ), it contains higher order coalescence subspaces with higher codimensions. We will show the approximation argument in a neighborhood of a point x such that xi = xj but xi 6= xk for any other k 6= i, j. The proof uses the fact that the measure dµ vanishes at least to first order, i.e., at least as |xi − xj|, around x, thanks to β > 1. This is the critical case; the argument near higher order coalescence points is even easier, since they have lower codimension and the measure µ vanishes at even higher order. In a neighborhood of x we can change to local coordinates such that r := xi −xj remains the only relevant ∞ coordinate. Thus the task is equivalent to show that any g ∈ C0 (R) can be approximated by a sequence ∞ gε ∈ C0 (R \{0}) in the sense that Z 0 0 2 |g (r) − gε(r)| |r|dr → 0 (12.26) R

89 as ε → 0 over a discrete set, e.g. ε = εn = 1/n. It is sufficient to consider only the positive semi-axis, i.e., r > 0. More precisely, define gε(r) := g(r)φε(r), (12.27) 2 where φε is a continuous cutoff function defined as φε(r) = 0 for r 6 ε , φε(r) = 1 for r > ε and 1 φ (r) = C (log r − 2 log ε), ε2 r ε, C = = | log ε|−1. (12.28) ε ε 6 6 ε log ε − log ε2 Simple calculation shows that Z ∞ Z ∞ Z ∞ 0 0 2 0 2 2 2 0 2 |g (r) − gε(r)| rdr 6 C |g (r)| |1 − φε(r)| rdr + C |g(r)| |φε(r)| rdr. (12.29) 0 0 0

1 The first term converges to zero as ε → 0 by dominated convergence since g ∈ H (R+, rdr) and φε → 1 pointwise. The second term is bounded by

Z ∞ Z ε 1 2 kgk2 2 0 2 2 2 ∞ |g(r)| |φε(r)| rdr 6 Cε kgk∞ rdr = , 0 ε2 r | log ε|

∞ 1 which also tends to zero since g ∈ L . This shows that gε converges to g in H . The function gε is not 2 yet smooth, but we can smooth it on a scale δ  ε by taking the convolution with a smooth function ηδ 1 compactly supported in [−δ, δ]. It is an elementary exercise to see that ηδ ?gε converges to gε in H as δ → 0. ∞ 1 ∞ This verifies (12.26). The same proof shows that C0 (R+ \{0}) is dense in H (R+, rdr)∩L (R+, rdr). This 1 completes the proof that for β > 1 the dynamics is well defined in the space H (Σ, dµΣ). ∞ In fact, the boundedness condition f ∈ L (R+, rdr) can be removed in the above argument and one can ∞ 1 show that C0 (R+ \{0}) is dense in H (R+, rdr). This is a type of Meyers-Serrin theorem concerning the ∞ equivalence of two possible definitions of Sobolev spaces on the measure space (R+, rdr): the closure of C0 functions coincides with the set of functions with finite H1-norm. Another formulation is the following: If we extend the functions on R+ to two dimensional radial functions, i.e., G(x) := g(|x|), Gε(x) = gε(|x|) for x ∈ R2, then this statement is equivalent to the fact that a point in two dimensions has zero capacity. To see this stronger claim, we need a different cutoff function in (12.27). Let us now define

2 φε(r) := log(a + b log r), ε 6 r 6 ε, 2 with b := (1 − e)/ log ε, a := 2e − 1, and φε(r) := 0 for r < ε , and φε(r) := 1 for r > ε. The first term on the right hand side of (12.29) goes to zero as before. For the last term in (12.29), we have

Z ∞ Z ε 2 2 0 2 2 |g(r)| ) |g(r)| |φε(r)| rdr = b 2 2 rdr. 0 ε2 r (a + b log r)

We view this as an integral on R2, by radially extending the function g to R2 by G(y) := g(|y|) for any y ∈ R2. Then the LHS above gives Z ε 2 Z 2 Z 2 2 |g(r)| 2 |G(y)| |G(y)| b 2 2 rdr = b 2 2 dy 6 C 2 2 dy. 2 r (a + b log r) 2 |y| (a + b log |y|) 2 |y| (log |y|) ε ε 6|y|6ε ε 6|y|6ε

−2k 2 Define the sequence εk := 2 , clearly εk = εk+1. Setting Z |G(y)|2 I := dy, k |y|2(log |y|)2 εk+16|y|6εk we clearly have

∞ Z 2 Z Z ∞ X |G(y)| 2 0 0 2 Ik = 2 2 dy 6 C |∇G| = C |g (r)| rdr, |y| (log |y|) 2 k=1 |y|61/2 R 0

90 where the last step we used the critical Hardy inequality with logarithmic correction (see, e.g. Theorem 2.8 1 + in [48]). Since in our case g ∈ H (R , rdr), from the last displayed inequality we have Ik → 0 as k → ∞.

Therefore, we can use the cutoff function φεk to prove the claim. N The Meyers-Serrin theorem we just proved for (R+, rdr) can be extended to (R , dµ) using local changes (Σ) 1 of coordinates if β > 1. Thus one may prove that the form domain of the operator L is H (Σ, dµΣ) and the dynamics (12.25) is well defined on this space.

91 13 Entropy and the logarithmic Sobolev inequality (LSI)

13.1 Basic properties of the entropy In this section, we prove a few well-known inequalities for the entropy which will be used in next section for the logarithmic Sobolev inequality (LSI). In this section we work on an arbitrary probability space (Ω, B) with an appropriate σ-algebra. We will use the probabilistic and the analysis notations and concepts in parallel. In particular, we interchangeably use the concept of random variables and measurable functions on µ R Ω. Similarly, both the expectation E [f] and the integral notation Ω fdµ will be used. We will drop the Ω in the notation. Definition 13.1. For any two probability measures ν, µ on the same probability space, we define the relative entropy of ν w.r.t. µ by Z dν Z dν dν S(ν|µ) := log dν = log dµ (13.1) dµ dµ dµ if ν is absolutely continuous with respect to µ and we set S(ν|µ) := ∞ otherwise. Applying the definition (13.1) to ν = fµ for any probability density f with R fdµ = 1, we may also write Z Sµ(f) := S(fµ|µ) = f(log f)dµ.

In most cases, the reference measure µ will be canonical (e.g. the natural equilibrium measure), so often we will drop the subscript µ if there is no confusion. We may then call Sµ(f) = S(f) the entropy of f. Using the convexity of the function x 7→ x log x on R+, a simple Jensen’s inequality,  Z   Z  Z 0 = fdµ log fdµ 6 f log fdµ, shows that the relative entropy is always non-negative, S(ν|µ) > 0.

Proposition 13.2 (Entropy inequality or Gibbs inequality). Let µ, ν be two probability measures on a common probability space and let X be a random variable. Then, for any positive number α > 0, the following inequality holds as long as the right hand side is finite.

ν −1 −1 µ αX E [X] 6 α S(ν|µ) + α log E e . (13.2) Proof. First note that we only have to prove the case α = 1 since we can redefine αX → X. From the concavity of the logarithm and Jensen’s inequality, we have Z Z h dµi Xdν − S(ν|µ) = log eX dν log ν eX , dν 6 E and this proves (13.2). Although (13.2) is just an inequality, there is a variational characterization of the relative entropy behind it. Namely, h ν µ X i S(ν|µ) = sup E [X] − log E e , (13.3) X where the supremum is over all bounded random variables. As we will not need this relation in this book, we leave it to the interesting readers to prove it. As a corollary of (13.2) we mention that for any set A we have the bound

ν log 2 + S(ν|µ) P (A) 6 1 . (13.4) log µ P (A)

92 The proof is left as an exercise. This inequality will be used in the following context. Suppose that the relative entropy of ν with respect to µ is finite. Then in order to show that a set has a small probability w.r.t ν, we need to verify that this set is exponentially small w.r.t. µ. In this sense, entropy provides only a relatively weak link between two measures. However, it is still stronger than the total variational norm which we will show now. For any two probability measures fµ and µ, the Lp distance between fµ and µ is defined by h Z i1/p |f − 1|pdµ . (13.5)

When p = 1, it is called the total variational norm between fµ and µ. Entropy is a weaker measure of distance between two probability measures than the Lp distance for any p > 1 but stronger than the total variational norm. The former statement can be expressed, for example, by the elementary inequality Z h Z i1/p 2 Z f log fdµ 2 |f − 1|pdµ + |f − 1|pdµ, p > 1, (13.6) 6 p − 1 whose proof is left as an exercise. (There is a more natural way to compare Lp distance and the relative entropy in terms of the L log L Orlicz norm but we will not use this norm in this book.) The latter statement will be made precise in the following proposition. Furthermore, we also remark the following easy relation among Lp norms and the entropy: d h Z i1/p Z f pdµ = f log fdµ (13.7) dp p=1 holds for any probability density f w.r.t. µ.

R Proposition 13.3 (Entropy and total variation norm, Pinsker inequality). Suppose that fdµ = 1 and f > 0. Then we have h Z i2 Z 2 |f − 1|dµ 6 f log fdµ. (13.8)

Proof. By variational principle, we first rewrite Z Z Z |f − 1|dµ = sup fgdµ − gdµ. (13.9) |g|61 For any such function g, we have by the entropy inequality (13.2) that for any t > 0 Z Z Z Z Z −1 tg −1 fgdµ − gdµ 6 t log e dµ − gdµ + t f log fdµ. (13.10)

We now define the function Z h(t) := log etgdµ for any t > 0. A simple calculation shows that the second derivative of h is given by etgµ h00(t) = hg; gi , ω := , ωt t R etgdµ

where ωt is a probability measure and Z  Z  Z  hf; giω := fgdω − fdω gdω

denotes the covariance. Recall that the covariance is positive definite, i.e., it satisfies the usual Schwarz inequality, 2 |hf; giω| 6 hf; fiωhg; giω.

93 00 0 2 Since |g| 6 1, we have 0 6 hg; giωt 6 1, so h (t) 6 1. Thus by Taylor’s theorem, h(t) 6 h(0) + th (0) + t /2, i.e., Z Z −1 tg t log e dµ 6 gdµ + t/2 . Together with (13.10), we have s Z Z t Z 1 Z fgdµ − gdµ + t−1 f log fdµ f log fdµ , 6 2 6 2 where we optimized t in the last step. Since this bound holds for any g with |g| 6 1, using (13.9) we have proved (13.8).

13.2 Entropy on product spaces and conditioning Now we consider a product measure, i.e., we assume that the probability space has a product structure, Ω = Ω1 × Ω2 and µ = µ1 ⊗ µ2, ν = ν1 ⊗ ν2, where µj and νj are probability measures on Ωj, j = 1, 2. For simplicity, we denote the elements of Ω by (x, y) with x ∈ Ω1, y ∈ Ω2. Writing νj = fjµj, clearly we have ZZ   S(ν1 ⊗ ν2|µ1 ⊗ µ2) = Sµ1⊗µ2 (f1 ⊗ f2) = f1(x)f2(y) log f1(x) + log f2(y) µ1(dx)µ2(dy) Ω1×Ω2

= Sµ1 (f1) + Sµ2 (f2) = S(ν1|µ1) + S(ν2|µ1), (13.11) i.e., the entropy is additive for product measures. This property of the entropy makes it an especially suitable tool to measure closeness in very high ⊗N dimensional analysis. For example, if the probability space is a large product space Ω = Ω1 equipped with ⊗N ⊗N two product measures µ = µ1 and ν = f µ, then their relative entropy ⊗N Sµ(f ) = NSµ1 (f) (13.12) grows only linearly in N, the number of degrees of freedom. It is easy to check that any Lp distance (13.5) for p > 1 grows exponentially in N, which usually renders it useless. Thus the entropy is still stronger than the L1 norm, but its growth in N is much more manageable. An additional advantage is that the entropy is often easier to compute than any other Lp-norm (with the exception of L2).

Next we discuss how the entropy decomposes under conditioning. For simplicity, we assume the product structure, Ω = Ω1 ×Ω2 as before. Let ω be a probability measure on the product space Ω. For any integrable function (random variable) u(x, y) on Ω we denote its conditional expectation by

ω ub = E [u|F1], (13.13) where F1 is the σ-algebra of Ω1 canonically lifted to Ω. Note that ub = ub(x) depends only on the x variable. The conditional expectation ub may also be characterized by the following relation: it is the unique measurable function on Ω1 such that for any bounded measurable (w.r.t. F1) function O(x) the following identity holds: ω  ω  E uOb = E uO . Written out in coordinates, it means that ZZ ZZ ub(x)O(x)ω(dx dy) = u(x, y)O(x)ω(dx dy). (13.14) Ω Ω In particular, if ω(dxdy) = ω(x, y)dxdy is absolutely continuous with respect to some reference product measure dxdy on Ω, then we have, somewhat formally, R u(x, y)ω(x, y)dy u(x) = Ω2 . (13.15) b R ω(x, y)dy Ω2

94 The notation dxdy for the reference measure already indicates that in our applications Ωj will be Euclidean nj spaces R of some dimension nj, and the reference measure will be the Lebesgue measure. We remark that the concept of conditioning can be defined in full generality with respect to any sub-σ- algebra; the product structure of Ω is not essential. However, we will not need the general definition in this book and the above definition is conceptually simpler. The conditional expectation gives rise to a trivial martingale decomposition:

u = ub + (u − ub) , (13.16) where ub is F1-measurable, while u − ub has zero expectation on any F1-measurable set. Subtracting the ω ω expectationu ¯ := E u = E ub and squaring this formula, we have the martingale decomposition of the variance of u:

ω ω 2 ω 2 ω 2 ω ω Var (u) := E (u − u¯) = E (u − ub) + E (ub − u¯) = E Var(u(x, ·)) + Var (ub) , (13.17) where we defined the conditional variance

2 2 2 Var(u(x, ·)) := E[(u − ub) |F1] = E[u |F1] − [ub] . The identity (13.17) is a triviality, but its interpretation is important. It means that the variance is additive w.r.t. the martingale decomposition (13.16). The first term Eω Var(u(x, ·)) is the expectation of the variance 2 w.r.t. y conditioned on x; the second term Var(ub − u¯) is the variance of the marginal w.r.t. x. In other words, we can compute the variance one by one. The martingale decomposition has an analogue for the entropy. For simplicity, we assume that ω has a density ω(x, y) w.r.t. a reference measure dxdy. Denote by ωe the marginal probability density on Ω1, Z ωe(x) = ω(x, y)dy , Ω2

and by fb the marginal density of fω w.r.t. ωe, i.e., R Ω f(x, y)ω(x, y)dy fb(x) = 2 . ωe(x) In particular, for any test function O(x) we have ZZ Z O(x)f(x, y)ω(x, y)dxdy = O(x)fb(x)ωb(x)dx . (13.18) Ω Ω1 Let ω(x, y) ωx(y) := ωe(x)

be the probability density on Ω2 conditioned on a fixed x ∈ Ω1. Define

f(x, y) fx(y) := (13.19) fb(x)

to be the corresponding density of the measure fω conditioned on a fixed x ∈ Ω1. Note that with these definitions we have f(x, y)ω(x, y) (fω) (y) = = f (y)ω (y) . x R f(x, y)ω(x, y)dy x x Ω2 Now we are ready to state the following proposition on the additivity of entropy w.r.t. martingale decomposition:

95 Proposition 13.4. With the notations above, we have

S (f) = fbωeS (f ) + S (f) . (13.20) ω E ωx x ωe b In particular, the marginal entropy is bounded by the total entropy, i.e., S (f) S (f) . (13.21) ωe b 6 ω In other words, the entropy is additive w.r.t. the martingale decomposition, i.e., the entropy is the sum of the expectation of the entropy in y conditioned on x and the entropy of the marginal in x. The additivity of entropy in this sense is an important tool in the application of the LSI as we will demonstrate shortly. It also indicates that entropy is an extensive quantity, i.e., the entropy of two probabilities on a product space ΩN is often of order N, generalizing the formulas (13.11)-(13.12) to non-product measures. Proof. We can decompose the entropy as follows: Z Z Z Z Z Sω(f) = f log fdω = f log(f/fb)dω + f log fbdω = f log(f/fb)dω + fblog fbdωe , (13.22) Ω Ω Ω Ω1 where in the last step we used (13.18). The first term we can rewrite as Z Z h Z i f log(f/f)ω(x, y)dxdy = dxω(x)f(x) f (y) log f (y)ω (y)dy = fbωeS (f ) . (13.23) b e b x x x E ωx x

We have thus proved (13.20).

13.3 Logarithmic Sobolev inequality The logarithmic Sobolev inequality requires that the underlying probability space admits a concept of dif- ferentiation. While the theory can be developed for more general spaces, we restrict our attention to the probability space Ω = RN in this subsection. We will work with probability measures µ on RN that are defined by a Hamiltonian H: e−H(x) dµ(x) = dx, (13.24) Z where Z is a normalization factor. In Section 13.7 we will comment on how to extend the results of this N N section to certain subsets of R , most importantly, to the simplex ΣN = {x ∈ R : x1 < x2 < . . . < xN }. Note that for simplicity in this presentation we neglect the βN prefactor compared with (12.13). Let L be the generator of the dynamics associated with the Dirichlet form Z Z N Z 2 X 2 D(f) = Dµ(f) = − f(Lf)dµ := |∇f| dµ = (∂jf) dµ, ∂j = ∂xj . (13.25) j=1 Formally we have L = ∆ − (∇H) · ∇. The operator L is symmetric with respect to the measure dµ, i.e., Z Z Z f(Lg)dµ = (Lf)gdµ = − ∇f · ∇gdµ. (13.26)

We remark that in many books on probability, e.g. [43], the Dirichlet form (13.25) is defined with a factor 1/2, but this convention is not compatible with the 1/(βN) prefactor in (12.14). The lack of this 1/2 factor in (13.25) causes slight deviations from their customary form in the following theorems.

Definition 13.5. The probability measure µ on RN satisfies the logarithmic Sobolev inequality if there exists a constant γ such that Z Z p 2 p S(f) = f log f dµ 6 γ |∇ f| dµ = γD( f) (13.27) R holds for any smooth density function f > 0 with fdµ = 1. The smallest such γ is called the logarithmic Sobolev inequality constant of the measure µ.

96 A simple density argument shows that (13.27) extends from smooth functions to any nonnegative function ∞ N f ∈ C0 (R ), the space of smooth functions with compact support. In fact, it is easy to extend it to all √ 1 1 ∞ N f ∈ H (dµ). Since we will not use the space H (dµ) in this book, we will just use f ∈ C0 (R ) in the LSI.

Theorem 13.6 (Bakry-Emery)´ . [13] Consider a probability measure µ on RN of the form (13.24), i.e., µ = e−H/Z. Suppose that a convexity bound holds for the Hamiltonian, i.e., with some positive constant K we have 2 ∇ H(x) > K (13.28) for any x (in the sense of quadratic forms). Then the logarithmic Sobolev inequality holds (13.27) with an LSI constant γ 6 2/K, i.e., 2 Z S(f) D(pf), for any density f with fdµ = 1. (13.29) 6 K Furthermore, the dynamics ∂tft = Lft, t > 0, (13.30) relaxes to equilibrium on the time scale t  1/K both in the sense of entropy and Dirichlet form: 2 S(f ) e−2tK S(f ),D(pf ) e−tK S(f ). (13.31) t 6 0 t 6 t 0

Proof. Let ft be the solution to the evolution equation (13.30) with a given smooth initial condition f0. Simple calculation shows that the derivative of the entropy S(ft) is given by

Z Z Z 2 Lft (∇ft) p ∂tS(ft) = (Lft) log ftdµ + ft dµ = − dµ = −4D( ft), (13.32) ft ft R where we√ used that Lftdµ = 0 by (13.26). Similarly, we can compute the evolution of the Dirichlet form. Let ht := ft for simplicity, then

1 2 1 2 1 2 ∂tht = ∂tht = Lht = Lht + (∇ht) . 2ht 2ht ht We compute (dropping the t subscript for brevity) Z Z p 2 ∂tD( ft) =∂t (∇h) dµ = 2 ∇h · ∇∂th Z Z (∇h)2 =2 (∇h) · (∇Lh)dµ + 2 (∇h) · ∇ dµ h Z Z Z 2 X h2(∂jh)∂i∂jh (∂jh) ∂ihi =2 (∇h) · [∇, L]hdµ + 2 (∇h) ·L(∇h)dµ + 2 ∂ h − dµ i h h2 ij Z Z Z 2 2 X X h2(∂jh)(∂ih)∂ijh (∂jh) (∂ih) i = − 2 (∇h) · (∇2H)∇hdµ − 2 (∂ ∂ h)2dµ + 2 − dµ i j h h2 ij ij Z Z 2 X  (∂ih)(∂jh) = − 2 (∇h) · (∇2H)∇hdµ − 2 ∂ h − dµ, (13.33) ij h ij

where we used the commutator [∇, L] = −(∇2H)∇. Therefore, under the convexity condition (13.28), we have p p ∂tD( ft) 6 −2KD( ft). (13.34)

97 Integrating (13.34), we have p −2tK p D( ft) 6 e D( f0). (13.35) This proves that the equilibrium is achieved at t = ∞ with f∞ = 1, and both the entropy√ and the Dirichlet form are zero. Integrating (13.32) from t = 0 to t = ∞, and using the monotonicity of D( ft) from (13.34), we obtain Z ∞ Z ∞ p p −2tK 2 p −S(f0) = −4 D( ft)dt > −4D( f0) e dt = − D( f0). (13.36) 0 0 K This proves the LSI (13.29) for any smooth function f = f0√. By a standard density argument, it can be extended to any function f with finite Dirichlet form, D( f) < ∞. For functions with unbounded Dirichlet form we√ interpret the LSI (13.29) as a tautology. In particular, (13.29) holds for ft as well, S(ft) 6 (2/K)D( ft). Inserting this back to (13.32), we have

∂tS(ft) 6 −2KS(ft). Integrating this inequality from time zero, we obtain the exponential relaxation of the entropy on time scale t  1/K −2tK S(ft) 6 e S(f0). (13.37) Finally, we can integrate (13.32) from time t/2 to t to get

t Z p S(ft) − S(ft/2) = −4 D( fτ )dτ. t/2

Using the positivity of the entropy S(ft) > 0 on the left side and the monotonicity of the Dirichlet form (from (13.34)) on the right side, we get 2 D(pf ) S(f ), (13.38) t 6 t t/2 thus, using (13.37), we obtain exponential relaxation of the Dirichlet form on time scale t  1/K 2 D(pf ) e−tK S(f ). t 6 t 0

Standard example. Let µ be the centered Gaussian measure with variance a2 on Rn, i.e.,

1 2 2 dµ(x) = e−x /2a dx. (13.39) (2πa2)n/2

Written in the form e−H/Z with the quadratic Hamiltonian H(x) = x2/2a2, we see that (13.28) holds with K = a−2. Then (13.29) with the replacement f → f 2 yields that Z Z 2 2 2 2 f log f dµ 6 2a (∇f) dµ (13.40)

for any normalized R f 2dµ = 1. The constant of this inequality is optimal. Alternative formulation of LSI. We remark that there is another formulation of the LSI (see [99]). To make this connection, let 1 2 2 g(x) := f(x) e−x /4a , (13.41) (2πa2)n/4 and notice that R g2(x)dx = R f 2dµ, where dx is the Lebesgue measure and dµ is of the form (13.39). Rewriting (13.40) to an inequality for g and with the replacement 2πa2 → a2, we get the following version of the LSI Z Z Z 2 2 2 2 2 2 g log(g /kgk )dx + n[1 + log a] g dx 6 (a /π) |∇g| dx (13.42) n n n R R R

98 that holds for any a > 0 and any function g, where kgk = (R g2dx)1/2 is the L2 norm with respect to the Lebesgue measure.

Proposition 13.7 (LSI implies spectral gap). Let µ satisfy the LSI (13.27) with a LSI constant γ. Then for any v ∈ L2(µ) with R vdµ = 0 we have

Z γ Z γ v2dµ |∇v|2dµ = D(v) , (13.43) 6 2 2 i.e., µ has a spectral gap of size at least γ/2.

Proof. By definition of the LSI constant, we have Z √ u log udµ 6 γD( u) for any u with R udµ = 1. For any bounded smooth function v with R vdµ = 0, define u = 1 + εv. Then we have Z γ Z |∇v|2 ε−2 (1 + εv) log(1 + εv)dµ dµ . (13.44) 6 4 1 + εv 1 R 2 Taking the limit ε → 0, we get that the right hand side converges to 2 v dµ by dominated convergence. This proves the proposition.

Proposition 13.8 (Concentration inequality (Herbst bound)). Suppose that the measure µ satisfies the LSI with a constant γ. Let F be a function with EµF = 0. Then we have γ  µ e F exp k∇F k2 . (13.45) E 6 4 ∞ where s X 2 k∇F k∞ := sup |∂iF (x)| . x i In particular, we have 2 µ  α  P (|F | > α) 6 exp − 2 (13.46) γk∇F k∞ for any α > 0.

Notice that we get an exponential tail estimate from the LSI. If we only have the spectral gap estimate, (13.43), we can only bound the variance of F which yields a quadratic tail estimate

γk∇F k2 µ(|F | α) ∞ . P > 6 2α2

2 −1 In our typical applications, we have γk∇F k∞  N . We often need to control the concentration of many ( N C ) different functions F in parallel. Thus the simple union bound is applicable with the LSI but not with the spectral gap estimate. Proof. Denote by exp etF  u = u(t) := . Eµ exp etF By differentiation and the LSI, we have d h i √ e−t log µ exp etF  = e−t µu log u e−tγ µ|∇ u|2. (13.47) dt E E 6 E

99 Clearly, √ e2t µ|∇ u|2 k∇F k2 . (13.48) E 6 4 ∞ Integrating from any t < 0 to 0 yields that Z 0 h µ i h −t µ t i γ 2 s log E exp F − e log E exp e F = k∇F k∞ ds e . (13.49) 4 t From the condition EµF = 0 we have

h −t µ t i 1 µ εF 1 µ 2 εF  lim e log E exp e F = lim log E e = lim log E 1 + εF + O(ε e ) = 0 t→−∞ ε→0 ε ε→0 ε by dominated convergence. This proves the inequality (13.45). The concentration bound (13.46) follows from an exponential Markov inequality:  γ  µ(F α) µet(F −α) exp − αt + t2 k∇F k2 , P > 6 E 6 4 ∞ 2 where we chose the optimal t = 2α/[γk∇F k∞] in the last step. Changing F → −F we obtain the opposite bound and thus we proved (13.46).

Proposition 13.9 (Stability of the LSI constant). Consider two measures µ, ν on RN that are related by ν = gµ with some bounded function g. Let γµ denote the LSI constant for µ. Then

−1 γν 6 kgk∞kg k∞γµ. R R −1 Proof. Take an arbitrary function f > 0 with fdν = 1. Let α := fdµ 6 kg k∞ and by definition of the entropy we have Z Sµ(f/α) = (f/α) log(f/α)dµ.

The following inequality for any two nonnegative numbers a, b can be checked by elementary calculus:

a log a − b log b − (1 + log b)(a − b) > 0. Hence, Z h i f log f − α log α − (1 + log α)(f − α) dν Z h i 6 kgk∞ f log f − α log α − (1 + log α)(f − α) dµ = kgk∞αSµ(f/α) .

The left hand side of the last inequality is equal to

Sν (f) − [log α − (α − 1)] > Sν (f) , where we have used the concavity of log. This proves that

Sν (f) 6 kgk∞αSµ(f/α) .

Suppose that the LSI holds for µ with constant γµ. Then we have Z Z p 2 p 2 −1 Sν (f) 6 kgk∞γµα (∇ f/α) dµ = kgk∞γµ (∇ f) g dν Z −1 p 2 6 kgk∞kg k∞γµ (∇ f) dν , and this proves the lemma.

100 Proposition 13.10 (Tensorial property of the LSI). Consider two probability measures µ, ν on Rn, Rm respec- tively. Suppose that the LSI holds for them with LSI constants γµ and γν , respectively. Let ω = µ ⊗ ν be the m+n product measure on R . Then the LSI holds for ω with LSI constant γω 6 max{γµ, γν }. n m Proof. We will use the notations introduced in Section 13.2 with Ω1 = R ,Ω2 = R . Recall the additivity of entropy (13.20) w.r.t. martingale decomposition. In the current situation, ω = µ ⊗ ν and thus ωb = µ and ωx = ν for any x. Furthermore, Z Z −1 fb(x) = ωb(x) f(x, y)ω(x, y)dy = f(x, y)ν(y)dy . (13.50) By the additivity of entropy and the LSI w.r.t. µ and ν, we have

fµb Sω(f) = E Sν (fx) + Sµ(fb) s Z Z |∇ pf(x, y)|2 Z Z y 6 γν fb(x)µ(x)dx ν(y)dy + γµ ∇x f(x, y)ν(y)dy µ(x)dx. (13.51) fb(x) The integral in the first term on the right hand side is equal to ZZ p 2 |∇y f(x, y)| µ(x)ν(y)dxdy.

The integral of the second term on the right hand side is bounded by Z R 2 ZZ ∇xf(x, y)ν(y)dy p 2 µ(x)dx ∇x f(x, y) µ(x)ν(y)dxdy, 4 R f(x, y)ν(y)dy 6 √ √ where we have written ∇xf = 2(∇x f) f and used the Schwarz inequality. Summarizing, we have proved that Z Z p 2 p 2 Sω(f) 6 γν |∇y f(x, y)| ω(x, y)dxdy + γµ ∇x f(x, y) ω(x, y)dxdy, and this proves the proposition.

The following lemma is a useful tool to control the entropy flow w.r.t non-equilibrium measure.

Lemma 13.11. [139] Suppose we have evolution equation ∂tft = Lft with L defined via the Dirichlet form P R 2 hf, (−L)fiµ = j (∂jf) dµ. Then for any time dependent probability density ψt w.r.t. µ, we have the entropy flow identity Z Z X √ 2 ∂tSµ(ft|ψt) = −4 (∂j gt) ψt dµ + gt(L − ∂t)ψt dµ , j

where gt := ft/ψt and Z ft Sµ(ft|ψt) := ft log dµ = S(ftµ|ψtµ) ψt is the relative entropy of ftµ w.r.t. ψtµ. Proof. A simple computation then yields that Z Z Z d Lft Sµ(ft|ψt) = (Lft)(log gt) dµ + ft dµ − (∂tψt) gt dµ dt ft Z Z = (Lft)(log gt) dµ − (∂tψt) gt dµ Z Z = (gtψt) L(log gt) dµ − gt∂tψt dµ Z Z h Lgt i = ψt gtL(log gt) − gt dµ + gt(L − ∂t)ψt dµ. gt

101 By definition of L, we have

2 Lg X (∂jg) X √ 2 L(log g) − = − = −4 ∂ g , (13.52) g g j j j and we have proved Lemma 13.11. We remark that stochastic processes and their generators can be defined in more general setups, e.g. the underlying probability spaces may be different from RN and the generators may involve discrete jumps. In this case, the stochastic generator L is defined directly, without an a priori Dirichlet form. It can, however, be proved that −g[L(log g) − (Lg)/g] > 0 which can then be viewed as a generalization of the “Dirichlet form” associated with a generator.

13.4 Hypercontractivity We now present an interesting connection between the LSI of a probability measure µ and the hypercon- tractivity properties of the semigroup generated by L = Lµ. Since this result will not be used later in this book, this section can be skipped.

To state the result, we define the semigroup {Pt}t>0 by Ptf := ft, where ft solves the equation ∂tft = Lft with initial condition f0 = f. n Theorem 13.12 (L. Gross [78]). For a measure µ on R and for any fixed constants β > 0 and γ > 0 the following two statements are equivalent:

(i) The generalized LSI Z Z p 2 R f log f dµ 6 γ |∇ f| dµ + β, for any f > 0 with fdµ = 1, (13.53)

holds; (ii) The hypercontractivity estimate  1 1 kP fk q exp β − kfk p (13.54) t L (µ) 6 p q L (µ) holds for all exponents satisfying q − 1 e4t/γ , 1 < p q < ∞. p − 1 6 6

Proof. We will only prove (i)⇒(ii), i.e., that the generalized LSI implies the decay estimate, the proof of the converse statement can be found in [43]. First we assume that f > 0, hence ft > 0. Direct differentiation yields the following identity d p˙(t)  4(p(t) − 1) Z  log kf k = − D(u(t)) + u(t)2 log(u(t)2)dµ , (13.55) dt t p(t) p(t)2 p˙(t)

d withp ˙(t) = dt p(t) and where we defined Z p(t)/2 −p(t)/2 2 u(t) := ft kftkp(t) ,D(u) = (∇u) dµ.

Now choose p(t) to solve the differential equation 4(p(t) − 1) γ = , with p(0) = p, i.e., p(t) = 1 + (p − 1)e4t/γ , p˙(t)

102 where γ is the constant given in the theorem. Using (13.53) for the L1(µ)-normalized function u(t)2, we have d βp˙(t) log kf k . dt t p(t) 6 p(t)2 Integrating both sides, from t = 0 to some T we have

h1 1 i log kf k − log kfk β − . T p(T ) p 6 p p(T )

Choosing T such that p(T ) = q, we have proved (13.54) for f > 0. The general case follows from separating the positive and negative parts of f.

Exercise. In this exercise, we show that the idea of the LSI can be useful even if the invariant measure is not a probability measure. The sketch below follows the paper by Carlen-Loss [33] and it works for any parabolic equation of the type

∂tft = [∇ · (D(x, t)∇) + b(x, t) · ∇]ft , for any divergence free b and D(x, t) > c > 0. For simplicity of notation, we consider only the heat equation on Rn ∂tft = ∆ft . (13.56)

The invariant measure of this flow is the standard Lebesgue measure on Rn.

(i) Check the formula (13.55) in this setup, i.e., show that the following identity holds:

d 1 d Z p˙(t) log kf k = kf k−p(t) f p(t)dx − log kf kp (13.57) dt t p(t) p(t) t p(t) dt t p(t)2 t p p˙(t)  4(p(t) − 1) Z Z  = − (∇u(t))2dx + u(t)2 log(u(t)2)dx , p(t)2 p˙(t)

where all norms are w.r.t. the Lebesgue measure and

p/2 −p/2 u(t) = ft kftkp , p = p(t) .

(ii) For any t fixed, choose the constant in (13.42) by setting

4(p(t) − 1) a2 = . p˙(t)

Thus we have d np˙(t)  1 4π(p(t) − 1) log kf k − 1 + log . dt t p(t) 6 p(t)2 2 p˙(t) For a suitable choice of the function p(t) with p(0) = 1, p(T ) = ∞, show that

−n/2 kfT k∞ 6 (4πT ) kf0k1. (13.58)

We remark that in the case of D ≡ 1 and b ≡ 0, this heat kernel estimate is also a trivial consequence of the explicit formula for the heat kernel et∆(x, y). The point is, as remarked earlier, that this proof works for a general class of second order parabolic equations.

103 13.5 Brascamp-Lieb inequality The following inequality will not be needed for the main result of this book, so the reader may skip it. Nevertheless, we included it here, since it is an important tool in the analysis of probability measures in very high dimensions and it was used in universality proofs for the invariant ensembles [25]. We will formulate it on RN and in Section 13.7 we comment on how to extend it to certain probability measures on the simplex ΣN defined in (12.3).

Theorem 13.13 (Brascamp-Lieb inequality [29]). Consider a probability measure µ on RN of the form (13.24), i.e., µ = e−H/Z. Suppose that the Hamiltonian is strictly convex, i.e.,

2 ∇ H(x) > K > 0 (13.59) as a matrix inequality for some positive constant K. Then for any smooth function f ∈ L2(µ) function we have  2 −1 hf; fiµ 6 h∇f, ∇ H ∇fiµ . (13.60) R Recall that hf, giµ = fgdµ denotes the scalar product and hf; giµ = hf, giµ − h1, fiµh1, giµ is the PN covariance. With a slight abuse of notation, we also use the notation hF, Giµ = i=1hFi,Giiµ for the scalar product of any two vector-valued functions F, G : RN → RN ; this extended scalar product is used in the right hand side of (13.60). Remark: Notice that the Brascamp-Lieb inequality is optimal in the sense that, if µ is a Gaussian measure, then we can find f so that the inequality becomes equality. Moreover, if we use the matrix 2 inequality ∇ H > K, then (13.60) is reduced to the spectral gap inequality (13.43), but of course (13.60) is stronger. The following proof is due to Helffer-Sj¨ostrand[80] and Naddaf-Spencer [108]. Proof. Recall that L is the generator for a reversible dynamics with reversible measure µ defined in (13.26). For the dynamics (13.30), we have

d X hf, etLgi = hf, LetLgi = − h∂ f, ∂ etLgi. dt xj xj j

Define  tL  G(t, x) := (G1(t, x),...,GN (t, x)),Gj(t, x) := ∂xj e g(x) . Recall the explicit form of L: X L = ∆ − (∇H) · ∇ = ∆ + bk∂xk , bk := −∂xk H. k

Define a new generator L on any vector-valued function G : RN → RN by X (LG)j(x) := LGj(x) + (∂xj bk)Gk(x). k Clearly, by explicit computation, we have

∂tG(t, x) = LG(t, x).

tL From the decay estimate (13.31), the dynamics are mixing, i.e., limt→∞hf, e gi = 0 for any smooth function f with R fdµ = 0. Thus for any such function f we have

Z ∞ d X Z ∞ Z ∞ hf, gi = − hf, etLgi dt = h∂ f, G (t, x)i dt = h∇f, etL∇gi dt µ dt µ xj j µ µ 0 j 0 0 −1 = h∇f, (−L) ∇giµ.

104 Taking g = f and from the definition of L, we have

−(L)i,j = −Lδi,j − (∂xi bj) > (∂xi ∂xj H) as an operator inequality and we have proved that

−1  2 −1 hf, fiµ = h∇f, (−L) ∇fiµ 6 h∇f, ∇ H ∇fiµ.

13.6 Remarks on the applications of the LSI to random matrices The Herbst bound provides strong concentration results for probability measures that satisfy the LSI. In this section we exploit this connection for random matrices. For definiteness, we will consider the Gaussian orthogonal ensemble (GOE), although the arguments below can be extended to more general Wigner as well as invariant ensembles. There are two ways to view Gaussian matrix ensembles in the context of the LSI: we can either work on the probability space of the matrix H with Gaussian matrix elements, or we can work directly on the space of the eigenvalues λ by using the explicit formula (4.12) for invariant ensembles. Both measures satisfy the LSI (in the next Section 13.7 we will comment on how to generalize the LSI from RN to the simplex ΣN , (12.3), a version of the LSI that we will actually use). We will explore both possibilities, and we start with the matrix point of view.

13.6.1 LSI from the Wigner matrix point of view

2 Let µ = µG be the standard Gaussian measure µ ∼ exp (−N Tr H ) on symmetric matrices. Notice that 1/2 for this measure the family {xij = N hij, i 6 j} is a collection of independent standard Gaussian random variables (up to a factor 2). Hence the LSI holds for every matrix element and by its tensorial property, i.e., Proposition 13.10, the LSI holds for any function of the full matrix considered as a function of this collection 1/2 {xij = N hij, i 6 j}. In particular, using the spectral gap estimate (13.43) for the Gaussian variables xij, we have for any function F = F (H), X Z C X Z hF (H); F (H)i C |∂ F (H)|2dµ = |∂ F (H)|2dµ, (13.61) µ 6 xij N hij i6j i6j

−1 −1/2 where the additional N factor comes from the scaling hij = N xij. Suppose that F (H) = R(λ(H)) is a real function of the eigenvalues λ = (λ1, . . . , λN ), then by the chain rule we have C Z C Z ∂λ 2 X 2 X X α |∂hij R(λ(H))| dµ = ∂λα R(λ) dµ. N N ∂hij i6j i6j α Expanding the square and using the perturbation formula (12.9) with real eigenvectors, we can compute 1 Z ∂λ 2 1 2 Z ∂λ 2 X X α X X α ∂λα R(λ) dµ 6 ∂λα R(λ) dµ N ∂hij N 2 − δij ∂hij i6j α i6j α Z 1 X 2 X ∂λα ∂λβ = ∂λα R(λ)∂λβ R(λ) dµ N 2 − δij ∂hij ∂hij i6j α,β 2 X X Z = [2 − δ ] ∂ R(λ)∂ R(λ)u (i)u (j)u (i)u (j)dµ N ij λα λβ α α β β i6j α,β 2 X X Z = ∂ R(λ)∂ R(λ)u (i)u (j)u (i)u (j)dµ N λα λβ α α β β i,j α,β 2 X Z = |∂ R(λ)|2dµ, (13.62) N λα α

105 where we have used the orthogonality property and the normalization convention of the eigenvectors in the last step. We remark that the annoying factor [2 − δij] can be avoided if we first consider λα as function of all {xij : 1 6 i, j 6 N} as independent variables. Then the perturbation formula (12.9) becomes ∂λ α = uα(i)uα(j), (13.63) ∂hij hij =hji i.e., the derivative evaluated on the submanifold of Hermitian matrices. In this way we can keep the sum- mation in (13.61) unrestricted and up to a constant factor, we will get the same final result as in (13.62). In summary, we proved C X Z hF ; F i = hR(λ(H)); R(λ(H))i |∂ R(λ)|2dµ. (13.64) µ µ 6 N λα α Notice that this argument holds for any generalized Wigner matrix as long as a spectral gap estimate (13.43) 1/2 holds for the distribution of every rescaled matrix element N hij. Furthermore, it can be generalized for Wigner matrices with Bernoulli random variables for which there is a spectral gap (and LSI) in discrete form. Similar remarks apply to the following LSI estimates. To see how good this estimate is, consider a local linear statistics of eigenvalues, i.e., the local average of K consecutive eigenvalues with label near M, i.e., set 1 X α − M  R(λ) := A λ , K K α α where A a smooth function of compact support and 1 6 M,K 6 N. Computing the right hand side of (13.64) and combining it with (13.61), we get

C X α − M  CA hF ; F i A2 , (13.65) µ 6 NK2 K 6 NK α

where the last constant CA depends on the function A. For K = 1, this inequality estimates the square of the fluctuation of a single eigenvalue (choosing A appropriately). The bound (13.65) is off by a factor N since the true fluctuation is of order almost 1/N by rigidity, Theorem 11.5, at least in the bulk, i.e., if δN 6 M 6 (1 − δ)N. On the other hand, for K = N the bound (13.65) is much more precise; it shows that the variance of a macroscopic average of the eigenvalues is at most of order N −2. This is the correct P α  order of magnitude, in fact, it is known that α A N λα converges to a Gaussian random variable, see, e.g. [102, 117] and references therein. Hence the spectral gap argument yields the optimal (up to constant) result for any macroscopic average of eigenvalues. Another common quantity of interest is the Stieltjes transform of the empirical eigenvalue distribution, i.e., 1 X 1 F (H) = G(λ(H)) with G(λ) = m (z) = , z = E + iη. (13.66) N N λ − z α α In this case, using (13.64), we can estimate C X Z C X Z 1 C hm (z); m (z)i |∂ G(λ)|2dµ = dµ , (13.67) N N µ 6 N λα N 3 |λ − z|4 6 N 2η4 α α α

where we used |λα − z| > Im z = η. −1/2 This shows that the variance of mN (z) vanishes if η  N . The last estimate can be improved if we know that the density of the empirical measure is bounded. Roughly speaking, if we know that in the summation over α, the number of eigenvalues in an interval of size η near E is at most of order Nη, then the last estimate can be improved to 1 X Z 1 1 X Z 1 C 1 C dµ dµ (Nη) = . (13.68) N 3 |λ − z|4 . N 3 |λ − z|4 6 N 3 η4 N 2η3 α α α α |λα−E|6η

106 −2/3 This shows that the variance of mN (z) vanishes if η  N . Since vanishing fluctuations can be used to estimate the density, this argument can actually be made rigorous by a bootstrapping argument [61]. Notice that the scale η  N −2/3 is still far from the resolution demonstrated in the local semicircle law, Theorem 6.7. Whenever the LSI is available, the variance bounds can be easily lifted to a concentration estimate with exponential tail. We now demonstrate this mechanism for the fluctuation of a single eigenvalue. In other µ words, we will apply (13.46) with F (H) = λα(H) − E λα(H) for a fixed α. From (12.9), we have

X 2 |∇ F |2 , xij 6 N i6j thus (13.46) implies µ µ  2 P (|λα − E λα| > t) 6 exp − cNt (13.69) for any α fixed. This inequality has two deficiencies compared with the rigidity bounds in Theorem 11.5: µ −1/2 −1 First, the control of |λα − E λα| is only up to N which is still far from the optimal N (in the bulk). µ Second, it does not give information on the difference between the expectation E λα and the corresponding classical location γα defined as the α-th N-quantile of the limiting density, see (11.31).

13.6.2 LSI from invariant ensemble point of view

Now we pass to the second point of view, where the basic measure µG is the invariant ensemble on the eigenvalues. One might hope that the situation can be improved since we look directly at the Gaussian 1 2 eigenvalue ensemble defined in (12.13) with the Gaussian choice for V (λ) = 2 λ . Notice that the role of H in Theorem 13.6 will be played by NHN defined in (12.13) (with β = 1). The Hessian of HN is given by (all inner products and norms in the following equation are w.r.t the standard inner product in RN )

2 1 1 X (vi − vj) 1 v, ∇2H (x)v kvk2 + kvk2, v ∈ N , (13.70) N > 2 N (x − x )2 > 2 R i

C X Z hR(λ); R(λ)i |∂ R(λ)|2dµ . (13.71) µG 6 N λα G α

Notice that this bound is in the same form as in (13.64). A similar statement holds for the LSI, i.e., we have

Z C X Z R log R dµ |∂ pR(λ)|2dµ . (13.72) G 6 N λα G α

We can apply the concentration inequality (13.46) to the function R(λ) = λα to get (13.69). Once again, we get exactly the same result as considered earlier, so there is no advantage of using the explicit invariant measure µG. We have seen that in both the matrix and the invariant ensemble setup, even for Gaussian ensembles, concentration estimates based on spectral gap or LSI do not yield optimal results for the eigenvalues on short scales. In the matrix setup, the proof of the optimal rigidity bounds for Wigner matrices, Theorem 11.5, uses a completely different approach independent of the LSI. In the invariant ensemble setup, while the convexity bound (13.70) cannot be improved for a general vector v, it becomes much stronger if we consider it only P for vectors v satisfying j vj = 0 due to the structure of the Hessian (13.70). In the next section, we will explore this idea to improve estimates on certain functions of the eigenvalues.

107 13.7 Extensions to the simplex, regularization of the DBM

In the above application to either spectral gap or LSI, the measure µG was restricted to the subset Σ = N ΣN = {x ∈ R : x1 < x2 < . . . < xN }. It is absolutely continuous with respect to the Lebesgue measure on ΣN , but if we write it in the form (13.24) then the Hamiltonian H has to be infinite outside of ΣN ; in particular it is not differentiable, a property that is implicitly assumed in Section 13.3. This issue is closely related to the boundary terms in the integration by parts, especially in (13.33), that arise if one formally extends the proofs to Σ. Therefore, the results of Theorem 13.6 do not apply directly. The proper procedure N involves an approximation and extension of the measure from ΣN to R , using the results of Section 13.3 for the regularized measure on RN and then removing the regularization. Whether the regularization can be removed for any β > 0 or only for β > 1 depends on the statement, as we explain now. For simplicity, we work in the setup of the Gaussian β-ensemble, i.e., we set

N X 1 1 X H(x) = x2 − log(x − x ), 4 i N j i i=1 i

−1 −NβH(x) and define µβ(x) = Zµ e on the simplex ΣN , exactly as in (12.13) with the Gaussian potential 1 2 V (x) = 2 x and with any parameter β > 0. As usual, Zµ is a normalization constant. Recall from (12.4) that the DBM on the simplex ΣN is defined via the stochastic differential equation √ 2  1 1 X 1  dxi = √ dBi + − xi + dt for i = 1,...,N, (13.73) βN 2 N xi − xj j6=i

where B1,...,BN is a family of independent standard Brownian motions. The unique strong solution exists only if β > 1 (Theorem 12.2) with equilibrium measure µβ. Since DBM does not exists on ΣN unless β > 1, we cannot use DBM either in the SDE form (12.4) or in the PDE form (12.17) when β < 1, see a remark below (12.17). On the other hand, certain results may be extended to any β > 0 if their final formulations do not involve DBM. In this section, we present a regularization procedure to show that substantial parts of the main results of Theorem 13.6, i.e., the LSI for any β > 0 and exponential relaxation decay of the entropy for β > 1, remain valid on the simplex ΣN . A similar generalization holds for the Brascamp-Lieb inequality. In the next section, the same regularization will be used to show that the key Dirichlet form inequality (Theorem 14.3) also hold for β > 0. For later applications, we work with a slightly bigger class of measures than just µβ. We consider measures on Σ = ΣN of the form −1 −βNHb ω = Zω e µβ, P where Hb(x) = j Uj(xj) for some convex real valued functions Uj on R. The total Hamiltonian of ω is N Hω := H + Hb. Note that Uj are defined and convex on the entire R . The entropy and the Dirichlet form are defined as before: Z Z 1 2 Sω(f) = f log f dω, Dω(f) = |∇f| dω. Σ βN Σ The corresponding DBM is given by √   2 1 0 1 X 1 dxi = √ dBi + − xi − Ui (xi) + dt. (13.74) βN 2 N xi − xj j6=i We assume a lower bound on the Hessian of the total Hamiltonian,

00 00 00 1 00 Hω = H + Hb > + min min Uj (x) > K (13.75) 2 j x∈R N for some positive constant K on the entire R . This bound (13.75) plays the role of (13.28). Let Dω, Sω and Lω denote the Dirichlet form, entropy and generator corresponding to the measure ω. Now we claim that Theorem 13.6 holds for the measure ω on ΣN in the following form:

108 Theorem 13.14. Assume (13.75). Then for β > 0, the LSI holds, i.e.,

2 S (f) D (pf) (13.76) ω 6 K ω

R ∞ √ ∞ for any nonnegative normalized density√ f on ΣN , fdω = 1, that satisfies f ∈ L and ∇ f ∈ L . For β > 1 the requirement that f and ∇ f are bounded can be removed. Moreover, the Brascamp-Lieb inequality also holds for any β > 0, i.e., for any bounded function f ∈ 2 L (ΣN , dω) we have  00 −1 hf; fiω 6 h∇f, Hω ∇fiω. (13.77) For β > 1 the requirement that f is bounded can be removed. Furthermore, for any β > 1 the dynamics ∂tft = Lωft with initial condition f0 = f is well defined on ΣN and it approaches to equilibrium in the sense that

−2tK Sω(ft) 6 e Sω(f). (13.78) The inequalities above are understood in the usual sense that they are relevant only when the right hand side is finite. Moreover, by a standard density argument they extend to the closure of the corresponding spaces. For example, (13.76) holds for any f that can be approximated by a sequence of bounded normalized ∞ √ ∞ √ √ densities fn ∈ L with ∇ fn ∈ L such that Dω( fn − f) → 0. Before starting the formal proof, we explain the key idea. For the proofs of (13.76) and (13.77), we extend N the measure ω from Σ to the entire R in a continuous way by relaxing the strict ordering xi < xi+1 imposed by Σ but heavily penalizing the opposite order. In this way we can use Theorem 13.6 for the regularized measure and we avoid the problematic boundary terms in the integration by parts in (13.33).√ At the end we remove the regularization using the additional boundedness assumptions on f and ∇ f. For β > 1 these ∞ 1 additional assumptions are not necessary using that C0 (Σ) functions are dense in H (Σ, dω). The entropy decay (13.78) will follow from the LSI and the time-integral version of the entropy dissipation (13.32) that can be proven directly on Σ if β > 1. We remark that there is an alternative regularization method that mimics the proof of Theorem 13.6 directly on Σ for β > 1. This is based on introducing carefully selected cutoff functions in order to ap- proximate ft by a function compactly supported on Σ. The compact support renders the boundary terms in the integration by parts zero, but the cutoff does not commute with the dynamics; the error has to be tracked carefully. The advantage of this alternative method is that it also gives the exponential decay of the Dirichlet form and it is also the closest in spirit to the strategy of the proof of Theorem 13.6 on RN . The disadvantage is that it works only for β > 1; in particular it does not yield the LSI for β ∈ (0, 1). We will not discuss this approach in this book, the interested reader may find details in Appendix B of [64].

δ −1 −βNHδ N Proof. For any δ > 0 define the extension µβ = Zµ,δe of the measure µβ from ΣN to R by replacing the singular logarithm with a C2-function. To that end, we introduce the approximation parameter δ > 0 and define for x ∈ RN , X 1 1 X H (x) := x2 − log (x − x ) , (13.79) δ 4 i N δ j i i i

where we set  x − δ 1  log (x) := 1(x δ) log x + 1(x < δ) log δ + − (x − δ)2 , x ∈ . δ > δ 2δ2 R

2 It is easy to check that logδ ∈ C (R), is concave, and satisfies ( log x if x > 0 lim logδ(x) = δ→0 −∞ if x 6 0 .

109 The convergence is monotone decreasing, i.e.,

logδ(x) > logδ0 (x), x ∈ R, (13.80)

0 for any δ > δ . Furthermore, we have

( 1 x if x > δ ∂x logδ(x) = 2δ−x δ2 if x 6 δ , and the lower bound

( 1 2 − x2 if x > δ ∂x logδ(x) > 1 − δ2 if x 6 δ ,

00 1 N in particular Hδ is convex with Hδ > 2 . Similarly, we can extend the measure ω to R by setting

−1 −βNHb δ ωδ := Zω,δe µβ , (13.81)

and its Hamiltonian still satisfies (13.75). By the monotonicity (13.80), we know that e−βNHδ (x) is pointwise monotonically decreasing in δ and clearly Zµ,δ and Zω,δ converge to 1 as δ → 0. Clearly, ωδ → ω as δ → 0 N in the sense that for any set A ⊂ R we have ωδ(A) → ω(A ∩ ΣN ). We remark that the regularized DBM corresponding to the measure ωδ is given by √ 2  1 1 X 1 X  dx = √ dB + − x − U 0(x ) + log0 (x − x ) − log0 (x − x ) dt. (13.82) i βN i 2 i i i N δ i j N δ j i ji

R √ R √ 2 Given a function f on ΣN such that fdω = 1 and Dω( f) = |∇ f| dω < ∞, we extend it by ΣN ΣN N symmetry to R . To do so, we first define the symmetrized version of ΣN , i.e., N  ΣeN := R \ x : ∃i 6= j, xi = xj , (13.83) which has the disjoint union structure [ ΣeN = π(ΣN ),

π∈SN where π runs through all N-element permutations and it acts by permuting the coordinates of any point N x ∈ R . For any x ∈ ΣeN there is a unique π so that x ∈ π(ΣN ) and we then define the extension fe by −1 −1 fe(x) := f(π (x)). Clearly, π (x) is simply the coordinates of x = (x1, x2, . . . , xN ) permuted in increasing N ∞ order. Thus fe is defined on R apart√ from a zero measure set and it is bounded since f ∈ L . Furthermore, ∇[(fe)1/2] is also bounded since ∇ f ∈ L∞. R R Since fe is bounded, we have N fedωδ → Σ fdω = 1, so there is a constant Cδ such that fδ := Cδfe R R N is a probability density, i.e., N fδdωδ = 1 and clearly Cδ → 1 as δ → 0. Applying Theorem 13.6 to the R N measure ωδ and the function fδ on R , we see that the LSI holds for ωδ, i.e., 2 S (f ) D (pf ), ωδ δ 6 K ωδ δ or equivalently, Z q 2Cδ   Cδ felog fedωδ + log Cδ 6 Dωδ fe . N K R 1/2 Now we let δ → 0. Using the boundedness of ∇[(fe) ], the weak convergence of ωδ to ω and√ that Zω,δ and Cδ converge to 1, the Dirichlet form on the right side of the last inequality converges to Dω( f). The first term on the left side of the inequality converges to R f log fdω by dominated convergence, and the second term converges to zero. Thus we arrive at (13.76).

110 √ In the above argument, the boundedness of f and ∇ f were only used to ensure that f or rather q ˜ its extension fe has finite integral and the Dirichlet form w.r.t. the regularized measure ωδ, Dωδ ( f ), √ N converges to Dω( f ). For β > 1 we can remove these conditions by using a different extension of f to R 1 √ if f ∈ H (Σ, dω). We may assume that Dω( f) < ∞; otherwise (13.76) is a tautology. We first still assume ∞ that f ∈ L (Σ). We smoothly cut off f to be zero at the boundary of ΣN , i.e., we find a nonnegative ∞ √ √ 1 R sequence fε ∈ C0 (ΣN ) such that fε → f in H (Σ.dω) and fεdω = 1. For β > 1 the existence of a 1 similar sequence but fε → f in H was shown in Section 12.4. In fact, the same construction also shows √ √ 1 that we can also guarantee fε → f in H (Σ, dω). Now we use the LSI for the smooth functions fε, i.e., 2 S (f ) D (pf ), ω ε 6 K ω ε √ and we let ε → 0. The right hand side converges to Dω( f) by the above choice of fε. For the left hand side, recall that apart from a smoothing which can be dealt with via standard approximation arguments, the cutoff function was constructed in the form fε(x) = Cεφε(x)f(x), where φε(x) ∈ (0, 1) with φε % 1 monotonically pointwise and Cε is a normalization such that Cε → 1 as ε → 0. Clearly, Z Z Z Sω(fε) = Cε log Cε φεfdω + Cε fφε log φεdω + Cε φεf log fdω. Σ Σ Σ R The first term converges to zero since φεfdω 6 1 and Cε log Cε → 0. The second term also converges to zero by dominated convergence since |φε log φε| is bounded by one and goes to zero pointwise. Finally, the last term converges to Sω(f) by monotone convergence. This proves√ (13.76) for β > 1 for any bounded f. Finally we remove this last condition. Given any f with Dω( f) < ∞, we define fM := min{f, M} and feM := CM fM , where CM is the normalization. Clearly, CM → 1 and fM % f pointwise. Since (13.76) holds for feM , we have Z Z 2 Z C log C f dω + C f log f dω C |∇pf |2dω. (13.84) M M M M M M 6 K M M

Now we let M → ∞. The first term on the left is just log CM → 0. The second term on the left converges√ to Sω(f) by monotone convergence and CM → 1 and similarly the right hand side converges to (2/K)Dω( f) by monotone convergence. This proves (13.76) for β > 1 without any additional condition on f. The Brascamp-Lieb inequality, (13.77), is proved similarly, starting from its regularized version on RN

 00 00−1 hf; fiωδ 6 h∇f, Hδ + Hb ∇fiωδ that follows directly from Theorem 13.13. We can then take the limit δ → 0 using monotone convergence 00  00 00−1 on the left and the dominated convergence on the right, using that Hδ → H and the inverse Hδ + Hb is uniformly bounded. For the third part of the theorem, for the proof of (13.78), we first note that the remark after (12.17) applies to the generator Lω as well, i.e., β > 1 is necessary for the well-posedness of the equation ∂tft = Lωft on ΣN with initial condition f0 supported on ΣN . The construction of the dynamics in Section 12.4 also 1 2 implies that ft ∈ H (dω) for any t > 0 if f ∈ L (dω). We√ now mimic the proof of the entropy dissipation (13.32) in our setup. Since we do not know that Dω( ft) < ∞, we have to introduce a regularization c > 0 to keep ft away from zero. We compute Z Z Z d Lft ft log(ft + c)dω = (Lft) log(ft + c) dω + ft dω dt ft + c Z |∇f |2 Z cLf = − t dω − t dω ft + c ft + c Z 2 Z 2 |∇ft| c|∇ft| = − dω − 2 dω. (13.85) ft + c (ft + c)

111 Owing to the regularization c > 0, we used integration by parts for H1(dω) functions only. Dropping the last term that is negative and integrating from 0 to t, we obtain

Z Z t Z 2 Z |∇fs| ft log(ft + c)dω + ds dω 6 f0 log(f0 + c)dω. (13.86) 0 fs + c

ft+c Since ft log 0, we have ft >

Z Z t Z 2 Z |∇fs| ft log ftdω + ds 6 f0 log(f0 + c)dω. (13.87) 0 fs + c Note that both terms on the left hand side are nonnegative. Now we let c → 0. By monotone convergence and Sω(f0) < ∞, we get

Z Z t Z 2 Z |∇fs| ft log ftdµ + ds dω 6 f0 log f0dω (13.88) 0 fs or t Z p Sω(ft) + 4 Dω( fs) ds 6 Sω(f0). (13.89) 0 This is the entropy dissipation inequality in a time integral form. Notice that neither equality nor the differential version like (13.32) is claimed. Note that by integrating (13.85) between t and τ, a similar argument yields t Z p Sω(ft) + 4 Dω( fs) ds 6 Sω(fτ ), t > τ > 0. (13.90) τ In particular the entropy decays:

0 6 Sω(ft) 6 Sω(fτ ), t > τ > 0. (13.91) √ Now we use the LSI (13.76) to estimate Dω( fs) in (13.90) and recall that the LSI holds for any fs since β > 1. We get Z t Sω(ft) + 2K Sω(fs) ds 6 Sω(fτ ), t > τ > 0. (13.92) τ −2Kt A standard calculus exercise shows that Sω(ft) 6 e Sω(f0) for all t > 0. One possible argument is to fix any δ > 0 and choose τ = (n − 1)δ, t = nδ with n = 1, 2,... in (13.92). By monotonicity of the entropy we have (1 + 2Kδ)Sω(fnδ) 6 Sω(f(n−1)δ), for any n, and by iteration we obtain

−n Sω(fnδ) 6 (1 + 2Kδ) Sω(f0) . Setting δ = t/n and letting n → ∞ we get (13.78). This completes the proof of Theorem 13.14.

112 14 Universality of the Dyson Brownian Motion

In this section we assume β > 1 since Dyson Brownian Motion in a strong sense is well defined only for these values of β. However, some results will hold for any β > 0 that we will comment on separately. We consider the Gaussian equilibrium measure µ = µG ∼ exp(−βNH) on ΣN , defined in (12.13) in Section 12.3 with the 1 2 Gaussian potential V (λ) = 2 λ . From (12.14) we recall the corresponding Dirichlet form N Z 1 X 2 1 2 D (f) := (∂ f) dµ = k∇fk 2 , (∂ ≡ ∂ ) , (14.1) µ βN i βN L (µ) i λi i=1 and the generator L = LG defined via Z Dµ(f) = − fLGfdµ , (14.2)

1 and explicitly given by LG = βN ∆ − (∇H) · ∇. The corresponding dynamics is given by (12.17), i.e.,

∂tft = LGft , t > 0 . (14.3) In this section, we will drop all subscripts G. As remarked in the previous section, the Hamiltonian H is convex since the Hessian of the Hamiltonian 2 of µ satisfies ∇ (βNH) > βN/2 by (13.70). Taking the different normalization of the Dirichlet form in (13.25) and (14.1) into account, Theorem 13.6 (actually, its extension to ΣN in Theorem 13.14 with Uj ≡ 0) guarantees that µ satisfies the LSI in the form p Sµ(f) 6 4Dµ( f), and the relaxation time to equilibrium is of order one. Furthermore, we have the exponential convergence to the equilibrium µG in the sense of total variation norm (13.78) at exponential speed on time scales beyond order one for initial data with finite entropy. The following theorem is the main result of this section. It shows that under a certain condition on the rigidity of the eigenvalues the relaxation times of the DBM for observables depending only on the eigenvalue differences are in fact much shorter than order one. The main assumption is the a priori estimate (12.18) which we restate here.

A priori Estimate: There exists ξ > 0 such that

N 1 Z X Q := sup (λ − γ )2f (λ)µ (dλ) CN −2+2ξ (14.4) N j j t G 6 06t6N j=1 with a constant C uniformly in N. We also assume that after time 1/N the solution of the equation (14.3) satisfies the bound m Sµ(f1/N ) 6 CN (14.5) for some fixed m. Later in Lemma 14.6 we will show that for β = 1, 2 this bound automatically holds. Theorem 14.1 (Gap universality of the Dyson Brownian motion for short time). [64, Theorem 4.1] n Let β > 1 and assume (14.5). Fix n > 1 and an array of positive integers, m = (m1, m2, . . . , mn) ∈ N+. Let G : Rn → R be a bounded smooth function with compact support and define   Gi,m(x) := G N(xi − xi+m1 ),N(xi+m1 − xi+m2 ),...,N(xi+mn−1 − xi+mn ) . (14.6)

1 Then for any ξ ∈ (0, 2 ) and any sufficiently small ε > 0, independent of N, there exist constants C, c > 0, depending only on ε and G, such that for any J ⊂ {1, 2,...,N − mn} we have s Z 2 1 X ε N Q −cN ε sup Gi,m(x)(ftdµ − dµ) 6 CN + Ce . (14.7) −1+2ξ+ε |J| |J|t t>N i∈J

113 −1+2ξ+2ε+δ In particular, if (14.4) holds, then for any t > N with some δ > 0, we have Z 1 X 1 −cN ε Gi,m(x)(ftdµ − dµ) C + Ce . (14.8) |J| 6 p δ−1 i∈J |J|N

δ−1 Thus the gap distribution, averaged over J indices, coincides for ftdµ and dµ provided that |J|N → ∞. We stated this result with a very general test function (14.6) because we will need this form later. Obviously, controlling test functions of the form (14.6) for all n is equivalent to controlling test functions of neighboring gaps only (maybe with a larger n), so the reader may think only of the case m1 = 1, m2 = 2, . . . , mn = n. In our applications, J is chosen to be the indices of the eigenvalues in the interval [E − b, E + b] and thus |J|  Nb. This identifies the averaged gap distributions of eigenvalues. However, Theorem 12.4 concerns correlation functions and not gap distributions directly. It is intuitively clear that the information carried in gap distribution or correlation functions are equivalent if both statistics are averaged on a scale larger than the typical fluctuation of a single eigenvalue (which is smaller than N −1+ε in the bulk by (11.32)). This standard fact will be proved in Section 14.5. As pointed out after Theorem 12.4, the input of this theorem, the apriori estimate (14.4), identifies the location of the eigenvalues only on a scale N −1+ξ which is much weaker than the 1/N precision encoded in the rescaled eigenvalue differences in (14.7). Moreover, by the rigidity estimate (11.32), the a priori estimate (14.4) holds for any ξ > 0 if the initial data of the DBM is a generalized Wigner ensemble. Therefore, −1+ε Theorem 14.1 holds for any t > N for any ε > 0 and this establishes Dyson’s conjecture (described in Section 12.3) in the sense of averaged gap distributions for any generalized Wigner matrices.

14.1 Main ideas behind the proof of Theorem 14.1 The key method is to analyze the relaxation to equilibrium of the dynamics (13.30). This approach was first introduced in Section 5.1 of [63]; the presentation here follows [64]. Recall the convexity inequality (13.70) for the Hessian of the Hamiltonian

2 D E 1 1 X (vi − vj) 1 v, ∇2H(x)v kvk2 + kvk2, v ∈ N , (14.9) > 2 N (x − x )2 > 2 R i

N X 1 2 Hb(x) := Uj(xj),Uj(x) := (xj − γj) , (14.10) 2τ j=1 √ i.e., it is a quadratic confinement on the scale τ for each eigenvalue near its classical location, where the parameter 0 < τ < 1 will be chosen later on. The total Hamiltonian is given by

He := H + Hb, (14.11)

1 2 where H is the Gaussian Hamiltonian given by (4.12) with V (x) = 2 x . The measure with Hamiltonian He,

dω := ω(x)dx, ω := e−βNHe/Z,e (14.12)

114 will be called the local relaxation measure. The local relaxation flow is defined to be the flow with the generator characterized by the natural Dirichlet form w.r.t. ω, explicitly, Le: X 0 xj − γj Le = L − bj∂j, bj = U (xj) = . (14.13) j τ j

We will choose τ  1 so that the additional term Hb substantially increases the lower bound (13.70) on the Hessian, hence speeding up the dynamics so that the relaxation time is at most τ.

The idea of adding an artificial potential Hb to speed up the convergence appears to be unnatural here. The current formulation is a streamlined version of a much more complicated approach that appeared in [63] which took ideas from the earlier work [59]. Roughly speaking, in the hydrodynamical limit, the short wavelength modes always have shorter relaxation times than the long wavelength modes. A direct implementation of this idea is extremely complicated due to the logarithmic interaction that couples short and long wavelength modes. Adding a strongly convex auxiliary potential Hb shortens the relaxation time of the long wavelength modes, but it does not affect the short modes, i.e., the local statistics, which are our main interest. The analysis of the new system is much simpler since now the relaxation is faster, uniform for all modes. Finally, we need to compare the local statistics of the original system with those of the modified one. It turns out that the difference is governed by (∇Hb)2 which can be directly controlled by the a priori estimate (14.4). Our method for enhancing the convexity of H is reminiscent of a standard convexification idea concerning metastable states. To explain the similarity, consider a particle near one of the local minima of a double well potential separated by a local maximum, or energy barrier. Although the potential is not convex globally, one may still study a reference problem defined by convexifying the potential along with the well in which the particle initially resides. Before the particle reaches the energy barrier, there is no difference between these two problems. Thus questions concerning time scales shorter than the typical escape time can be conveniently answered by considering the convexified problem; in particular the escape time in the metastability problem itself can be estimated by using convex analysis. Our DBM problem is already convex, but not sufficiently convex. The modification by adding Hb enhances convexity without altering the local statistics. This is similar to the convexification in the metastability problem which does not alter events before the escape time.

14.2 Proof of Theorem 14.1

We will work with the measures µ and ω defined on the simplex ΣN and we present the proof as if Theo- N rem 13.6 and its proof were valid not only for measures on R but also on ΣN . Theorem 13.14 demonstrated that this is indeed true except that the exponential decay to equilibrium was proven only in entropy sense and not Dirichlet form sense as in (13.31). For pedagogical simplicity, we first neglect this technicality and in Section 14.3 we will comment on how to remedy this problem with the help of the regularization introduced in Section 13.7. The core of the proof is divided into three theorems. For the flow with generator Le, we have the following estimates on the entropy and Dirichlet form.

Theorem 14.2. Let β > 1 arbitrary. Consider the forward equation

∂tqt = Leqt, t > 0, (14.14) R with the reversible measure ω defined in (14.12). Let the initial condition q0 satisfy q0dω = 1. Then we have the following estimates

Z N √ √ 2 √ 2 √ 1 X (∂i qt − ∂j qt) ∂ D ( q ) − D ( q ) − dω, (14.15) t ω t 6 τ ω t βN 2 (x − x )2 i,j=1 i j

115 Z ∞ Z N √ √ 2 1 X (∂i qs − ∂j qs) √ ds dω D ( q ), (14.16) βN 2 (x − x )2 6 ω 0 0 i,j=1 i j and the logarithmic Sobolev inequality √ Sω(q0) 6 CτDω( q0) (14.17) with a universal constant C. Thus the relaxation time to equilibrium is of order τ:

−Ct/τ −Ct/τ Sω(qt) 6 e Sω(q0),Dω(qt) 6 e Dω(q0). (14.18) √ Proof. Denote by h = q and by a calculation similar to (13.33) we have the equation Z Z 1 2 −βNHe 2 2 −βNHe ∂tDω(ht) = ∂t (∇h) e dx − ∇h(∇ He)∇h e dx. (14.19) βN 6 βN

In our case, (13.70) and (14.10) imply that the Hessian of He is bounded from below as

2 1 X 2 1 X 1 2 ∇h(∇ He)∇h (∂jh) + (∂ih − ∂jh) . (14.20) > τ 2N (x − x )2 j i,j i j This proves (14.15) and (14.16). The rest can be proved by straightforward arguments analogously to (13.32)–(13.37).

Notice that the estimate (14.16) is an additional information that we extracted from the Bakry-Emery´ argument by using the second term in the Hessian estimate (13.70). It plays a key role in the next theorem.

Theorem 14.3 (Dirichlet form inequality). Let β > 1 be arbitrary. Let q be a probability density with respect to R n the relaxation measure ω from (14.12), i.e., qdω = 1. Fix n > 1, an array of positive integers m ∈ N and a n smooth function G : R → R with compact support as in Theorem 14.1. Then for any J ⊂ {1, 2,...,N −mn} and any t > 0 we have √ Z 1/2 1 X  Dω( q) G (x)(qdω − dω) C t + CpS (q)e−ct/τ . (14.21) |J| i,m 6 |J| ω i∈J

Proof. We give the proof for the β > 1 case here since this is relevant for Theorem 14.1. The general case β > 0 with additional assumptions will be discussed in Section 14.4. For simplicity of the notation, we consider only the case n = 1, m1 = 1, Gi,m(x) = G(N(xi − xi+1)). Let qt satisfy

∂tqt = Leqt, t > 0, with an initial condition q0 = q. We write Z h 1 X i G(N(x − x )) (q − 1)dω (14.22) |J| i i+1 i∈J Z h 1 X i Z h 1 X i = G(N(x − x )) (q − q )dω + G(N(x − x )) (q − 1)dω. |J| i i+1 t |J| i i+1 t i∈J i∈J The second term in (14.22) can be estimated by (13.8), the decay of the entropy (14.18) and the boundedness of G; this gives the second term in (14.21). To estimate the first term in (14.22), by the evolution equation ∂qt = Leqt and the definition of Le we have Z 1 X Z 1 X G(N(x − x ))q dω− G(N(x − x ))q dω |J| i i+1 t |J| i i+1 0 i∈J i∈J Z t Z 1 X = ds G0(N(x − x ))[∂ q − ∂ q ]dω. |J| i i+1 i s i+1 s 0 i∈J

116 √ √ From the Schwarz inequality and ∂q = 2 q∂ q, the last term is bounded by

1/2 " Z t Z 2 2 # N X h 0 i 2 2 ds 2 G (N(xi − xi+1)) (xi − xi+1) qsdω (14.23) N |J| 0 R i∈J 1/2 √ "Z t Z # 1/2 1 X 1 √ √ 2 Dω( q0)t × ds 2 2 [∂i qs − ∂i+1 qs] dω 6 C , N N (x − x ) |J| 0 R i i i+1

2 h 0 i 2 −2 where we have used (14.16) and that G (N(xi − xi+1)) (xi − xi+1) 6 CN due to G being smooth and compactly supported. Alternatively, we could have directly estimated the left hand side of (14.21) by using the total variation norm between qω and ω, which in turn could be estimated by the entropy and the Dirichlet form using the logarithmic Sobolev inequality, i.e., by

Z p q √ C |q − 1|dω 6 C Sω(q) 6 C τDω( q). (14.24)

However, compared with this simple bound, the estimate (14.21) gains an extra factor |J|  N in the denominator, i.e., it is in terms of Dirichlet form per particle. The improvement is due to the fact that the observable in (14.21) depends only on the gap, i.e., difference of points. This allows us to exploit the additional term (14.16) gained in the Bakry-Emery´ argument. This is a manifestation of the general observation that gap-observables behave much better than point-observables.

The final ingredient in proving Theorem 14.1 is the following entropy and Dirichlet form estimates.

Theorem 14.4. Let β > 1 be arbitrary. Suppose that (13.70) holds and recall the definition of Q from (12.18). Fix some (possibly N-dependent) τ > 0 and consider the local relaxation measure ω with this τ. Set ψ := ω/µ and let gt := ft/ψ with ft solving the evolution equation (14.3). Suppose there is a constant m such that

m S(fτ µ|ω) 6 CN . (14.25) Fix any small ε > 0. Then for any t ∈ [τN ε,N] the entropy and the Dirichlet form satisfy the estimates:

2 −1 √ 2 −2 S(gtω|ω) 6 CN Qτ ,Dω( gt) 6 CN Qτ , (14.26) where the constants depend on ε and m.

Remark 14.5. It will not be difficult to check that if the initial data is given by a Wigner ensemble, then (14.25) holds without any assumption, see Lemma 14.6.

Proof of Theorem 14.4. The evolution of the entropy S(ftµ|ω) = S(ftµ|ψµ) can be computed by Lemma 13.11 as 4 X Z √ Z ∂ S(f µ|ω) = − (∂ g )2 ψ dµ + (Lg )ψ dµ, t t βN j t t j where L is defined in (14.3) and we used that ψ = ω/µ is time independent. Hence we have, by using (14.13), Z Z Z 4 X √ 2 X ∂tS(ftµ|ω) = − (∂j gt) dω + Legt dω + bj∂jgt dω. βN j j

Since ω is L-invariant and time independent, the middle term on the right hand side vanishes. From the e √ √ Schwarz inequality and ∂g = 2 g∂ g, we have Z √ X 2 √ 2 −2 ∂tS(ftµ|ω) 6 −2Dω( gt) + CN bj gt dω 6 −2Dω( gt) + CN Qτ . (14.27) j

117 Notice that (14.27) is reminiscent to (13.32) for the derivative of the entropy of the measure gtω = ftµ with respect to ω. The difference is, however, that gt does not satisfy the evolution equation with the generator Le. The last term in (14.27) expresses the error. Together with the logarithmic Sobolev inequality (14.17), we have

√ 2 −2 −1 2 −2 ∂tS(ftµ|ω) 6 −2Dω( gt) + CN Qτ 6 −Cτ S(ftµ|ω) + CN Qτ . (14.28)

ε Integrating the last inequality from τ to t and using the assumption (14.25) and t > τN , we have proved the first inequality of (14.26). Using this result and integrating (14.27), we have

Z t √ 2 −1 Dω( gs)ds 6 CN Qτ . (14.29) τ Notice that Z 2 Z 2 p 1 |∇(gsψ)| dω C h|∇gs| 2 i √ 2 −2 Dµ( fs) = 6 + |∇ log ψ| gs dω = CDω( gs) + CN Qτ (14.30) βN gsψ ψ βN gs

by a Schwarz inequality. Thus from (14.29), after restricting the integration and using t > 2τ, we get Z t Z t 2 −1 √  1 p 2 −2 τ p 2 −1 CN Qτ > Dω( gs)ds > Dµ( fs) − CN Qτ ds > Dµ( ft) − CN Qτ , (14.31) t−τ t−τ C C √ where, in the last step, we used that Dµ( ft) is decreasing in t which follows from the convexity of the √ √ 2 −2 Hamiltonian of µ, see e.g. (13.33). Using the opposite inequality Dω( gt) 6 CDµ( ft) + CN Qτ , that can be proven similarly to (14.30), we obtain

√ 2 −2 Dω( gt) 6 CN Qτ , i.e., the second inequality of (14.26).

Finally, we complete the proof of Theorem 14.1. For any given t > 0 we now choose τ := tN −ε and we construct the local relaxation measure ω with this τ as in (14.12). Set ψ = ω/µ and let q := gt = ft/ψ be the density q in Theorem 14.3. We would like to apply Theorem 14.4 and for this purpose we need to verify the assumption (14.25). By the definitions of ω, µ and ψ = ω/µ, we have Z Z Z S(fτ µ|ω) = fτ log fτ dµ − fτ log ψdµ = Sµ(fτ ) − fτ log ψdµ. (14.32)

We can bound the last term by

Z Z Ze − f log ψdµ CN f W dµ + log CN 2Qτ −1 CN m (14.33) τ 6 τ Z 6 6 for some m, where we have used that Ze 6 Z since He > H. To verify the assumption (14.25), it remains to m prove that Sµ(fτ ) 6 CN for some m. This inequality follows from the following lemma.

Lemma 14.6. Let β = 1, 2. Suppose the initial data f0 of the DBM is given by the eigenvalue distribution of a Wigner matrix. Then for any τ > 0 we have

2 −τ  Sµ(fτ ) 6 CN (1 − log 1 − e ) . (14.34)

Proof. For simplicity, we consider only the case β = 1, i.e., the real Wigner matrices. Recall that the probability measure fτ µ, is the same as the eigenvalue distribution of the Gaussian divisible matrix (12.21):

−τ/2 −τ 1/2 G Hτ = e H0 + (1 − e ) H , (14.35)

118 G where H0 is the initial Wigner matrix and H is an independent standard GOE matrix. Since the entropy is monotonic w.r.t. taking a marginal, (13.21), we have

G Sµ(fτ ) = S(fτ µ|µ) 6 S(µHτ |µH ),

G G where µHτ and µH are the laws of the matrix Hτ and H respectively. Since the laws of both µHτ and µHG are given by the product of the laws of the matrix elements, from the additivity of entropy (13.11), G S(µHτ |µH ) is equal to the sum of the relative entropies of the matrix elements. Recall that the variances of off-diagonal and diagonal entries for GOE differ by a factor 2. For simplicity of notations, we consider only −τ the off-diagonal terms. Let γ = 1 − e and denote by gα the standard Gaussian distribution with variance α, i.e., 2 1 x  gα(x) := √ exp − . 2πα 2α 1/2 −τ/2 Let %γ be the probability density of (1 − γ) ζ = e ζ where ζ is the random variable for an off-diagonal matrix element. By definition, the probability density of the matrix element of Hτ is given by ζτ = %γ ∗g2γ/N . Therefore Jensen’s inequality yields Z  Z  S(ζτ |g2/N ) = S dy %γ (y) g2γ/N (· − y) g2/N dy %γ (y)S g2γ/N (· − y)|g2/N . (14.36) 6 By explicit computation one finds

2 2  s y σ 1 S g 2 (· − y)|g 2 = log + + − . σ s σ 2s2 2s2 2 In our case, we have 1N  Sg (· − y)|g  = y2 − log γ + γ − 1 . 2γ/N 2/N 2 2

R 2 We can now continue the estimate (14.36). Using y %γ (y)dy = 2/N, we obtain

S(ζτ |g2/N ) 6 C − log γ , and the claim follows. We now return to the proof of Theorem 14.1. We can apply Lemma 14.6 with the choice τ = N −1+2ξ, 1 where ξ ∈ (0, 2 ) is from the assumption (14.4). Together with (14.32)–(14.33), this implies that (14.25) holds. Thus Theorem 14.4 and Theorem 14.3 imply for any t ∈ [τN ε,N] that

Z √ !1/2 1 X Dω( q) ε G (x)(f dµ − dω) C t + CpS (q)e−cN (14.37) N m,i t 6 |J| ω i∈J 1/2 s 2 ! 2 N Q ε N Q ε C t + Ce−cN CN ε + Ce−cN , 6 |J|τ 2 6 |J|t

i.e., the local statistics of ftµ and ω are the same for any initial data fτ .

Applying the same argument to the Gaussian initial data, f0 = fτ = 1, we can also compare µ and ω. We −1+2ξ+δ+2ε have thus proved the estimate (14.7). Finally, if t > N , then the assumption (14.4) guarantees that, s N 2Q 1 6 , |J|t p|J|N δ−1 which proves Theorem 14.1.

119 14.3 Restriction to the simplex via regularization In the previous section we tacitly assumed that the Bakry-Emery´ analysis from Section 13.3 extends to Σ. Strictly speaking, this is not fully rigorous. There are two options to handle this technical point. As we remarked after Theorem 13.14, there is a direct method via cutoff functions to make the Bakry- ´ Emery argument rigorous on Σ if β > 1 (see Appendix B of [64]). The same technique can also make every step in Section 14.2 rigorous on Σ. N We present here a more transparent argument that relies on the regularized measure ωδ on R already used in the proof of the LSI in Theorem 13.14. This path is technically simpler since it performs the core of the analysis on RN , and it uses an extra input, the strong solution to the DBM, Theorem 12.2, to remove the regularization at the end. We now explain the correct procedure of the proof of Theorem 14.1 using this regularization. The goal is to prove (14.37). For any small δ > 0 we introduce the regularized versions µδ and ωδ of the measures µ and ω as in N Section 13.7. Let Lωδ be the generator of the regularized measure ωδ on R and let ft,δ be the solution to N ∂tft,δ = Lωδ ft,δ on R with the same initial condition as for δ = 0, i.e., f0,δ is the extension of f0 to the N entire R with f0,δ(x) = 0 for x 6∈ ΣN . The three key theorems, Theorems 14.2–14.4, of Section 14.2 were formulated and proven for the regular- ized setup on RN , so they can be applied to the regularized dynamics. With the notations in these theorems, the conclusion is that if m S(fτ,δµδ|ωδ) 6 CN (14.38) (see (14.25)), then s Z 2 1 X N Qδ ε G (x)(f dµ − dω ) CN ε + Ce−cN , t ∈ [τN ε,N], (14.39) N m,i t,δ δ δ 6 |J|t i∈J

where Qδ is defined exactly as in (14.4) with the regularized measures, i.e.,

N 1 Z X Q := sup (x − γ )2f (x)µ (dx). (14.40) δ N j j t,δ δ 06t6N j=1

Now we let δ → 0. Since ωδ converges weakly to ω in the sense that ωδ(A) → ω(A ∩ ΣN ) for any subset N R R A ∈ R , we have O(x)dωδ → O(x)dω for any bounded observable O. Setting 1 X O(x) = G (x), (14.41) N m,i i∈J

we see that the second term within the integral in the left hand side of (14.39) converges to the corresponding term in (14.37). On the right hand side a similar argument shows that Qδ → Q as δ → 0. Strictly speaking, 2 the observable (xj − γj) is unbounded, but a very simple cutoff argument shows that

ft,δ µδ 2 −cN max |xj| N e P j > 6

uniformly in δ and t ∈ [0,N]. For the first term in the left side of (14.39), we use the stochastic representation of the solution to ∂tft,δ = Lω ft,δ to note that δ Z f0ω x0 O(x)ft,δ(x)dωδ = E Eδ O(x(t)), (14.42) N R x0 where Eδ denotes the expectation with respect to the law of the regularized DBM (x(t))t, see (13.82), f0ω starting from x0, and E denotes the expectation of x0 with respect to the measure f0ω. Now we use the existence of the strong solution to the DBM (13.74) for any β > 1, that can be obtained exactly as in Theorem 12.2 for the Uj ≡ 0 case. Since x(t) is continuous and it remains in the open set ΣN , the probability

120 that up to a fixed time t it remains away from a δ-neighborhood of the boundary of ΣN goes to 1 as δ → 0, i.e.,   ωδ lim P dist(x(s), ∂ΣN ) δ : s ∈ [0, t] = 1. δ→0 >

Notice that (13.74) and (13.82) are exactly the same for paths that stay away from the boundary of ΣN . This means that the right hand side of (14.42) converges to Ef0ωEx0 O(x(t)), where Ex0 denotes expectation with respect to the law of (13.74). This proves that Z Z lim O(x)ft,δ(x)dωδ = O(x)ft(x)dω , (14.43) δ→0 N R ΣN which completes the proof of (14.37). Finally, we need to verify the condition (14.38). Following (14.32)–(14.33), we only need to show that m Sµδ (fτ,δ) 6 CN for sufficiently small δ > 0. We first run the DBM for a very short time and replace the −10 original matrix ensemble H0 by Hσ, its tiny Gaussian convolution of variance σ = N , see (14.35). This is −1 not a restriction, since Theorem 14.1 anyway concerns ft for much larger times, t  N . The corresponding 12 eigenvalue density fσ supported on ΣN satisfies Sµ(fσ) 6 CN by (14.34). As a second cutoff, for some large R K we set fbσ,K := CK max{fσ,K} with a normalization constant CK such that fbσ,K dµ = 1 and CK → 1 as K → ∞. We will actually apply Theorem 14.1 to fbσ,K as an initial condition. The regularized flow also

starts with this initial condition, i.e., we define ft,δ to be the solution of ∂tft,δ = Lµδ ft,δ with f0,δ := fbσ,K .

Since the entropy decays along the regularized flow, we immediately have Sµδ (fτ,δ) 6 Sµδ (f0,δ) = Sµδ (fbσ,K ).

Since fbσ,K is supported on ΣN and it is bounded, clearly Sµδ (fbσ,K ) converges to Sµ(fbσ,K ) as δ → 0. We then have Z   Sµ(fbσ,K ) = CK max{fσ,K} log max{fσ,K} + log CK dµ → Sµ(fσ)

as K → ∞ by monotone convergence. Thus first choosing K sufficiently large so that Sµ(fbσ,K ) 6 2Sµ(fσ) 6 12 12 CN , then choosing δ sufficiently small, we achieved Sµδ (fτ,δ) 6 CN . This verifies the condition (14.38) for some sufficiently small δ, so that (14.39) holds. Letting δ → 0, we obtain (14.7) which completes the fully rigorous proof of Theorem 14.1.

14.4 Dirichlet form inequality for any β > 0

Here we extend the proof of the key Dirichlet form inequality, Theorem 14.3, from β > 1 to any β > 0. This extension will not be needed for this book, so the reader may skip this section, but we remark that the Dirichlet form inequality for β > 0 played a crucial role in the proof of the universality for invariant ensembles for any β > 0, see [25]. √ Lemma 14.7. Let β > 0 and q be a probability density with respect to ω defined in (14.12) with ∇ q ∈ L∞(dω) ∞ and q ∈ L (dω). Then for any J ⊂ {1, 2,...,N − mn − 1} and any t > 0 we have

Z Z s √ 1 X 1 X Dω( q) t p −ct/τ Gi,m q dω − Gi,m dω C + C Sω(q) e . (14.44) |J| |J| 6 |J| i∈J i∈J

√ ∞ ∞ For β > 1, the conditions ∇ q ∈ L and q ∈ L (dω) can be removed. We emphasize that this lemma holds for any β > 0, i.e., it does not (directly) rely on the existence of the DBM. The parameter t is not connected with the time parameter of a dynamics on ΣN (although it emerges as a time cutoff in a regularized dynamics on RN within the proof).

Proof of Lemma 14.7. First we record the result if ω on ΣN is replaced with the regularized measure ωδ on N P R , as defined in (13.81) with the choice of Hb(x) = j Uj(xj), where Uj is given in (14.10). In this case the proof of Theorem 14.3 applies to the letter even for β > 0 since now we work on RN and complications

121 arising from the boundary are absent. We immediately obtain for any probability density q on RN with R b qbdωδ = 1, for any J ⊂ {1, 2,...,N − mn − 1} and any t > 0 that s Z 1 Z 1 D (pq) t X X ωδ b p −ct/τ Gi,m q dωδ − Gi,m dωδ C + C Sω (q) e . (14.45) |J| b |J| 6 |J| δ b i∈J i∈J √ 1 Suppose now that q ∈ H (dω) and q is a bounded probability density in ΣN with respect to ω. Similarly to the proof of (13.76) for the general β > 0 case, we may extend q to Σe by symmetrization. Let qe denote this extension which is bounded and ∇fe1/2 is also bounded since q has these properties. Then there is a constant Cδ such that qδ := Cδ q is a probability density with respect to ωδ. Since q is bounded, we have R R e e N qdωδ → qdω by dominated convergence, and thus Cδ → 1 as δ → 0. R e ΣN Now we apply (14.45) qb = qδ. Taking the limit δ → 0, the left hand side converges to that of (14.44) since Gi,m is a bounded smooth function and qδdωδ = Cδqedωδ converges weakly to q(ω)1(ω ∈ ΣN )dω by dominated convergence. We thus have

Z 1 X Z 1 X G q dω − G dω |J| i,m |J| i,m i∈J i∈J s C D (pq) t C lim sup δ ωδ e + C lim sup pS (C q) e−ct/τ . (14.46) 6 ωδ δ e δ→0 |J| δ→0

1/2 ∞ 1/2 √ By dominated convergence, using that ∇fe ∈ L , we have Dωδ (fe ) → Dω( f). Similarly, the entropy term also converges to S (q) by q ∈ L∞. ω e √ This proves (14.44) under the condition q ∈ L∞, ∇ q ∈ L∞. Finally, these conditions can be removed for √ β > 1 by a simple approximation. We may assume Dω( q) < ∞; otherwise (14.44) is an empty statement. ∞ By the LSI (13.76) we also know that Sω(q) < ∞. First, we still keep the assumption that q ∈ L . Since ∞ 1 ∞ C0 (Σ) is dense in H (dω) (see Section 12.4), we can find a sequence of densities qn ∈ L (Σ) such that √ 2 √ √ 2 √ √ 2 ∇ qn ∈ L (dω) and qn → q in L (dω) and ∇ qn → ∇ q in L (dω). In fact, the construction in Section 12.4 guarantees that, apart from an irrelevant smoothing, qn may be chosen of the form qn = Cnφnq, where φn is a cutoff function with 0 6 φn 6 1, converging to 1 pointwise and Cn → 1. We know that (14.44) holds for every qn. Taking the limit n → ∞, the left hand side converges since Z √ √ √ √

O(qn − q)dω 6 kOk∞k qn − qkL2(dω)k qn + qkL2(dω) → 0, √ √ √ √ √ where we have used that k q + qk k q k +k qk and k qk = R qdω = 1. Here O is given by (14.41). n 2 6 n 2 2 2 √ √ For the right hand side of (14.44) applied to the approximate function qn, we note that Dω( qn) → Dω( q) directly by the choice of qn. Finally, we need to show that Sω(qn) → Sω(q) as n → ∞. Clearly, Z Z     Sω(qn) − Sω(q) = qn log qn − q log q dω = Cnφnq log(Cnφn) + (Cnφn − 1)q log q dω, Σ Σ

and both integrals go to zero by dominated convergence and by Sω(q) < ∞. This proves (14.44) for any q ∈ ∞ L . Finally, this last condition can be removed by approximating any density q with qM := CM max{q, M}, where CM is the normalization and CM → 1 as M → ∞. Similarly to the proof in (13.84), one can show that Sω(qM ) → Sω(q) and Dω(qM ) → Dω(q). Thus we can pass to the limit in the inequality (14.44) written up for qM . This proves Lemma 14.7. Notice that the proof did not use the existence of the DBM; instead, it used the existence of the regularized DBM.

14.5 From gap distribution to correlation functions: proof of Theorem 12.4. Theorem 12.4 follows from Theorem 14.1 if we can show that the correlation function difference in (12.23) can be bounded in terms of the difference of the expectation of gap observables in (14.8). We state it as the following lemma which clearly proves Theorem 12.4 with Theorem 14.1 as an input.

122 Lemma 14.8. Let f be a probability density with respect to the Gaussian equilibrium measure (12.13). Suppose that the estimate (12.22) holds with some exponent ξ > 0 and that

Z 1 X 1 Gi,m(x)(fdµ − dµ) C (14.47) |J| 6 p δ−1 i∈J |J|N

also holds for some δ > 0 and for any J ⊂ {1, 2,...,N − mn}, where Gi,m is defined in (14.6). Then for −1 any ε > 0 and N large enough we have for any b = bN with N  b  1 that r Z E+b dE0 Z   α  hN −1+ξ N −δ i (n) (n) 0 2ε (14.48) dα O(α) pfµ,N − pµ,N E + 6 N + . 2b n N% (E) b b E−b R sc This is a fairly standard technical argument, and the details will be given in Section 14.6. Here we just summarize the main points. To understand the difference between (14.47) and (14.48), consider n = 1 for simplicity and let m1 = 1, say. The observable (14.7) answers the question: “What is the empirical distribu- tion of differences of consecutive eigenvalues?”, in other words (14.7) directly identifies the gap distribution. Correlation functions answer the question: “What is the probability that there are two eigenvalues at a fixed distance away from each other?”, in other words they are not directly sensitive to the possible other eigenvalues in between. Of course these two questions are closely related and it is easy to deduce the answers from each other under some reasonable assumptions on the distribution of the eigenvalues. We now prove that the correlation functions can be derived from (generalized) gap distributions which is the content of Lemma 14.8.

14.6 Details of the proof of Lemma 14.8 By definition of the correlation function in (12.19), we have

Z E+b 0 Z dE (n)  0 α  dα O(α)pfµ,N E + 2b n N E−b R Z E+b 0 Z dE X 0  =CN,n O N(xi1 − E ),N(xi1 − xi2 ),...N(xin−1 − xin ) fdµ, E−b 2b i16=i26=...6=in Z E+b 0 Z N dE X X 0 =CN,n Yi,m(E , x)fdµ, (14.49) E−b 2b m∈Sn i=1

n −1 with CN,n := N (N − n)!/N! = 1 + On(N ), where we let Sn denote the set of increasing positive integers, n−1 m = (m2, m3, . . . , mn) ∈ N+ , m2 < m3 < . . . < mn, and we introduced 0 0  Yi,m(E , x) := O N(xi − E ),N(xi − xi+m2 ),...,N(xi − xi+mn ) . (14.50)

(n) We will set Yi,m = 0 if i + mn > N. By permutational symmetry of pf,N we can assume that O is symmetric and thus we restrict the summation to i1 < i2 < . . . < in upon an overall factor n!. Then we changed indices i = i1, i2 = i + m2, i3 = i + m3,..., and performed a resummation over all index differences encoded in m. 0 Apart from the first variable N(xi1 − E ), the function Yi,m is of the form (14.6), so (14.47) will apply. The dependence on the first variable will be negligible after the dE0 integration on a macroscopic interval. To control the error terms in this argument, especially to show that the error terms in the potentially infinite sum over m ∈ Sn converge, one needs an apriori bound on the local density. But this information is provided very precisely by the bulk rigidity estimate (12.22). The details for the rest of the argument will be presented in the next subsection. Since (14.49) also holds if we set f = 1, to prove Lemma 14.8, we only need to estimate

Z E+b dE0 Z N X X 0 Θ := Yi,m(E , x)(f − 1)dµ . (14.51) E−b 2b m∈Sn i=1

123 Let M be an N-dependent parameter chosen at the end of the proof. Let

c Sn(M) := {m ∈ Sn , mn 6 M},Sn(M) := Sn \ Sn(M),

n−1 (1) (2) (3) and note that |Sn(M)| 6 M . We have the simple bound Θ 6 ΘM + ΘM + ΘM , where

Z E+b dE0 Z N (1) X X 0 ΘM := Yi,m(E , x)(f − 1)dµ (14.52) E−b 2b m∈Sn(M) i=1 and Z E+b 0 Z N (2) X dE X Θ := Y (E0, x)fdµ . (14.53) M 2b i,m c E−b m∈Sn(M) i=1 (3) (2) We define ΘM to be the same as ΘM but with f replaced by the constant 1, i.e., the equilibrium measure µ. (1) Step 1: Small m case; estimate of ΘM . After performing the dE0 integration, we will eventually apply Theorem 14.1 to the function Z   G u1, u2,... := O y, u1, u2,... dy, R i.e., to the quantity Z 1   dE0 Y (E0, x) = G N(x − x ),... (14.54) i,m N i i+m2 R for each fixed i and m. ± ± For any E and 0 < ζ < b define sets of integers J = JE,b,ζ and J = JE,b,ζ by  ±  J := i : γi ∈ [E − b, E + b] ,J := i : γi ∈ [E − (b ± ζ),E + b ± ζ] ,

− + where γi was defined in (11.31). Clearly, J ⊂ J ⊂ J . With these notations, we have

Z E+b N Z E+b 0 X 0 0 X 0 + dE Yi,m(E , x) = dE Yi,m(E , x) + ΩJ,m(x). (14.55) E−b i=1 E−b i∈J +

+ + The error term ΩJ,m, defined by (14.55) indirectly, comes from those i 6∈ J indices, for which xi ∈ −1 0 0 [E −b, E +b]+O(N ) since Yi,m(E , x) = 0 unless |xi −E | 6 C/N, the constant depending on the support of O. Thus + −1 |ΩJ,m(x)| 6 CN #{ i : |xi − γi| > ζ/2, |xi − E| 6 2b} for any sufficiently large N assuming ζ  1/N and using that O is a bounded function. The additional N −1 0 factor comes from the dE0 integration. Due to the rigidity estimate (12.22) and choosing ζ = N −1+ξ+ε with some ε0 > 0, we get Z + −D |ΩJ,m(x)|fdµ 6 N (14.56) for any D. We can also estimate Z E+b Z E+b 0 X 0 0 X 0 −1 + − dE Yi,m(E , x) 6 dE Yi,m(E , x) + CN |J \ J | E−b i∈J + E−b i∈J − Z 0 X 0 −1 + − + = dE Yi,m(E , x) + CN |J \ J | + ΞJ,m(x) (14.57) R i∈J − Z 0 X 0 −1 + − −1 − + 6 dE Yi,m(E , x) + CN |J \ J | + CN |J \ J | + ΞJ,m(x), R i∈J

124 + − where the error term ΞJ,m, defined by (14.57), comes from indices i ∈ J such that xi 6∈ [E − b, E + b] + + O(1/N). It satisfies the same bound (14.56) as ΩJ,m. By the continuity of %, the density of γi’s is bounded + − − by CN, thus |J \ J | 6 CNζ and |J \ J | 6 CNζ. Therefore, summing up the formula (14.54) for i ∈ J, we obtain from (14.55), (14.56) and (14.57)

N Z E+b Z X Z 1 X   dE0 Y (E0, x)fdµ G N(x − x ),... fdµ + Cζ + CN −D i,m 6 N i i+m2 E−b i=1 i∈J

for each m ∈ Sn. A similar lower bound can be obtained analogously, and we get

N Z E+b Z X Z 1 X   dE0 Y (E0, x)fdµ − G N(x − x ),... fdµ Cζ + CN −D (14.58) i,m N i i+m2 6 E−b i=1 i∈J

−D for each m ∈ Sn. The error term N can be neglected. Adding up (14.58) for all m ∈ Sn(M), we get

N Z E+b Z X X Z X 1 X   dE0 Y (E0, x)fdµ − G N(x − x ),... fdµ i,m i i+m2 E−b N m∈Sn(M) i=1 m∈Sn(M) i∈J n−1 6 CM ζ . (14.59) Clearly, the same estimate holds for the equilibrium, i.e., if we set f = 1 in (14.59). Subtracting these two formulas and applying (14.47) to each summand on the second term in (14.58), we conclude that

E+b 0 N Z dE Z  0  (1) X X 0 n−1 −1 −1+ξ+ε −1/2 −δ/2 ΘM = Yi,m(E , x)(fdµ−dµ) 6 CM b N +b N , (14.60) E−b 2b m∈Sn(M) i=1

(1) where we have used that |J| 6 CNb. This completes the estimate of ΘM .

(2) (3) Step 2. Large m case; estimate of ΘM and ΘM .

For a fixed y ∈ R, ` > 0, let

N X  ` `  χ(y, `) := 1 x ∈ y − , y +  i N N i=1

denote the number of points in the interval [y − `/N, y + `/N]. Note that for a fixed m = (m2, . . . , mn), we have N N X 0 0  0  X  0  |Yi,m(E , x)| 6 C · χ(E , `) · 1 χ(E , `) > mn 6 C m · 1 χ(E , `) > m , (14.61) i=1 m=mn where ` denotes the maximum of |u1| + ... + |un| in the support of O(u1, . . . , un). n−1 Since the summation over all increasing sequences m = (m2, . . . , mn) ∈ N+ with a fixed mn contains n−2 at most mn terms, we have

Z E+b Z N Z E+b N Z   X 0 X 0 0 X n−1 0 dE Yi,m(E , x)fdµ 6 C dE m 1 χ(E , `) > m fdµ. (14.62) c E−b E−b m∈Sn(M) i=1 m=M

The rigidity bound (12.22) clearly implies Z  0  −D 1 χ(E , `) > m fdµ 6 CD,ε0 N

125 0 ε0 for any D > 0 and ε > 0, as long as m > `N . Therefore, choosing D = 2n + 2, we see from (14.62) that Θ(2) is negligible, e.g. (2) −1 ΘM 6 CN ε0 (3) if M > `N . Since all rigidity bounds hold for the equilibrium measure, the last bound is valid for ΘM as 0 well. We will choose M = `N ε and recall that ` is independent of N. Combining these bounds with (14.60), we obtain ε0(n+1) −1 −1+ξ+ε0 −1/2 −δ/2 −1 Θ 6 CN b N + b N + CN . Choosing ε0 = ε/(n + 1), we conclude the proof of Lemma 14.8.

126 15 Continuity of local correlation functions under matrix OU process

We have completed the first two steps of the three step strategy introduced in Section 5, i.e., the local semicircle law and the universality of Gaussian divisible ensembles (Theorem 12.4). In this section, we will complete this strategy by proving a continuity result for the local correlation functions of the matrix OU process in the following Theorem 15.2 and Lemma 15.3. This is the Step 3a defined in Section 5. From these results, we obtain a weaker version of Theorem 5.1, namely we get averaged energy universality of −1/2+ε Wigner matrices but only on scale b > N . In Section 16.1, we will use the idea of ”approximation by −1+ε a Gaussian divisible ensemble” and prove Theorem 5.1 down to any scale b > N . Theorem 15.1. [68, Theorem 2.2] Let H be an N × N real symmetric or complex Hermitian Wigner matrix. In the Hermitian case we assume√ that the real and imaginary parts are i.i.d. Suppose that the distribution ν of the rescaled matrix elements Nhij satisfies the decay condition (5.6). Fix a small ε > 0, an integer n n > 1 and let O : R → R be a continuous, compactly supported function. Then for any |E| < 2 and b ∈ [N −1/2+ε,N −ε], we have

Z E+b Z 1 0  (n) (n)  0 α  lim dE dα O(α) pH,N − pG,N E + = 0. (15.1) N→∞ 2b n N E−b R To prove this theorem, we first recall the matrix OU process (12.1) defined by

1 1 dHt = √ dBt − Htdt, (15.2) N 2

with the initial data H0. The eigenvalue evolution of this process is the DBM and recall that we denote the eigenvalue distribution at the time t by ftdµ with ft satisfying (12.17). In this section, we assume that the initial data H0 is a N × N Wigner matrix and the distribution of matrix element satisfies the uniform polynomial decay condition (5.6). We have the following Green function continuity theorem for the matrix OU process.

Theorem 15.2 (Continuity of Green function). Suppose that the initial data H0 is a N × N Wigner matrix with the distribution of matrix element satisfying the uniform polynomial decay condition (5.6). Let κ > 0 be −1+σ arbitrary and suppose that for some small parameter σ > 0 and for any y > N , we have the following estimate on the diagonal elements of the resolvent for any 0 6 t 6 1:   1 2σ max max ≺ N (15.3) 16k6N |E|62−κ Ht − E − iy kk with some constants C, c depending only on σ, κ. −1 −1 Let G(t, z) = (Ht − z) denote the resolvent and m(t, z) = N Tr G(t, z). Suppose that F (x1, . . . , xn) 0 is a function such that for any multi-index α = (α1, . . . , αn) with 1 6 |α| 6 5 and for any ε > 0 sufficiently small, we have  0  0 α ε C0ε max |∂ F (x1, . . . , xn)| : max |xj| N N (15.4) j 6 6 and   α 2 C0 max |∂ F (x1, . . . , xn)| : max |xj| N N (15.5) j 6 6

for some constant C0. −1−ε −1 Let ε > 0 be arbitrary and choose an η with N 6 η 6 N . For any sequence of complex parameters zj = Ej ± iη, j = 1, . . . , n, with |Ej| 6 2 − κ, there is a constant C, such that for any choices of the signs in the imaginary part of zj and for any t ∈ [0, 1], we have √ 20σ+6ε EF (m(t, z1),...,m (t, zn)) − EF (m(0, z1),...,m (0, zn)) 6 CN (t N). (15.6)

127 We defer the proof of Theorem 15.2 to the next subsection. The following result shows how the Green function continuity, i.e., estimate of the type (15.6), can be used to compare correlation functions. The proof will be given in Section 15.2. Next we will compare local statistics of two Wigner matrix ensembles. We will use the labels v and w to distinguish them because later in Section 16 we will denote the matrix elements of the two ensembles by v w different letters, vij and wij. For any two (generalized) Wigner matrix ensembles H and H , we denote v w v w the probability laws of their eigenvalues λ and λ by µv and µw respectively. Denote by m (z), m (z) the Stieltjes transforms of the eigenvalues, i.e.,

N 1 X 1 mv(z) = , N λv − z j=1 j

v (n) (n) and similarly for m (z). Let pv,N and pw,N be the n-point correlation functions of the eigenvalues w.r.t. µv v and µw. We remind the reader that sometimes we will use the notation E F (λ) to denote the expectation of EF (λv). Although the setup is motivated by local correlation functions of eigenvalues, we point out that the concept of matrices or eigenvalues plays no role in the following theorem. It is purely a statement about comparing two point processes on the real line, asserting that if the finite dimensional marginals of the Stieltjes transforms are close then the local correlation functions are also close. It is essential that in (15.8) below the spectral parameter of the Stieltjes transform must be at distance N −1−σ to the real axis for some σ > 0, i.e., below the scale of the eigenvalue spacing. Controlling the Stieltjes transform on this very short scale is necessary to identify and compare local correlation functions at the scale N −1. The following theorem is a slightly modified version of Theorem 6.4 [69]. Theorem 15.3 (Correlation function comparison). Let κ > 0 be arbitrary and suppose that for some small parameters σ, δ > 0 the following two conditions hold. (i) For any ε > 0 and any k integer

 v −1+ε k  w −1+ε k E Im m (E + iN ) + E Im m (E + iN ) 6 C (15.7)

holds for any |E| 6 2 − κ and N > N0(ε, k, κ);

−1−σj (ii) For any sequence zj = Ej + iηj, j = 1, . . . , n with |Ej| 6 2 − κ and ηj = N for some σj 6 σ, we have v v w w −δ E (Im m (z1) ··· Im m (zn)) − E (Im m (z1) ··· Im m (zn)) 6 N . (15.8)

Then for any integer n > 1 there are positive constants cn = cn(σ, δ) such that for any |E| 6 2 − 2κ and for any C1 function O : Rn → R with compact support, Z   α  (n) (n) −cn dα O(α) pv,N − pw,N E + 6 CN , (15.9) n N R where C depends on O and N is sufficiently large. We remark that in some applications we will use slightly different conditions. Instead of (15.8) we may assume v v w w −δ EF (Im m (z1),..., Im m (zn)) − EF (Im m (z1),..., Im m (zn)) 6 N , (15.10)  w w  where F is as in Theorem 15.2. Then (15.8) holds since we can approximate E Im m (z1) ... Im m (zn) by c the expression in (15.10) where the function F is chosen to be F (x1, . . . , xn) := x1x2 . . . xn if maxj |xj| 6 N 2c and it is smoothly cutoff to go to zero in the regime maxj |xj| > N for some small c > 0. To see this, recall the elementary inequality,

η2 η2 θη1 6 θη2 , implying Im m(E + iη1) 6 Im m(E + iη2) (15.11) η1 η1

128 for any η2 > η1 > 0. Hence (15.7) implies that

 v k k E Im m (z) 6 C(Nη) (15.12) for any k > 0. Let η = N −1+a, a > 0, and consider n = 1 for simplicity. By choosing c = 2a,

v v v v c −c(k−1)  v k ka−c(k−1) −δ E|F (Im m (z))−Im m (z)| 6 E Im m (z)1(Im m (z) > N ) 6 N E Im m (z) 6 N 6 N provided that k is large enough. Similar arguments can be used for general n and this implies that the difference between the expectation of F and the product is negligible. This proves that the condition (15.8) can be replaced by the condition (15.10). Notice that we pass through the continuity of traces of Green functions as an intermediate step to get continuity of the correlation functions. If we choose to follow the evolution of the correlation functions directly by differentiating them, it will involve higher derivatives of eigenvalues. From the formulas of derivatives of eigenvalues and eigenvectors (12.9), (12.10), higher derivatives of eigenvalues involve singularities of the −n form (λi − λj) for some positive integers n. These singularities are very difficult to control precisely. Our approach to use Green function as an intermediate step completely avoids this difficulty because Green function has a natural regularization parameter, the imaginary part of the spectral parameter.

Proof of Theorem 15.1. Recall the matrix OU (15.2) can be solved by the formula (12.21) so that the probability distribution of the matrix OU is given by a Wigner matrix ensemble if the initial data is a Wigner matrix ensemble. More precisely, if we denote the initial Wigner matrix by H0 then the distribution −t/2 −t 1/2 of Ht is the same as e H0 + (1 − e ) HG. Hence the rigidity holds in this case by (11.32) and (15.3) (n) holds with any sufficiently small σ > 0; we may choose σ = ε1. Recall pt,N denotes the correlation functions v w of Ht. We now apply Theorem 15.3 with H = H0 and H = Ht. The assumption (15.8) can be verified −1/2−ε by (15.6) if t 6 N for any ε > 0. From (15.9), we have Z E+b 0 Z dE (n) (n)  0 α  lim dα O(α) p0,N − pt,N E + = 0. (15.13) N→∞ 2b n N E−b R

This compares the correlation functions of H0 and Ht if t is not too large. To compare Ht with H∞ = HG, by −1/2−ε −1/2+10ε (n) (12.24), we have for t = N and b > N that (15.13) holds with p0,N (which is the correlation (n) functions of H0) replaced by pG,N . We have thus completed the proof of Theorem 15.1.

15.1 Proof of Theorem 15.2 The first ingredient to prove Theorem 15.2 is the following continuity estimate of matrix OU process. To state it, we need to introduce the deformation of a matrix H at certain matrix elements. For any i 6 j, let ij ij ij ij θ H denote a new matrix with matrix elements (θ H)k` = Hk` if {k, `} 6= {i, j} and (θ H)k,` = θk,`Hk` ij ij with some 0 6 θk,` 6 1 if {k, `} = {i, j}. In other words, the deformation operation θ multiplies the matrix elements Hij and Hji by a factor between 0 and 1 and leaves all other matrix elements intact.

Lemma 15.4. Suppose that H0 is a Wigner ensemble and fix t ∈ [0, 1]. Let g be a smooth function of the

matrix elements (hij)i6j and set  √  M := sup sup sup (N 3/2|h (s)3| + N|h (s)|) ∂3 g(θijH ) , (15.14) t E ij ij hij s 06s6t i6j θij where the last supremum runs through all deformations θij. Then √ Eg (Ht) − Eg (H0) = O(t N)Mt. This lemma holds also for generalized Wigner matrices. We refer the reader to [12] for the minor adjustment needed to this case.

129 Proof. Denote ∂ij = ∂hij ; notice that despite the two indices, this is still a first and not a second order partial derivate. By Itˆo’sformula, we have

X  1  ∂ g (H ) = − (h (t)∂ g(H )) − ∂2 g(H ) . tE t E ij ij t 2N E ij t i6j

A Taylor expansion of the first derivative ∂ijg in the direction hij yields   2 2  3 3 ij  E (hij(t)∂ijg(Ht)) = Ehij(t)∂ijghij (t)=0 + E hij(t) ∂ijghij (t)=0 + O sup E |hij(t) ∂ijg(θ Ht)| θij   2  3 3 ij  = sijE ∂ijghij (t)=0 + O sup E |hij(t) ∂ijg(θ Ht)| , θij   2  2  3 ij  E ∂ijg(Ht) = E ∂ijghij (t)=0 + O sup E |hij(t)(∂ijg(θ H)|) . θij

Here the short hand notation ∂ijghij (t)=0 means (∂ijg)(Het), where (Het)k` = (Ht)k` if {k, `}= 6 {i, j} and (Het)ij = (Het)ji = 0. In the calculation above we used the independence of the matrix elements, in particular that Het is independent of hij(t). We also used the fact that the OU process preserves the first and second 2 2 moments, i.e., Ehij(t) = Ehij(0) = 0 and E|hij(t)| = E|hij(0)| = 1/N. Thus we have   1/2 3/2 3 1/2 3 ij ∂tEg(Ht) = N O sup sup E(N |hij(t) | + N |hij(t)|)|∂ijg(θ Ht)| . i6j θij Integration over time finishes the proof. The second ingredient to prove Theorem 15.2 is an estimate of the type (15.3) but not only for diagonal matrix elements and for spectral parameters with imaginary parts slightly below N −1. Lemma 15.5. Suppose for a Wigner matrix H we have the following estimate   1 3σ+ε max sup sup Im ≺ N . (15.15) 1 k N −1−ε H − E ± iη 6 6 |E|62−κ η>N kk

−1−ε Then we have that for any η > N   1 3σ+ε sup sup ≺ N . (15.16) H − E ± iη 16k,`6N |E|62−κ k`

−1−ε We remark that (15.16) could be strengthened by including the supremum over η > N . This can be obtained by establishing the estimate first for a fine grid of η’s with spacing N −10 and then extend the bound for all η by using that the Green functions are Lipschitz continuous in η with a Lipschitz constant η−2.

Proof. Let λm and um denote the eigenvalues and eigenvectors of H, then by the definition of the Green function, we have

1/2 1/2   N " N 2 # " N 2 # 1 X |um(j)||um(k)| X |um(j)| X |um(k)| . (15.17) H − z 6 |λ − z| 6 |λ − z| |λ − z| jk m=1 m m=1 m m=1 m Define a dyadic decomposition

n−1 n Un = {m : 2 η 6 |λm − E| < 2 η}, n = 1, 2, . . . , n0 := C log N, (15.18)

n0 U0 = {m : |λm − E| < η},U∞ := {m : 2 η 6 |λm − E|},

130 and divide the summation over m into ∪nUn

N 2 2 2   X |um(j)| X X |um(j)| X X |um(j)| X 1 = 6 C Im n 6 C Im n . |λm − z| |λm − z| λm − E − i2 η H − E − i2 η jj m=1 n m∈Un n m∈Un n

Now using (15.15) we can control the right hand side of (15.17) and conclude (15.16). Now we can finish the proof of Theorem 15.2. First note that from the trivial bound

 1  y   1  Im 6 Im , η 6 y, (15.19) H − E − iη jj η H − E − iy jj and (15.3), the assumption (15.15) in Lemma 15.5 holds. Therefore the bounds (15.16) on the matrix elements are available. Next, we will first consider the specific function

g(H) = Gab(z), (15.20)

−1−ε for a fixed index pair a, b, where z = E + iη with N 6 η 6 1 and |E| 6 2 − κ. While this function is not of the form appearing in (15.6), once the comparison for Gab(z) is established, we take a = b and average over a to get the normalized trace of G, i.e., the Stieltjes transform. This will then prove (15.6) for the function F (x) = x, i.e., compare Em(t, z) with Em(0, z). The case of a polynomial F with several arguments is analogous, and similarly any function satisfying (15.4)–(15.5) can be sufficiently well approximated by a Taylor expansion. Returning to the case (15.20), in order to apply Lemma 15.4 we need to bound the third derivatives of g(H) to estimate Mt from (15.14). We have

3 X ∂ijG(z)ab = − G(z)aα1 G(z)β1,α2 G(z)β2,α3 G(z)β3,b, α,β where {αk, βk} = {i, j} or {j, i}. By (15.16), the following four expressions

G(z)aα1 ,G(z)β1α2 ,G(z)β2α3 ,G(z)β3b

4σ+ε −1−ε are bounded by N with very high probability provided N 6 η 6 1. Consequently, we proved that −1−ε uniformly in E ∈ (−2 + κ, 2 − κ), N 6 η 6 1,

3 20σ+5ε ∂ijG(z)ab = O(N ) with very high probability. The same argument holds if some matrix elements are reduced by deformation, i.e., we have 3 ij 20σ+5ε ∂ijg(θ Hs) 6 N 20σ+6ε −1/2 and thus (15.14) holds with Mt = N using the bound hij ≺ N . Hence by Lemma 15.4, we have proved that for any t 6 1 we have √ 20σ+6ε |Eg(Ht) − Eg(H0)| 6 CN (t N). This completes the proof of Theorem 15.2.

15.2 Proof of the correlation function comparison Theorem 15.3 Define an approximate delta function at the scale η by 1 1 θ (x) := Im . (15.21) η π x − iη

131 We will choose η ∼ N −1−a, with some small a > 0, i.e., slightly smaller than the typical eigenvalue spacing. This means that an observable of the form θη have sufficient resolution to detect individual eigenvalues. Moreover, polynomials of such observables detect correlation functions. On the other hand,

1 X θ (λ − E) = π Im m(z) , N η i i

therefore expectation values of such observables are covered by the condition (15.8). The rest of the proof consists of making this idea precise. There are two technicalities to resolve. First, correlation functions involve distinct eigenvalues (see (15.25) below), while polynomials of the resolvent include an overcounting of coinciding eigenvalues. Thus an exclusion-inclusion formula will be needed. Second, although η is much smaller than the relevant scale 1/N, it still does not give pointwise information on the correlation functions. However, the correlation functions in (4.38) are identified only as a weak limit, i.e., tested against a continuous function O. The continuity of O can be used to show that the difference between the exact correlation functions and the smeared out ones on the scale η ∼ N −1−a is negligible. This last step requires an a priori upper bound on the density to ensure that not too many eigenvalues fall into an irrelevantly small interval; this bound is given in (15.7), and it will eventually be verified by the local semicircle law.

Proof of Theorem 15.3. For notational simplicity, we give the detailed proof only for the case of n = 3-point correlation functions; the proof is analogous for the general case. −1−a Denote by η = N for some small a > 0; by following the proof one may check that any a 6 c min{δ, σ}/n2 will do. Let O be a compactly supported test function and let Z     1 β1 − α1 β3 − α3 Oη(β1, β2, β3) := 3 dα1dα2dα3O(α1, α2, α3)θη . . . θη N 3 N N R be its smoothing on scale Nη. We note that for any nonnegative C1 function O with compact support, there is a constant depending on O such that

0 Oη 6 COη0 , for any 0 6 η 6 η 6 1/N. (15.22)

0 −1+ε We can apply this bound with η = 1/N and combine it with (15.11) with η1 = 1/N and η2 = N for any small ε > 0 to have 3ε Oη 6 CN ON −1+ε , for any 0 6 η 6 1/N. (15.23)

αj After the change of variables xj = E + βj/N and with Ej := E + N , we have Z   (3) β1 β3 dβ1dβ2dβ3 Oη(β1, β2, β3)pw,N E + ,...,E + (15.24) 3 N N R Z Z (3) = dα1dα2dα3 O(α1, α2, α3) dx1dx2dx3pw,N (x1, x2, x3)θη(x1 − E1)θη(x2 − E2)θη(x3 − E3). 3 3 R R

By definition of the correlation function, for any fixed E, α1, α2, α3, Z (3) dx1dx2dx3pw,N (x1, x2, x3)θη(x1 − E1)θη(x2 − E2)θη(x3 − E3) (15.25)

1 X  α1   α2   α3  = w θ λ − E − θ λ − E − θ λ − E − , E N(N − 1)(N − 2) η i N η j N η k N i6=j6=k

where Ew indicates expectation w.r.t. the w variables. By the exclusion-inclusion principle, 1 X w θ (x − E )θ (x − E )θ (x − E ) = wA + wA + wA , (15.26) E N(N − 1)(N − 2) η 1 1 η 2 2 η 3 3 E 1 E 2 E 3 i6=j6=k

132 where 3 " # N 3 Y 1 X A := θ (λ − E ) , 1 N(N − 1)(N − 2) N η i j j=1 i 2 X A := θ (λ − E )θ (λ − E )θ (λ − E ), 3 N(N − 1)(N − 2) η i 1 η i 2 η i 3 i and A2 := B1 + B2 + B3 with Bj defined as follows: 1 X X B = − θ (λ − E )θ (λ − E ) θ (λ − E ); 3 N(N − 1)(N − 2) η i 1 η i 2 η k 3 i k

B1 consists of terms with j = k, and B2 consists of terms with i = k from the triple sum (15.26). We remark that A3 + A2 6 0 and thus Z 3 (3) w Y dx1dx2dx3pw,N (x1, x2, x3)θη(x1 − E1)θη(x2 − E2)θη(x3 − E3) 6 CE Im m(Ej + iη). (15.27) j=1

(3) The same bound holds for pv,N as well. Therefore, we can combine (15.23) and (15.7) to obtain for any small ε > 0 we have Z   (3) (3) β1 β3 3ε dβ1dβ2dβ3 Oη(β1, β2, β3)(pw,N + pv,N ) E + ,...,E + = O(N ). (15.28) 3 N N R To approximate wB , we define φ (x) = θ (x − E )θ (x − E ). Recall η = N −1−a and let η = E 3 E1,E2 η 1 η 2 b N −1−9a. Decompose θ = θ + θ where θ (y) = θ (y)1(|y| ηN 3a). Denote ηb b1 b2 b2 ηb > b Z −1 −3a θe = (1 − ψ(a)) θb1, ψ(a) := θb2(y)dy 6 N (15.29) so that R θ = 1. Using |φ0 (x)| Cη−2θ (x − E ) and |φ00 (x)| Cη−3θ (x − E ), we have e E1,E2 6 η 1 E1,E2 6 η 1 Z 1 ˆ |θe∗ φE ,E (x) − φE ,E (x)| dyθ1(y)[φE ,E (x − y) − φE ,E (x)] (15.30) 1 2 1 2 6 1 − ψ(a) 1 2 1 2 Z 2 6a ˆ 0 00 2 ηˆ N C dyθ1(y)[−φ (x)y + O(φ (x))y ] C θη(x − E1), 6 E1,E2 E1,E2 6 η3 (15.31)

ˆ R ˆ ˆ 3a where we used the symmetry of θ1, i.e., dyθ1(y)y = 0 and that θ1(y) is supported on |y| 6 ηNb . Together with (15.11) with the choice of η0 = N −1+ε for some small ε > 0, we have

1 X X ηˆ2N 6a 1 X [θe∗ φE ,E (λi) − φE ,E (λi)] θη(λk − E3) θη(λi − E1)θη(λk − E3) EN 3 1 2 1 2 6 Nη3 EN 2 i k i,k 2 8a+2ε ηˆ N 1 X 2 11a+2ε −6a θ −1+ε (λ − E )θ −1+ε (λ − E ) (Nηˆ) N N , (15.32) 6 Nη3 EN 2 N i 1 N k 3 6 6 i,k where we used the a priori bound (15.7) and then ε 6 a/2 in the two last steps. Hence we can approximate wB by E 3 Z w w −3 X −a E B3 = E dyφE1,E2 (y)N θe(λi − y)θη(λk − E3) + O(N ). (15.33) i,k −1 ˆ Recall θηˆ = θb1 + θb2 and θe = (1 − ψ(a)) θ1. We wish to replace θe in the right hand side of (15.33) by −1 −a (1 − ψ(a)) θηˆ with an error O(N ). For this purpose, we need to prove that the additional contribution −a from θb2 is bounded by O(N ).

133 −3a Since θb2(y) 6 CN θN 3aηˆ(y), we have Z w −3 X dyφE1,E2 (y)E N θb2(λi − y)θη(λk − E3) i,k Z w −3 X −3a 3a 6 dyφE1,E2 (y)E N N θN ηˆ(λi − y)θη(λk − E3). (15.34) i,k Now we show that the contribution of this error term to the r.h.s. of (15.24) is negligible. Recalling R E1 = E + α1/N and using dα1θη(y − E1) 6 CN, we have Z Z w −3 X −3a dα1dα2dα3 O(α1, α2, α3) dyθη(y − E1)θη(y − E2)E N N θN 3aηˆ(λi − y)θη(λk − E3) 3 R i,k Z −3a w −2 X X 6 CN dα2dα3E N θη ∗ θN 3aηˆ(λi − E2) θη(λk − E3) |α2|+|α3|6C i k Z −3a w −2 X X 6 CN dα2dα3E N θη(λi − E2) θη(λk − E3), (15.35) |α2|+|α3|6C i k 0 0 where we have used θη ∗ θη0 6 Cθη if η > η with C independent of η, η . Notice that we have only used that O is compactly supported and kOk∞ is bounded. Using (15.11) and (15.7), we can bound the last line by N −a+Cε. w As we will choose ε  a, neglecting the Cε exponent, we can thus approximate the contribution of E B3 to the r.h.s. of (15.24) by Z w dα1dα2dα3 O(α1, α2, α3)E B3 (15.36) 3 R Z Z −1 w −3 X −a = (1 − ψ(a)) dα1dα2dα3 O(α1, α2, α3) dyφE1,E2 (y)E N θηˆ(λi − y)θη(λk − E3) + O(N ). 3 R i,k

The same estimate holds if we replace the expectation Ew by Ev. Recalling (15.8), we can thus estimate their difference by Z w v dα1dα2dα3 O(α1, α2, α3)[E − E ]B3 (15.37) 3 R Z X C dyφ (y) [ w − v]N −3 θ (λ − y)θ (λ − E ) + O(N −a) 6 E1,E2 E E ηˆ i η k 3 i,k −1 −δ −a −a 6 C(Nη) N + O(N ) 6 O(N ) , provided that a 6 δ/2. Similar arguments can be used to prove that Z w v −a dα1dα2dα3 O(α1, α2, α3)[E − E ][A1 + A2 + A3] = O(N ). (15.38) 3 R Recalling (15.24), we have thus proved that Z   (3) (3) β1 β3 −a dβ1dβ2dβ3 Oη(β1, β2, β3)(pw,N − pv,N ) E + ,...,E + = O(N ), (15.39) 3 N N R

for any O compactly supported with kOk∞ bounded. In order to prove Theorem 15.3, it remains to replace Oη with O in (15.39); at this point we will use that O is differentiable. For this purpose, we only need to bound the error Z   (3) β1 β3 −a+ε dβ1dβ2dβ3 (O − Oη)(β1, β2, β3)pw,N E + ,...,E + = O(N ) (15.40) 3 N N R

134 and use the similar estimate with w replaced by v. One can check easily that there is a C1 nonnegative function Oe with compact support and kOek∞ 6 1 such that

|O − Oη| 6 C(Nη)Oeη. (15.41)

Hence (15.40) follows from applying (15.28) to Oe. Choosing ε much smaller than a, this completes the proof of Theorem 15.3.

135 16 Universality of Wigner matrices in small energy window: GFT

We have proved the universality of Wigner matrices in energy windows bigger than N −1/2+ε in Section 15. In this section, we will use the Green function comparison theorem to improve it to any energy windows of size bigger than N −1+ε. This is the Step 3 in the three step strategy introduced in Section 5. In the following, we will first prove the Green function comparison theorem and then use it to prove our main universality theorem, i.e., Theorem 5.1.

16.1 The Green function comparison theorems The main ingredient to prove this result is the following Green function comparison theorem stating that the correlation functions of eigenvalues of two matrix ensembles are identical on scale 1/N provided that the first four moments of all matrix elements of these two ensembles are almost identical. Here we do not assume that the real and imaginary parts are i.i.d., hence the k-th moment of hij is understood as the collection of R ¯s k−s numbers h h νij(dh), s = 0, 1, 2, . . . , k. Before stating the result we explain a notation. We will distinguish between the two ensembles by using (v) (w) different letters, vij and wij for their matrix elements and we often use the notation H and H to indicate the difference. Alternatively, one could denote the matrix elements of H by hij and make the distinction of the two ensembles in the measure, especially in the expectation, by using notations Ev and Ew. Since the matrix elements will be replaced one by one from one distribution to the other, the latter notation would have been cumbersome and so we will follow the first convention in this section.

Theorem 16.1 (Green function comparison). [69, Theorem 2.3] Suppose that we have two generalized N ×N (v) (w) −1/2 Wigner matrices, H and H , with matrix elements hij given by the random variables N vij and −1/2 N wij, respectively, with vij and wij satisfying the uniform polynomial decay condition (5.6). Fix a bijective ordering map on the index set of the independent matrix elements,

n o N(N + 1) φ : {(i, j) : 1 i j N} → 1, . . . , γ(N) , γ(N) := , 6 6 6 2

and denote by Hγ the generalized Wigner matrix whose matrix elements hij follow the v-distribution if (v) (w) φ(i, j) 6 γ and they follow the w-distribution otherwise; in particular H = H0 and H = Hγ(N). Let −1+τ κ > 0 be arbitrary and suppose that, for any small parameter τ > 0 and for any y > N , we have the following estimate on the diagonal elements of the resolvent   1 2τ max max max ≺ N (16.1) 1 k N 06γ6γ(N) 6 6 |E|62−κ Hγ − E − iy kk

with some constants C, c depending only on τ, κ. Moreover, we assume that the first four moments of vij and wij satisfy that s 4−s s 4−s −δ−2+s/2 Evijvij − Ewijwij 6 N , s = 0, 1, 2, 3, 4, (16.2) −1−ε −1 for some given δ > 0. Let ε > 0 be arbitrary and choose an η with N 6 η 6 N . For any sequence of complex parameters zj = Ej ± iη, j = 1, . . . , n, with |Ej| 6 2 − 2κ and with an arbitrary choice of the ± (v) (v) −1 signs. Let G (z) = (H − z) denote the resolvent and let F (x1, . . . , xn) be a function such that for any 0 multi-index α = (α1, . . . , αn) with 1 6 |α| 6 5 and for any ε > 0 sufficiently small, we have

 0  0 α ε C0ε max |∂ F (x1, . . . , xn)| : max |xj| N N (16.3) j 6 6

and   α 2 C0 max |∂ F (x1, . . . , xn)| : max |xj| N N (16.4) j 6 6

for some constant C0.

136 P −1−ε −1 Then, there is a constant C1, depending on ϑ, m km and C0 such that for any η with N 6 η 6 N m and for any choices of the signs in the imaginary part of zj , we have

    (v) (v) (v) (w) −1/2+C1ε −δ+C1ε F G (z1),...,G (zn) − F G → G C1N + C1N , (16.5) E a1b1 anbn E 6 where the arguments of F in the second term are changed from the Green functions of H(v) to H(w) and all other parameters remain unchanged. n Moreover, for any |E| 6 2−2κ, any k > 1 and any compactly supported smooth test function O : R → R we have Z  (n) (n)  α  lim dα O(α) pv,N − pw,N E + = 0, (16.6) N→∞ n N R (n) (n) where pv,N and pw,N denote the n-point correlation functions of the v and w-ensembles, respectively. Remark 1: We formulated Theorem 16.1 for functions of traces of monomials of the Green function because this is the form we need in the application. However, the result (and the proof we are going to present) holds directly for matrix elements of monomials of Green functions as well, for the precise statement, see [69]. We also remark that Theorem 16.1 holds for generalized Wigner matrices if Csup = supij Nsij < ∞. The positive lower bound on the variances, Cinf > 0 in (2.6) is not necessary for this theorem.

Remark 2: Although we state Theorem 16.1 for Hermitian and symmetric ensembles, similar results hold for real and complex sample covariance ensembles; the modification of the proof is obvious.

Proof of the Green function comparison Theorem 16.1: The basic idea is to estimate the effect of changing matrix elements of the resolvent one by one by a resolvent expansion. Since each matrix element has a typical size of N −1/2 and the resolvents are essentially bounded thanks to (16.1), a resolvent expansion up to the fourth order will identify the change of each element with a precision O(N −5/2) (modulo some tiny corrections of order N O(τ)). The expectation values of the terms up to fourth order involve only the first four moments of the single entry distribution, which can be directly compared. The error terms are negligible even when we sum them up N 2 times, the number of comparison steps needed to replace all matrix elements.

Now we turn to the one by one replacement. For notational simplicity, we will consider the case when the test function F has only n = 1 variable and k1 = 1, i.e., we consider the trace of a first order monomial; the general case follows analogously. Consider the telescopic sum of differences of expectations  1 1   1 1  F Tr − F Tr (16.7) E N H(v) − z E N H(w) − z γ(N) X   1 1   1 1  = F Tr − F Tr . E N H − z E N H − z γ=1 γ γ−1

Let E(ij) denote the matrix whose matrix elements are zero everywhere except at the (i, j) position, where (ij) it is 1, i.e., Ek` = δikδj`. Fix an γ > 1 and let (i, j) be determined by φ(i, j) = γ. We will compare Hγ−1 with Hγ . Note that these two matrices differ only in the (i, j) and (j, i) matrix elements and they can be written as 1 (ij) (ji) Hγ−1 = Q + √ V,V := vijE + vjiE N

1 (ij) (ji) Hγ = Q + √ W, W := wijE + wjiE , N with a matrix Q that has element at the (i, j) and (j, i) positions and where we set vji := vij for i < j and similarly for w. Define the Green functions 1 1 R := ,S := . Q − z Hγ − z

137 We first claim that the estimate (15.16) holds for the Green function R as well. To see this, we have, from the resolvent expansion,

R = S + N −1/2SVS + ... + N −9/5(SV )9S + N −5(SV )10R.

Since V has only at most two nonzero element, when computing the (k, `) matrix element of this matrix identity, each term is a finite sum involving matrix elements of S or R and vij, e.g. (SVS)k` = SkivijSj` + SkjvjiSi`. Using the bound (15.16) for the S matrix elements, the subexponential decay for vij and the −1 trivial bound |Rij| 6 η , we obtain that the estimate (15.16) holds for R. We can now start proving the main result by comparing the resolvents of H(γ−1) and H(γ) with the resolvent R of the reference matrix Q. By the resolvent expansion,

S = R − N −1/2RVR + N −1(RV )2R − N −3/2(RV )3R + N −2(RV )4R − N −5/2(RV )5S, (16.8)

so we can write 4 1 X −m/2 (m) −5/2 Tr S = Rb + ξ, ξ := N Rb + N Ωv N v m=1 with 1 (m) m 1 m 1 5 Rb := Tr R, Rb := (−1) Tr(RV ) R, Ωv := − Tr(RV ) S. (16.9) N v N N (m) For each diagonal element in the computation of these traces, the contribution to Rb, Rbv and Ωv is a sum of a few terms. E.g.

(2) 1 X h i Rb = RkivijRjjvjiRik + RkivijRjivijRjk + RkjvjiRiivijRjk + RkjvjiRijvjiRik v N k and similar formulas hold for the other terms. Then we have  1 1    EF Tr =EF Rb + ξ (16.10) N Hγ − z h 0 00 2 (5) 0 5i =E F (Rb) + F (Rb)ξ + F (Rb)ξ + ... + F (Rb + ξ )ξ 5 X −m/2 (m) = N EAv , m=0

0 (m) where ξ is a number between 0 and ξ and it depends on Rb and ξ; the Av ’s are defined as

(0) (1) 0 (1) (2) 00 (1) 2 0 (2) Av = F (Rb),Av = F (Rb)Rbv ,Av = F (Rb)(Rbv ) + F (Rb)Rbv ,

(3) (4) and similarly for Av and Av . Finally,

(5) 0 (5) 0 (1) 5 Av = F (Rb)Ω + F (Rb + ξ )(Rbv ) + ....

(m) The expectation values of the terms Av , m 6 4, with respect to vij are determined by the first four moments of vij, for example

(2) 0 h 1 X i 2 00 h 1 X i 2 A = F (Rb) RkiRjjRik + ... |vij| + F (Rb) RkiRj`R`jRik + ... |vij| E v N E N 2 E k k,`

0 h 1 X i 2 00 h 1 X i 2 +F (Rb) RkiRjiRjk + ... v + F (Rb) RkiRj`R`iRjk + ... v . N E ij N 2 E ij k k,`

138 Note that the coefficients involve up to four derivatives of F and normalized sums of matrix elements of R. Using the estimate (15.16) for R and the derivative bounds (16.3) for the typical values of Rb, we see that all these coefficients are bounded by N C(τ+ε) with a very large probability, where C is an explicit constant. We use the bound (16.4) for the extreme values of Rb but this event has a very small probability by (15.16). s u (0) (4) Therefore, the coefficients of the moments E v¯ijvij, u + s 6 4, in the quantities Av ,...,Av are essentially C(τ+ε) bounded, modulo a factor N . Notice that the k-th moment of vij appears only in the m = k term that already has a prefactor N −k/2 in (16.10). (5) Finally, we have to estimate the error term Av . All terms without Ω can be dealt with as before; after C(τ+ε) estimating the derivatives of F by N , one can perform the expectation with respect to vij that is independent of Rb(m). For the terms involving Ω one can argue similarly, by appealing to the fact that the C(τ+ε) matrix elements of S are also essentially bounded by N , see (15.16), and that vij has subexponential decay. Alternatively, one can use H¨olderinequality to decouple S from the rest and use (15.16) directly, for example:

1/2 0 1 0 5 1 h 0 2 2i  5 ∗ 51/2 − 5 +C(τ+ε) |F (Rb)Ω| = |F (Rb) Tr(RV ) S| (F (Rb)) Tr S Tr(RV ) (VR ) CN 2 . E N E 6 N E E 6

Note that exactly the same perturbation expansion holds for the resolvent of Hγ−1 with Av replaced by Aw, i.e., vij is replaced with wij everywhere. By the moment matching condition (16.2), the expectation values

−m/2 (m) (m) C(τ+ε)−δ−2 N EAv − EAw 6 N for m 6 4 in (16.10). Choosing τ = ε, we have     1 1 1 1 −5/2+Cε −2−δ+Cε E F Tr − E F Tr 6 CN + CN . N Hγ − z N Hγ−1 − z

After summing up in (16.7) we have thus proved that

 1 1   1 1  F Tr − F Tr CN −1/2+Cε + CN −δ+Cε. E N H(v) − z E N H(w) − z 6

The proof can be easily generalized to functions of several variables. This concludes the proof of Theo- rem 16.1.

16.2 Conclusion of the three step strategy In this section we put the previous results together to prove our main result Theorem 5.1. Recall the main result of the step 2, Theorem 12.4, asserts the universality of Gaussian divisible ensemble. However, in order to reach energy window N −1+ε, we need to substantial large Gauss component. In terms of DBM, we need −cε the time t > N for some c > 0. The eigenvalue continuity result in the previous section is unfortunately −1−ε restricted to t 6 N . To over come the deficiency of the continuity, we will approximate the Wigner ensemble by a suitably chosen Gaussian divisible ensemble. Thus we term this step “ Approximation by a Gaussian divisible ensemble”. We first review the classical moment matching problem for real random variables. For any real random k variable ξ, denote by mk(ξ) = Eξ its k-th moment. By Schwarz inequality, the sequence of moments, k 2 m1, m2,... are not arbitrary numbers, for example |m1| 6 mk and m2 6 m4, etc, but there are more subtle relations. For example, if m1 = 0, then 2 3 m4m2 − m3 > m2 , (16.11) which can be obtained by

2  32  2 2  2 2 2 2 m3 = Eξ = Eξ(ξ − 1) 6 Eξ E(ξ − 1) = m2(m4 − 2m2 + 1)

139 and noticing that (16.11) is scale invariant, so it is sufficient to prove it for m2 = 1. In fact, it is easy to see that (16.11) saturates if and only of the support of ξ consists of exactly two points (apart from the trivial case when ξ ≡ 0). This restriction shows that given a sequence of four admissible moments, m1 = 0, m2 = 1, m3, m4, there may not exist a Gaussian divisible random variable ξ with these moments; e.g. the moment sequence (m1, m2, m3, m4) = (0, 1, 0, 1) uniquely characterizes the standard Bernoulli variable (ξ = ±1 with 1/2-1/2 probability). However, if we allow a bit room in the fourth moment, then one can match any four admissible moments by a small Gaussian divisible variable. This is the content of the next lemma [69, Lemma 6.5]. This matching idea in the context of random matrices appeared earlier in [130]. For simplicity, we formulate it in the case of real variables, the extension to complex variables is standard.

Lemma 16.2. [Moment matching] Let m3 and m4 be two real numbers such that

2 m4 − m3 − 1 > 0, m4 6 C2 (16.12)

G for some positive constant C2. Let ξ be a Gaussian random variable with mean 0 and variance 1. Then for any sufficient small γ > 0 (depending on C2), there exists a real random variable ξγ with subexponential decay and independent of ξG, such that the first four moments of

0 1/2 1/2 G ξ = (1 − γ) ξγ + γ ξ (16.13)

0 0 0 0 are m1(ξ ) = 0, m2(ξ ) = 1, m3(ξ ) = m3 and m4(ξ ), and

0 |m4(ξ ) − m4| 6 Cγ (16.14) for some C depending on C2.

Proof. We first construct a random variable X with the first four moments given by 0, 1, m3, m4 satisfying 2 m4 − m3 − 1 > 0. We take the law of X to be of the form

pδa + qδ−b + (1 − p − q)δ0, where a, b, p, q > 0 are parameters satisfying p + q 6 1. The conditions m1(X) = 0 and m2(X) = 1 imply 1 1 p = , q = . a(a + b) b(a + b)

Furthermore, the condition p + q 6 1 reads ab > 1. By an explicit computation we find

2 m3(X) = a − b , m4(X) = m3(X) + ab . (16.15)

2 Clearly, the condition m4 − m3 − 1 > 0 implies that (16.15) has a solution with ab > 1. This proves the existence of a random variable supported at three points with given four moments 0, 1, m3, m4 satisfying 2 m4 − m3 − 1 > 0. Our main task is to construct a Gaussian divisible one which approximate the four moment matching condition stated in the lemma. For any real random variable ζ, independent of ξG, and with the first four moments being 0, 1, m3(ζ) and m4(ζ) < ∞, the first four moments of

ζ0 = (1 − γ)1/2ζ + γ1/2ξG (16.16) are 0, 1, 0 3/2 m3(ζ ) = (1 − γ) m3(ζ) (16.17) and 0 2 2 m4(ζ ) = (1 − γ) m4(ζ) + 6γ − 3γ . (16.18)

140 Since we can match four moments by a random variable, for any γ > 0 there exists a real random variable ξγ such that the first four moments are 0, 1,

−3/2 m3(ξγ ) = (1 − γ) m3 (16.19) and 2 2 m4(ξγ ) = m3(ξγ ) + (m4 − m3).

2 3/2 With m4 6 C2, we have m3 6 C2 , thus

|m4(ξγ ) − m4| 6 Cγ (16.20)

0 1/2 1/2 G for some C depending on C2. Hence with (16.17) and (16.18), we obtain that ξ = (1 − γ) ξγ + γ ξ 0 satisfies m3(ξ ) = m3 and (16.14). This completes the proof of Lemma 16.2. We now prove Theorem 5.1, i.e, that the limit in (15.1) holds. We restrict ourselves to the real symmetric case, since the moment matching Lemma 16.2 was formulated for real random variables. A similar argument works for the complex case. From the universality of Gaussian divisible ensembles (more precisely, (12.24)) −1+10ε −ε for any b = N and t > N , we have Z E+b Z 1 0 (n) (n)  0 α  dE dα O(α) pt,N − pG,N E + → 0 , (16.21) 2b n N E−b R √ (n) −t/2 −t G where pt,N is the eigenvalue correlation function of the matrix ensemble Ht = e H0 + 1 − e H . The initial matrix ensemble H0 can be any Wigner ensemble. Recall H is the Wigner ensemble for which we √wish to prove (15.1). Given the first four moments m1 = 0, m2 = 1, m3, m4 of the rescaled matrix elements −t Nhij of H, we first use Lemma 16.2 to construct a distribution ξγ with γ = 1 − e . This distribution has the property that the first four moments of ξ0, defined by (16.13), matches the first four moments of the rescaled matrix elements of H. We now choose H0 to be the Wigner ensemble with rescaled matrix elements distributed according to ξγ . Then, on one hand, the matrix Ht will have universal local spectral statistics in the sense of (16.21); on the other hand we can apply the Green function comparison Theorem 16.1 to the matrix ensembles H and Ht so that (15.9) holds for these two ensembles. Clearly, (15.1) follows from (16.21) and (15.9). Note that averaging in the energy parameter E was necessary only for the first step in (16.21), the comparison argument works for any fixed energy E.

141 17 Edge Universality

The main result of edge universality for generalized Wigner matrix is given by the following theorem. For simplicity, we formulate it only for the real symmetric case, the complex Hermitian case is analogous.

Theorem 17.1 (Universality of extreme eigenvalues). Suppose that we have two real symmetric generalized (v) (w) −1/2 N × N Wigner matrices, H and H , with matrix elements hij given by the random variables N vij −1/2 and N wij, respectively, with vij and wij satisfying the uniform subexponential decay condition (2.7). Suppose that 2 2 Evij = Ewij. (17.1)

(v) (w) (v) (w) Let λN and λN denote the largest eigenvalues of H and H , respectively. Then there is an ε > 0 and δ > 0 depending on ϑ in (2.7) such that for any real parameter s (may depend on N) we have

2/3 (v) −ε −δ 2/3 (w) 2/3 (v) −ε −δ P(N (λN − 2) 6 s − N ) − N 6 P(N (λN − 2) 6 s) 6 P(N (λN − 2) 6 s + N ) + N (17.2)

for N > N0 sufficiently large, where N0 is independent of s. Analogous result holds for the smallest eigenvalue λ1. This theorem shows that the statistics of the extreme eigenvalues depend only on the second moments of the centered matrix entries, but the result does not determine the distribution. For the Gaussian case, the corresponding distribution was identified by Tracy and Widom in [133,134]. Theorem 17.1 immediately gives the same result for Wigner matrices, but not yet for generalized Wigner matrices. In fact, the Tracy-Widom law for generalized Wigner matrices does not follow from Theorem 17.1 and its proof in [24] required a quite different argument, see Theorem 18.7. (v) We first give an outline of the proof of Theorem 17.1. Denote by %N the empirical eigenvalue distribution

N (v) 1 X % (x) = δ(x − λ(v)) . N N α α=1

(v) (w) (v) (w) Our goal is to compare the difference of %N and %N via their Stieltjes transforms, mN (z) and mN (z). We will be able to locate the largest eigenvalue via the distribution of certain functionals of the Stieltjes (v) transforms. This will be done in the preparatory Lemma 17.2 and its Corollary 17.3. Therefore, if mN (z) (w) (v) (w) and mN (z) are sufficiently close, we will be able to compare λN and λN . The main ingredient of the proof of Theorem 17.1 is the Green function comparison theorem at the edge (v) (w) −2/3 (Theorem 17.4), which shows that the distributions of mN (z) and mN (z) on the critical scale Im z ≈ N are the same provided the second moments of the matrix elements match. This will be done by a resolvent expansion, similarly to the Green function comparison theorem in the bulk (Theorem 16.1). However, the resolvent expansion is more efficient at the edge for a reason that we explain now.

Recall (6.13) stating that msc(z)  η if κ = |E| − 2 6 η and z = E + iη. Since the extremal eigenvalue gaps are expected to be order N −2/3, we need to set η  N −2/3 in order to identify individual eigenvalues. This is a smaller scale than that of the local semicircle law; in fact mN (z) does not have a deterministic limit; we need to identify its distribution. Hence the reference size of Im mN (z) which we should keep in mind is N −1/3. The largest eigenvalue can be located via the Stieltjes transform as follows. By definition of mN , if there is an eigenvalue in a neighborhood of E within distance η then 1 X η 1 Im m (E + iη) =  N −1/3, N N (λ − E)2 + η2 > Nη α α

−1/3 and otherwise Im mN (E + iη) . N . Based upon this idea, we will construct a certain functional of Im mN that expresses the distribution of λN . In other words, we expect that if we can control the imaginary part of the trace of the Green function by N −1/3, then we can identify the individual extremal eigenvalues.

142 Roughly speaking, our basic principle is that whenever we understand mN (z) with a precision better than (Nη)−1, i.e., we can improve over the strong local semicircle law (6.32)

1 mN (z) − msc(z) ≺ , Im z = η Nη

at some scale η, then we can ”identify” the eigenvalue distribution at that scale precisely. It is instructive to compare the edge situation with the bulk, where the critical scale η  1/N identifies individual eigenvalues and the typical size of Im mN is of order 1. In other words, at the critical scale Im mN is of order one in the bulk and it is of order N −1/3 at the edge. Owing to (6.31), this fact extends to each resolvent matrix −1/3 element, i.e., |Gij| . N at the edge. Thus a resolvent expansion is more powerful near the edge, which explains that less moment matching is needed than in the bulk. Now we start the detailed proof of Theorem 17.1. We first introduce a few notation. For any E1 6 E2, let

N (E1,E2) := #{j : E1 6 λj 6 E2} = Tr 1[E1,E2](H) −2/3+ε be the unnormalized counting function of the eigenvalues. Set E+ := 2 + N . From rigidity, Theo- rem 11.5, we know that λN 6 E+, i.e.,

N (E, ∞) = N (E,E+) (17.3)

with very high probability, so it is sufficient to consider the counting function N (E,E+) = Tr χE(H), with

χE(x) := 1[E,E+](x). We will approximate the sharp cutoff function χE its smoothened version χE ? θη on −2/3 scale η  N , where we recall the notation θη(x) from (15.21). The following lemma shows that the error caused by this replacement is negligible.

−2/3−3ε −2/3−9ε Lemma 17.2. For any ε > 0 set `1 = N and η = N . Then there exist constants C, c such that 3 −2/3+ε for any E satisfying |E − 2| 6 2 N and for any D we have ( ) −2ε  −D P Tr χE(H) − Tr χE ? θη(H) 6 C N + N (E − `1,E + `1) > 1 − N (17.4)

if N > N0(D) is large enough. We will not give the detailed proof of this lemma since it is a straightforward approximation argument. −2/3 We only point out that θη is an approximate delta function on scale η  N , and if it were compactly supported, then the difference between Tr χE(H) and Tr χE ? θη(H) were clearly bounded by the number of −1 2 2 eigenvalues in an η-vicinity of E. Since θη(x) = π η/(x + η ), i.e., it has a quadratically decaying tail, the above argument must be complemented by estimating the density of eigenvalues away from E. This estimate comes from two contributions. Eigenvalues within the `1-vicinity of E are directly estimated by the term −6ε N (E − `1,E + `1). Eigenvalues farther away come with an additional factor η/`1 = N due to the decay ε of θη, thus it is sufficient to estimate their density only up to an N factor precision, which is provided by the optimal local law at the edge. For precise details, see the proof of Lemma 6.1 in [70] for essentially the same argument. Using (17.3) and the local law to estimate a slightly averaged version of N (E − `1,E + `1), we arrive at the following Corollary:

−2/3+ε 1 2ε 1 −2/3−ε Corollary 17.3. Let |E − 2| 6 N , ` = 2 `1N = 2 N . Then the inequality

−ε −ε Tr χE+` ? θη(H) − N 6 N (E, ∞) 6 Tr χE−` ? θη(H) + N (17.5) holds with probability bigger than 1 − N −D for any D if N is large enough. Furthermore, we have

   −D EF Tr χE−` ? θη(H) 6 P N (E, ∞) = 0 6 EF Tr χE+` ? θη(H) + CN . (17.6)

143 3 −2/3−ε Proof. We have E+ − E  ` thus |E − 2 − `| 6 2 N , therefore (17.4) holds for E replaced with y ∈ [E − `, E] as well. We thus obtain Z E −1 Tr χE(H) 6 ` dy Tr χy(H) E−` Z E Z E −1 −1  −2ε  6 ` dy Tr χy ∗ θη(H) + C` dy N + N (y − `1, y + `1) E−` E−` ` Tr χ ∗ θ (H) + CN −2ε + C 1 N (E − 2`, E + `) 6 E−` η ` with a probability larger than 1 − N −D for any D > 0. From the rigidity estimate in the form (11.25), the −2/3+ε −2ε −2/3 conditions |E − 2| 6 N , `1/` = 2N and ` 6 N , we can bound Z E+` `1 1−2ε −2ε ε −ε N (E − 2`, E + `) 6 N %sc(x)dx + N N 6 CN ` E−2` with a very high probability, where we estimated the explicit integral using that the integration domain is in a CN −2/3+ε-vicinity of the edge at 2. We have thus proved

−ε N (E,E+) = Tr χE(H) 6 Tr χE−` ∗ θη(H) + N .

By (17.3), we can replace N (E,E+) by N (E, ∞). This proves the upper bound of (17.5) and the lower bound can be proved similarly. On the event that (17.5) holds, the condition N (E, ∞) = 0 implies that Tr χE+` ∗ θη(H) 6 1/9. Thus we have −D P (N (E, ∞) = 0) 6 P (Tr χE+` ∗ θη(H) 6 1/9) + CN . (17.7) Together with the Markov inequality, this proves the upper bound in (17.6). For the lower bound, we use   −ε  E F Tr χE−` ∗ θη(H) 6 P Tr χE−` ∗ θη(H) 6 2/9 6 P N (E, ∞) 6 2/9 + N = P N (E, ∞) = 0 , where we used the upper bound from (17.5) and that N is an integer. This completes the proof of the Corollary.  Since P(λN < E) = P N (E, ∞) = 0 , we will use (17.6) and the identity

Z E+ Tr χE ? θη(H) = N dy Im mN (y + iη) (17.8) E

to relate the distribution of λN with that of the Stieltjes transform mN . The following Green function comparison theorem shows that the distribution of the right hand side of (17.8) depends only on the second moments of the matrix elements. Its proof will be given after we have completed the proof of Theorem 17.1. Theorem 17.4 (Green function comparison theorem on the edge). Suppose that the assumptions of Theorem 17.1, including (17.1), hold. Let F : R → R be a function whose derivatives satisfy

(α) −C1 max |F (x)| (|x| + 1) C1, α = 1, 2, 3, (17.9) x 6

with some constant C1 > 0. Then there exists ε0 > 0 depending only on C1 such that for any ε < ε0 and for any real numbers E1 and E2 satisfying −2/3+ε −2/3+ε |E1 − 2| 6 CN , |E2 − 2| 6 CN , and setting η = N −2/3−ε, we have ! ! Z E2 Z E2 (v) (w) −1/6+Cε EF N dy Im m (y + iη) − EF N dy Im m (y + iη) 6 CN (17.10) E1 E1 for some constant C and large enough N depending only on C1, ϑ, δ± and C0 (in (14.26)).

144 1/3+ε Note that the N times the integration gives a factor N|E1 − E2| ∼ N on the left hand side which compensates for the ”natural” size of the imaginary part of the Stieltjes transform obtained from the lo- cal semicircle law and makes the argument of F to be of order one. The bound (17.10) shows that the distributions of this order one quantity with respect to the v and w ensembles are asymptotically the same.

Assuming that Theorem 17.4 holds, we now prove Theorem 17.1. Proof of Theorem 17.1. By the rigidity of eigenvalues in the form of (11.25), we know that

−2/3+ε −2/3+ε −2/3+ε ε |λN − 2| 6 N , N (2 − CN , 2 + CN ) 6 CN ε with very high probability. Thus we can assume that |s| 6 N holds for the parameter s in Theorem 17.1; the statement is trivial for other values of s. We define E := 2 + sN −2/3. With the left side of (17.6), for any sufficiently small ε > 0, we have

w w E F (Tr χE−` ∗ θη(H)) 6 P (N (E, ∞) = 0) (17.11) with the choice 1 ` := N −2/3−ε, η := N −2/3−9ε. 2

The bound (17.10) applying to the case E1 = E −` and E2 = E+ shows that there exist δ > 0, for sufficiently small ε > 0, such that

v w −δ E F (Tr χE−` ∗ θη(H)) 6 E F (Tr χE−` ∗ θη(H)) + N (17.12) (note that 9ε plays the role of the ε in the Green function comparison theorem). Then applying the right side of (17.6) to the l.h.s of (17.12), we have

v v −D P (N (E − 2`, ∞) = 0) 6 E F (Tr χE−` ∗ θη(H)) + CN . (17.13) Combining these inequalities, we have

v w −δ P (N (E − 2`, ∞) = 0) 6 P (N (E, ∞) = 0) + 2N (17.14) for sufficiently small ε > 0 and sufficiently large N. Recalling that E = 2 + sN −2/3, this proves the first inequality of (17.2) and, by switching the role of v, w, the second inequality of (17.2) as well. This completes the proof of Theorem 17.1.

Proof of Theorem 17.4. We follow the notations and the setup in the proof of Theorem 16.1. First we prove a simpler version of (17.10) with F (x) = x and without integration. Namely, we show that for any E with −2/3+ε |E − 2| 6 N we have 1 1 1 1 Nη m(v)(z) − m(w)(z) = Nη Tr − Tr CN −1/6+Cε, z = E + iη. E N E N E N H(v) − z E N H(w) − z 6 (17.15)

The prefactor Nη accounts for the N times the integration from E1 to E2 in (17.10) since |E1 − E2| 6 Cη. We write

γ(N) 1 1 1 1 X  1 1 1 1  Tr − Tr = Tr − Tr , (17.16) E N H(v) − z E N H(w) − z E N H − z E N H − z γ=1 γ γ−1

−1 −1 with R := (Q − z) , S := (Hγ − z) and

4 1 X −m/2 (m) −5/2 Tr S = Rb + ξv, ξv := N Rb + N Ωv (17.17) N v m=1

145 with 1 (m) m 1 m 1 5 Rb := Tr R, Rb := (−1) Tr(RV ) R, Ωv := − Tr(RV ) S. (17.18) N v N N All these quantities depend on γ, equivalently on the index pair (i, j) with φ(i, j) = γ (see the proof of Theorem 16.1), i.e., Rb = Rb(γ) etc., but we will omit this fact from the notation. We also define the same quantities for v replaced with w. By the moment matching condition (17.1),

(m) (m) ERbv = ERbw , m = 0, 1, 2, −2/3+ε so we can consider the m = 3, 4 terms only in the summation in (17.17). Since |E − 2| 6 N and η = N −2/3−ε, the strong local semicircle law (6.32) implies that s Im msc(z) 1 −1/3+Cε Rij(z) − δijmsc(z) ≺ + N , (17.19) Nη Nη 6 and similar bound holds for S. In particular, the off-diagonal terms of the Green functions are of order N −1/3+Cε and the diagonal terms are bounded with high probability. As a crude bound, we immediately (m) Cε obtain that |Rbv |, |Ωv| 6 N with high probability. Hence we have γ(N) 1 1 1 1 4 h i X X 2 −m/2 (m) (m) 1/2+Cε (Nη) E Tr − E Tr N (Nη)N max E Rbv − Rbw + ηN . N H − z N H − z 6 γ γ=1 γ γ−1 m=3 (17.20) Given a fixed γ with φ(i, j) = γ, we have the explicit formula m (m) (m) m 1 X X X h i Rb = Rb (γ) = (−1) Rka1 va1b1 Rb1a2 . . . vambm Rbmk . (17.21) v v N k `=1 {a`,b`}={i,j} First we discuss the generic case, when all indices are different, in particular i 6= j, and later we comment

on the case of the coinciding indices. Notice that generically, the first and the last resolvents Rka1 ,Rbmk are off-diagonal terms, thus their size is N −1/3+Cε. Every other factor in (17.21) is O(N ε). Hence the generic contributions to the sum (17.20) give

2 −m/2 (m) 2−m/2 −1/3+Cε N (Nη)N Rbv 6 N N . (17.22) For m = 4 this is N −1/3+Cε, so it is negligible. For m = 3, we have 2 −m/2 (3) 1/6+Cε N (Nη)N Rbv 6 N , (17.23) so the straightforward power counting is not sufficient. Our goal is to show that the expectation of this term is in fact much smaller. To see this, we can assume that on the right hand side of (17.21), all the Green functions except the first and the last one are diagonal since otherwise we have an additional (third) off-diagonal term, which yields an extra factor N −1/3+Cε, which would be sufficient if combined with the trivial power counting yielding (17.23). If both intermediate Green functions in (17.21) are diagonal, then we can replace each of them with msc using the strong local semicircle law (17.19) at the expense of a multiplicative error N −1/3+Cε. This improves the previous power counting of order N 1/6+Cε to N −1/6+Cε as required in (17.15). Thus we only have to consider the contributions from the terms 1 X X h i m2 R v v v R , (17.24) N scE ka1 a1b1 a2b2 a3b3 b3k k {a`,b`}={i,j}

(3) contributing to Rbv , where the summation is restricted to the choices consistent with the fact that the all Green functions in the middle are diagonal, i.e., b1 = a2, b2 = a3. Together with the restriction {a`, b`} = {i, j} and the assumption that all indices are distinct, it yields only two terms: 1 X h i 1 X h i m2 R v v v R + m2 R v v v R . (17.25) N scE ki ij ji ij jk N scE kj ji ij ji ik k k

146 2 −3/2 5/6 Considering the prefactors N (Nη)N 6 N in (17.23), in order to prove (17.15) we need to show that both terms in (17.25) are bounded by N −1+Cε, i.e., they are at least by a factor N −1/3 smaller than the trivial power counting N −2/3+Cε from the two off-diagonal elements would give. We will focus on the first term in (17.25). For the last resolvent entry, Rjk, we use the identity

(i) RjiRik Rjk = Rjk + Rii

−1/3+Cε from (8.1). The second factor has already two off-diagonal elements, RjiRik, yielding an additional N factor with high probability (i.e., even without taking the expectation). So we focus on the term

1 X h (i)i m2 R v v v R . (17.26) N scE ki ij ji ij jk k

Recall the identity (8.2) (i) X (i) −1/2  Rki = −Rkk Rk` N v`i . ` Hence we have that

(i) 1 X X h (i) (i)i (17.26) = − m2 N −1/2 R R v v v v R . (17.27) N sc E kk k` `i ij ji ij jk k `

We may again replace the diagonal term Rkk with msc at a negligible error and we are left with

(i) 3 −3/2 X X h (i) (i)i −mscN E Rk` v`ivijvjivijRjk . (17.28) k `

The expectation with respect to the variable v`i renders this term zero, unless ` = j, in which case we gain an additional factor N −1/2+Cε in the power counting compared with (17.26). So the contribution of the last displayed expression to (17.23) is improved from N 1/6+Cε to N −1/3+Cε. The argument so far assumed that all indices are distinct. In case of each coinciding index we gain a factor 1/N from the restriction in the summation and one off-diagonal element, accounted for as a factor N −1/3+Cε, may be ”lost” in a sense that it may become diagonal. The total balance of a coinciding index in the power counting is thus a factor N −2/3, so we conclude that terms with coinciding indices are negligible. This proves the simplified version (16.5) of Theorem 17.4. To prove (17.10), for simplicity we ignore the integration and we prove only that     (v) (w) −1/6+Cε EF Nη Im m (z) − EF Nη Im m (z) 6 CN , z = E + iη, (17.29)

−2/3+ε 2ε for any E with |E −2| 6 N . The η factor, up to an irrelevant N accounts for the integration domain −2/3+ε of length |E2 − E1| 6 N in the arguments in (17.10). The general form (17.10) follows in the same way. We point out that (17.29) with F (Nη Im m(z)) replaced by F (Nη m(z)) also holds with the same argument given below. Taking into account the telescopic summation over all γ’s, for the proof of (17.29) it is sufficient to prove that     2 1 1 1 1 −1/6+Cε N EF Nη Im Tr − EF Nη Im Tr 6 CN , (17.30) N Hγ − z N Hγ − z and then sum it up for all γ = φ(i, j). As before, we assume the generic case, i.e., i 6= j. We Taylor expand the function F around the point

Ξ := Nη Im m(R)(z), z = E + iη,

147 (R) 1 (R) where m = N Tr R. Notice that m is independent of vij and wij. Recalling the definition of

4 X −m/2 (m) −5/2 ξv = N Rbv + N Ωv (17.31) m=1

from (17.17), and a similar definition for ξw, we obtain

 1 1   1 1 

EF Nη Im Tr − EF Nη Im Tr (17.32) N Hγ − z N Hγ−1 − z   1  2 2 F 0(Ξ) Nηξ − ξ  + F 00(Ξ) Nηξ  − Nηξ  + .... 6 E v w 2 E v w We now substitute (17.31) into this expression. By the naive power counting

−m/2 (m) −m/2−1/3+Cε −5/2 −13/6+Cε (Nη)N |Rbv | 6 N , (Nη)N |Ωv| 6 N from the previous proof. The contributions of all the Ω error terms as well as the m = 4 terms clearly satisfy the aimed bound (17.30) and thus can be neglected. First, we prove that the quadratic term (as well as all higher order terms) in the Taylor expansion (17.32) is negligible. Here the cancellation between the v and w contributions is essential. It is therefore sufficient to consider 3 2 X −m/2−m0/2h (m) (m0) (m) (m0)i (Nη) N Rbv Rbv − Rbw Rbw m,m0=1 2 2 00 0 instead of [Nηξv] − [Nηξw] . Since F is bounded, any term with m + m > 4 is negligible similarly to the previous by power counting. The contribution of the m = m0 = 1 term to the quadratic term in (17.32) is explicitly given by

00  1 X   1 X   00   F (Ξ) R v R + R v R R 0 v R 0 + R 0 v R 0 − F (Ξ) v ↔ w . E N ki ij jk kj ji ik N k i ij jk k j ji ik E k k0

Since R and Ξ are independent of vij and wij the expectations with respect to vij and wij can be computed. Since their second moments coincide, the two expectations above exactly cancel each other. We are left with the m + m0 = 3 terms, we may assume m = 2 and m0 = 1. We do not expect any more cancellation from the v and w terms, so we consider them separately. A typical such term contributing to (17.30) with all the prefactors is of the form

2 2 −3/2 00  1 X  1 X  N (Nη) N F (Ξ) R v R v R R 0 v R 0 . N ki ij jj ji ik N k i ij jk k k0

The key point is that generically there are at least four off-diagonal elements, Rki, Rik, Rk0i and Rjk0 (we chose the ”worst” term, where the resolvent Rjj in the middle is diagonal). So together with the boundedness of F 00, the direct power counting gives N 2(Nη)2N −3/2(N −1/3+Cε)4 = N −1/6+Cε, which is exactly the precision required in (17.30). In case of each coinciding index (for example k = i), two off-diagonal elements may be ”lost”, but we gain a factor 1/N from the restriction of the summation. So terms with coinciding indices are negligible, as before. Here we discussed only the quadratic term in the Taylor expansion (17.32), but it is easy to see that higher order terms are even smaller, without any cancellation effect. So we are left with estimating the linear term in (17.32), i.e., we need to bound

3 2 X −m/2 0 h (m) (m)i N (Nη) N EF (Ξ) Rbv − Rbw , m=1

148 where we already took into account that the m = 4 term as well as the Ωv error term from (17.31) are negligible by direct power counting. If F 0(Ξ) were deterministic, then the previous argument leading to (17.15) would also prove the N −1/6+Cε bound for the linear term in (17.32). In fact, the exact cancellation (m) (m) between the expectations of the Rbv and Rbw terms for m = 1 and m = 2 relied only on computing the 0 expectation with respect to vij and wij and on the matching of the first and second moments. Since F (Ξ) 0 is independent of vij and wij, this argument remains valid even with the F (Ξ) factor included. We thus need to control the m = 3 term, i.e., show that

2 −3/2 0 (3) −1/6+Cε N (Nη)N EF (Ξ)Rbv 6 N . (17.33) The same bound then would also hold if v is replaced with w. Using the boundedness of F 0 and the argument between (17.24) and (17.28) in the previous proof, we see that for (17.33) it is sufficient to show that

(i) h i 2 −3/2 −3/2 X X 0 (i) (i) −1/6+Cε N (Nη)N N EF (Ξ) Rk` v`ivijvjivijRjk 6 N . (17.34) k `

0 Without the F (Ξ) term, the expectation with respect to v`i would render this term zero, in the generic case 0 when ` 6= j. In order to exploit this effect we need to expand F (Ξ) in the single variable v`i. Fix an index ` 6= j and let Q` be the matrix identical to Q except the (`, i) and (i, `) matrix elements ` ` −1 are zero, and let T = T = (Q − z) be its resolvent. Clearly, T is independent of vij and v`i and the local law (17.19) holds for the matrix elements of T as well. Using the resolvent expansion, we find

h 1 X i Ξ = (Nη) Im T + N −1/2T v T + T v T  + ...  , (17.35) N nn n` `i in ni i`i `n n where the higher order terms carry a prefactor N −1. We Taylor expand F 0(Ξ) around

1 X Ξ := (Nη) Im T , 0 N nn n and we obtain 1 h 1 X i F 0(Ξ) = F 0(Ξ ) + F 00(Ξ )(Nη) Im T v T + T v T + ...  + .... (17.36) 0 2 0 N 3/2 n` `i in ni i`i `n n

0 Since Ξ0 is independent of v`i, when we insert the above expansion for F (Ξ) into (17.34), the contribution 0 of the leading term F (Ξ0) gives zero after taking the expectation for v`i. The two subleading terms in (17.36), that are linear in v`i, have generically two off-diagonal elements, −1/3+Cε Tn` and Tin (or Tni and T`n) which have size N each. So the contribution of this term to (17.34) by simple power counting is

2 −3/2 −3/2 −1/2 −1/3+Cε 4 −1/6+Cε N (Nη)N N NN(Nη)N (N ) 6 N . Similar argument shows that the higher order terms in the Taylor expansion (17.36) as well as higher order terms in the resolvent expansion (17.35) have a negligible contribution. As before, we presented the argument for generic indices; coinciding indices give smaller contributions as we explained before. We leave the details to the reader. This completes the proof of (17.30) and thus the proof of Theorem 17.4.

149 18 Further results and historical notes

Spectral statistics of large random matrices have been studied from many different perspectives. The main focus of this book was on one particular result; the Wigner-Dyson-Mehta bulk universality for Wigner matrices in the average energy sense (Theorem 5.1). In this last section we collect a few other recent results and open questions of this very active research area. We also give references to related results and summarize the history of the development.

18.1 Wigner matrices: bulk universality in different senses Theorem 5.1, which is also valid for complex Hermitian ensembles, settles the average energy version of the Wigner-Dyson-Mehta conjecture for generalized Wigner matrices under the moment assumption (5.6) on the matrix elements. Theorem 5.1 was proved in increasing order of generality in [63], [69], [68] and [70]; see also [130] for the complex Hermitian Wigner ensemble and for a restricted class of real symmetric matrices (see comments about this class later on). The following theorem reduces the moment assumption to 4 + ε moments. The proof in [53] also provides an effective speed of convergence in (18.2). We also point out that the condition (2.6) can be somewhat relaxed, see Corollary 8.3 [55].

Theorem 18.1 (Universality with averaged energy). [53, Theorem 7.2] Suppose that H = (hij) is a complex Hermitian (respectively, real symmetric) generalized Wigner matrix. Suppose that for some constants ε > 0, C > 0, √ 4+ε

E Nhij 6 C. (18.1) n Let n ∈ N and O : R → R be a test function (i.e., compactly supported and continuous). Fix |E0| < 2 and −1+ξ ξ > 0, then with bN = N we have

Z E0+bN Z   dE 1  (n) (n)  α lim dα O(α) n pN − pG,N E + = 0 . (18.2) N→∞ 2b n % (E) N% (E) E0−bN N R sc sc (n) Here %sc is the semicircle law defined in (3.1), pN is the n-point correlation function of the eigenvalue (n) distribution of H (4.20), and pG,N is the n-point correlation function of an N × N GUE (respectively, GOE) matrix. A relatively straightforward consequence of Theorem 18.1 is the average gap universality formulated in the following corollary. The gap distributions of the comparison ensembles, i.e., the Gaussian ensembles, can be explicitly expressed in terms of a Fredholm determinant [37,40]. Recall the notation A, B := Z ∩ [A, B] for any real numbers A < B. J K Corollary 18.2 (Gap universality with averaged label). Let H be as in Theorem 18.1 and O be a test function of n variables. Fix small positive constants ξ, α > 0. Then for any integer j0 ∈ αN, (1 − α)N we have J K 1 X  G  lim E − E O N(λj − λj+1),N(λj − λj+2),...,N(λj − λj+n) = 0. (18.3) N→∞ 2N ξ ξ |j−j0|6N

Here λj’s are the ordered eigenvalues and for brevity we omitted the rescaling factor %(λj) from the argument of O. Moreover, E and EG denote the expectation with respect to the Wigner ensemble H and the Gaussian (GOE or GUE) ensemble, respectively. The next result shows that Theorem 18.1 holds under the same conditions without energy averaging. Theorem 18.3 (Universality at fixed energy). [26] Consider a complex Hermitian (respectively, real sym- metric) generalized Wigner matrix with moment condition (18.1). Then for any E with |E| < 2 we have Z   1  (n) (n)  α lim dα O(α) n pN − pG,N E + = 0 . (18.4) N→∞ n % (E) N% (E) R sc sc

150 The fixed energy universality (18.4) for the β = 2 (complex Hermitian) case was already known earlier, see [57, 66, 131]. This case is exceptional since the Harish-Chandra/Itzykson/Zuber identity allows one to compute correlation functions for Gaussian divisible ensembles. This method relies on an algebraic identity and has no generalization to other symmetry classes. Finally, the gap universality with fixed label asserts that (18.3) holds without averaging. Theorem 18.4 (Gap universality with fixed label). [67, Theorem 2.2] Consider a complex Hermitian (respec- tively, real symmetric) generalized Wigner matrix and assume subexponential decay of the matrix elements instead of (18.1). Then Corollary 18.2 holds without averaging:

 G   lim E − E O N(λj − λj+1),N(λj − λj+2),...,N(λj − λj+n) = 0, (18.5) N→∞ for any j ∈ αN, (1 − α)N with a fixed α > 0. More generally, for any k, m ∈ αN, (1 − α)N we have J K J K   lim EO (N%k)(λk − λk+1), (N%k)(λk − λk+2),..., (N%k)(λk − λk+n) (18.6) N→∞   G − E O (N%m)(λm − λm+1),..., (N%m)(λm − λm+n) = 0, where the local density %k is defined by %k := %sc(γk) with γk from (11.31). The second part (18.6) of this theorem asserts that the gap distribution is not only independent of the specific Wigner ensemble, but it is also universal throughout the bulk spectrum. This is the counterpart of the statement that the appropriately rescaled correlation functions (18.4) have a limit that is independent of E, see (4.38). Prior to [67], universality for a single gap was only achieved in the special case of the Gaussian unitary ensemble (GUE) in [128]; this statement then immediately implies the same result for complex Hermitian Wigner matrices satisfying the four moment matching condition.

18.2 Historical overview of the three-step strategy for bulk universality We summarize the recent history related to the bulk universality results in Section 18.1. The three step strategy first appeared implicitly in [57] in the context of complex Hermitian matrices and was conceptualized for all symmetry classes in [63] by linking it to the DBM. Several key components of this strategy appeared in the later work [63, 64, 69]. We will review the history of Steps 1-3 separately.

History of Step 1. The semicircle law was proved by Wigner for energy windows of order one [137]. Various improvements were made to shrink the spectral windows; in particular, results down to scale N −1/2 were obtained by combining the results of [11] and [79]. The result at the optimal scale, N −1, referred to as the local semicircle law, was established for Wigner matrices in a series of papers [60–62]. The method was based on a self-consistent equation for the Stieltjes transform of the eigenvalues, m(z), and the continuity in the imaginary part of the spectral parameter z. As a by-product, the optimal eigenvector delocalization estimate was proved. In order to deal with the generalized Wigner matrices, one needed to consider the self- N consistent equation of the vector (Gii(z))i=1, the diagonal matrix elements of the Green function, since in this case there is no closed equation for m(z) = N −1 Tr G(z) [68,69]. Together with the fluctuation averaging lemma, this method implied the optimal rigidity estimate of eigenvalues in the bulk in [68] and up to the edges in [70]. Furthermore was obtained for individual matrix elements of the resolvent Gij and not only for its trace. The estimate on Gii also provided a simple alternative proof of the eigenvector delocalization estimate. A comprehensive summary of these results can be found in [55]. Several further extensions concern weaker moment conditions and improvements of (log N)-powers, see e.g. [32,77,129] and references therein.

History of Step 2. The universality for Gaussian divisible ensembles was proved by Johansson [85] for complex Hermitian Wigner ensembles in a certain range of the bulk spectrum. This was the first result beyond

151 exact Gaussian calculations going back to Gaudin, Dyson and Mehta. It was later extended to complex sample covariance matrices on the entire bulk spectrum by Ben Arous and P´ech´e[19]. There were two major restrictions of this method: (i) the Gaussian component was required to be of order one independent of N; (ii) it relies on an explicit formula by Br´ezin-Hikami [30,31] for the correlation functions of eigenvalues which is valid only for Gaussian divisible ensembles with unitary invariant Gaussian component. The size of the Gaussian component was reduced to N −1+ε in [57] by using an improved formula for correlation functions and the local semicircle law from [60–62]. To eliminate the usage of an explicit formula, a conceptual approach for Step 2 via the local ergodicity of Dyson Brownian motion was initiated in [63]. In [64], a general theorem for the bulk universality was formulated which applies to all classical ensembles, i.e., real and complex Wigner matrices, real and complex sample covariance matrices and quaternion Wigner matrices. The relaxation time to local equilibrium proved in these two papers were not optimal; the optimal relaxation time, conjectured by Dyson, was obtained later in [70]. The DBM approach in these papers yields the average gap and hence the average energy universality, while the method via the Br´ezin-Hikami formula gives universality in a fixed energy sense, but is strictly restricted to the complex Hermitian case. The fixed energy universality for all symmetry classes was recently achieved in [26]. The method in this paper was still based on DBM, but the analysis of DBM was very different from the one discussed in this book.

−t/2 History of Step 3. Once universality was established for Gaussian divisible ensembles e H0 + (1 − −t 1/2 G e ) H (here we used the convention used in Chapter 5) for a sufficiently large class of H0 and sufficiently small t, the final step is to approximate the local eigenvalue distribution of a general Wigner matrix by that of a Gaussian divisible one. The first approximation result was obtained via the reversal heat flow in [57] which required smoothness of the distribution of matrix elements. Shortly after, Tao and Vu [130] proved the four moment theorem which asserts that the local statistics of two Wigner ensembles are the same provided that the first four moments of their matrix elements are identical. The third comparison method is the Green function comparison theorem, Theorem 16.1 [63, 68, 69], which was the main content of Section 16. It uses similar moment conditions as those appeared in [130], but it provides detailed information on the matrix elements of Green functions as well. Both [130] and the proof of the Green function comparison theorem [63, 68, 69] use the local semicircle law [60, 62] and the Lindeberg’s idea, which had appeared in the proof of the Wigner semicircle law by Chatterjee [34]. The approach [130] requires additional difficult estimates on level repulsions, while Theorem 16.1 follows directly from the estimates on the matrix elements of the Green function from Step 1 via standard resolvent expansions. Another comparison method reviewed in this book, Theorem 15.2, can be viewed as a variation of the GFT. In summary, the WDM conjecture was first solved in the complex Hermitian case and then for the real symmetric case. In the first case, the bulk universality for Hermitian matrices with smooth distribution was proved in [57]. The smoothness requirement was removed in [130] but the complex was still excluded; this was finally added in [58], see also [66, 131] for related discussions. The result of [130] also implies that bulk universality of symmetric Wigner matrices under the restriction that the first four moments of the matrix elements match those of GOE. The major roadblock to the WDM conjecture for real symmetric matrices was the lack of Step 2 in this case. For symmetric matrices, we do not know any Br´ezin-Hikami type formula, which was the backbone for the Step 2 in the Hermitian case. The resolution of this difficulty was to abandon any explicit formula and to prove Dyson’s conjecture on the relaxation of the Dyson Brownian Motion, see Step 2. Together with the Green function comparison theorem [63], bulk universality was proved for real symmetric matrices without any smoothness requirement. In particular, this result covered the real Bernoulli case [68] that has been a major objective in this subject due to the application to the adjacency matrices for random graphs. The flexibility of the DBM approach also allowed one to establish universality beyond the traditional Wigner ensembles. The first such result was bulk universality for generalized Wigner matrices [69], followed by many other models, see Section 18.6. Finally, the technical condition assumed in all these papers, i.e., that the probability distributions of the matrix elements decay subexponentially, was reduced to the (4 + ε)-moment assumption (18.1) by using the universality of Erd˝os-R´enyi matrices [53]. This yielded the bulk universality as formulated in Theorem 18.1.

152 18.3 Invariant ensembles and log-gases The explicit formula (4.3) for the joint density function of the eigenvalues for the invariant matrix ensemble defined in (4.2) gives rise to a direct statistical physics interpretation: it can be viewed a measure µ(N) on N point particles on the real line. Inspired by the Gibbs formalism, we may write it as

(N) −βNH(λ) µ (dλ) = pN (λ)dλ = const · e dλ, λ = (λ1, . . . , λN ), (18.7)

where N X 1 1 X H(λ) := V (λ ) − log(λ − λ ). 2 j N j i j=1 16i 0, i.e., at any inverse temperature. In the case of invariant ensembles, it is well-known that for V satisfying certain mild conditions the sequence of one-point correlation function, or density, associated with (18.7) has a limit as N → ∞ and the limiting equilibrium density %V (s) can be obtained as the unique minimizer of the functional Z Z Z I(ν) = V (t)ν(t)dt − log |t − s|ν(s)ν(t)dtds. R R R This is the analogue of the Wigner semicircle law for log-gases. 2 We assume that % = %V is supported on a single compact interval, [A, B] and % ∈ C (A, B). Moreover, we assume that V is regular in the sense that % is strictly positive on (A, B) and vanishes as a square root at the endpoints, i.e., √ + %(t) = sA t − A (1 + O (t − A)) , t → A , (18.8) for some constant sA > 0 and a similar condition holds at the upper edge. It is known that these conditions are satisfied if, for example, V is strictly convex. In this case %V satisfies the equation 1 Z % (s)ds V 0(t) = V (18.9) 2 t − s R for any t ∈ (A, B). For the Gaussian case, V (x) = x2/2, the equilibrium density is given by the semicircle law, %V = %sc, see (3.1). (N) Under these conditions, the density of µ converges to %V even locally down to the optimal scale and rigidity analogous to Theorem 11.5 holds:

µ(N)  − 2 +ξ − 1  −N c P |λj − γj,V | > N 3 min{k, N − k + 1} 3 6 e ,

where the quantiles γj,V of the density %V are defined by

j Z γj,V = %V (x)dx, (18.10) N A similarly to (11.31), see Theorem 2.4 [24]. The higher order correlation functions, given in Definition 4.2 and rescaled by the local density %β,V (E) around an energy E in the bulk, exhibit a universal behavior in a sense that they depend only on β but independent of the potential V . Similar result holds at the edges of the support of the density %β,V . This can be viewed as a special realization of the universality hypothesis for the invariant matrix ensemble. Historically, the universality for the special classical cases β = 1, 2, 4 had been extensively studied via orthogonal polynomial methods. These methods rely on explicit expressions of the correlation functions in

153 terms of Fredholm determinants involving orthogonal polynomials. The key advantage of this method is that it yields explicit formulas for the limiting gap distributions or correlation functions. However, these explicit expressions are available only for the specific values β = 1, 2, 4. Within the framework of this approach, the β = 1, 2, 4 cases of Theorem 18.5 below was proved for very general potentials for β = 2 and for analytic potentials with some additional conditions for β = 1, 4. This is a whole subject by itself and we can only refer the readers to some reviews or books [39, 90, 115]. The first general result for nonclassical β values is the following theorem proved in [23–25]. In these works, a new approach to the universality of β ensemble was initiated. It departed from the previously mentioned traditional approach using explicit expressions for the correlation functions. Instead, it relies on comparing the probability measures of the β ensembles to those of the Gaussian ones. This basic idea of comparing two probability measures directly was also used in the later works, [118] and [18], albeit in a different way, where Theorem 18.5 were reproved under different sets of conditions on the potential V .

Theorem 18.5 (Universality with averaged energy). Assume V ∈ C4(R), regular and let β > 0. Consider the (N) (N) (n) β-ensemble µ = µβ,V given in (18.7) with correlation functions pV,N defined analogously to (4.20). For 2 (n) the Gaussian case, V (x) = x /2, the correlation functions are denoted by pG,N . Let E0 ∈ (A, B) lie in the 0 −1+ξ interior of the support of % and similarly let E0 ∈ (−2, 2) be inside the support of %sc. Then for bN = N with some ξ > 0 we have " Z Z E0+bN dE 1 (n)  α  lim dα O(α) n pV,N E + (18.11) N→∞ n 2b %(E) N%(E) R E0−bN N 0 # Z E0+bN 0 dE 1 (n)  0 α  − 0 n pG,N E + 0 = 0 , 0 2bN %sc(E ) N%sc(E ) E0−bN

(N) i.e., the correlation functions of µβ,V averaged around E0 asymptotically coincide with those of the Gaussian case. In particular, they are independent of E0. Universality at fixed energy, i.e., (18.11) without averaging in E, was proven by M. Shcherbina in [118] for any β > 0 under certain regularity assumptions on the potentials. We remark that the gap distribution in the Gaussian case for any β > 0 can be described by an interesting stochastic process [135]. Theorem 18.5 immediately implies gap universality with averaged labels, exactly in the same way as Corollary 18.2 was deduced from Theorem 18.1. To formulate the gap universality with a fixed label, we recall the quantiles γj,V from (18.10) and we set

V %j := %V (γj,V ), and %j := %sc(γj) (18.12)

to be the limiting densities at the j-th quantiles. Let EµV and EG denote the expectation w.r.t. the measure 1 2 µV and its Gaussian counterpart for V (λ) = 2 λ . Theorem 18.6 (Gap universality with fixed label). [67, Theorem 2.3] Consider the setup of Theorem 18.5 and we also assume β > 1. Set some α > 0, then

  µV V V V lim E O (N%k )(λk − λk+1), (N%k )(λk − λk+2),..., (N%k )(λk − λk+n) (18.13) N→∞

  µG − E O (N%m)(λm − λm+1),..., (N%m)(λm − λm+n) = 0

for any k, m ∈ αN, (1 − α)N . In particular, the distribution of the rescaled gaps w.r.t. µV does not depend on the index k Jin the bulk. K

We point out that Theorem 18.5 holds for any β > 0, but Theorem 18.6 requires β > 1. This is partly due to that the De Giorgi-Nash-Moser regularity theory was used in [67]. This restriction was later removed by Bekerman-Figalli-Guionnet [18] under a higher regularity assumption on V and some additional hypothesis.

154 18.4 Universality at the spectral edge So far we stated results for the bulk of the spectrum. Similar results hold at the edge; in this case the “averaged” results are meaningless. In Theorem 17.1 we already presented a universality result for the largest eigenvalue for the Wigner ensemble. That proof easily extends to the joint distribution of finitely many extreme eigenvalues. The following theorem gives a bit more: they control finite dimensional distributions of the first N 1/4 eigenvalues and they also identify the limit distribution. We then give the analogous result for log gases. Theorem 18.7 (Universality at the edge for Wigner matrices). [24] Let H be a generalized Wigner ensemble with subexponentially decaying matrix elements. Fix n ∈ N, κ < 1/4 and a test function O of n variables. Then for any Λ ⊂ 1,N κ with |Λ| = n, we have J K    G  2/3 1/3  −χ E − E O N j (λj − γj) 6 N , j∈Λ

with some χ > 0, where EG is expectation w.r.t. the standard GOE or GUE ensemble depending on the symmetry class of H and γj’s are semicircle quantiles (11.31). The edge universality for Wigner matrices was first proved by Soshinikov [124] under the assumption that the distribution of the matrix elements is symmetric and has finite moments for all orders. Soshnikov used an elaborated moment matching method and the conditions on the moments are not easy to improve. A completely different method based on the Green function comparison theorem was initiated in [70]. This method is more flexible in many ways and it removes essentially all restrictions in previous works. In particular, the optimal moment condition on the distribution of the matrix elements was obtained by Lee and Yin [97]. All these works assume that the variances of the matrix elements are identical. The main point of Theorem 18.7 is to consider generalized Wigner matrices, i.e., matrices with non-constant variances. It was proved in [70, Theorem 17.1] that the edge statistics for any generalized Wigner matrix coincide with those of a generalized Gaussian Wigner matrix with the same variances, but it was not shown that the statistics are independent of the variances themselves. Theorem 18.7 provides this missing step and thus it proves the edge universality for the generalized Wigner matrices.

4 Theorem 18.8 (Universality at the edge for log-gases). [24] Let β > 1 and V (resp. Ve) be in C (R), regular such that the equilibrium density % (resp. % ) is supported on a single interval [A, B] (resp. [A, B]). V Ve e e Without loss of generality we assume that for both densities (18.8) holds with A = 0 and with the same κ constant sA. Fix n ∈ N, κ < 2/5. Then for any Λ ⊂ 1,N with |Λ| = n, we have J K    µV µ 2/3 1/3 −χ (E − E Ve )O N j (λj − γj) 6 N (18.14) j∈Λ

with some χ > 0. Here γj are the quantiles w.r.t. the density %V (18.10). The history of edge universality runs in parallel to the bulk one in the sense that most bulk methods were extended to the edge case. Similar to the bulk case, the edge universality was first proved via orthogonal polynomial methods for the classical values of β = 1, 2, 4 in [38,42,110,116]. The first edge universality results were given independently in [24] (valid for general potentials and β > 1) and, with a completely different method, in [91] (valid for strictly convex potentials and any β > 0). The later work [18] also covered the edge case and proved universality to β > 0 under some higher regularity assumption on V and an additional condition that can be checked for convex V . Some open questions concern the universal behavior at the non-regular edges where even the scaling may change.

18.5 Eigenvectors Universality questions have been traditionally formulated for eigenvectors, but they naturally extend to eigenvectors as well. They are closely related to two different important physical phenomena: delocalization and quantum ergodicity.

155 The concept of delocalization stems from the basic theory of random Schr¨odingeroperators describing a single quantum particle subject to a random potential V in Rd. If the potential decays at infinity, then the operator −∆ + V acting on L2(Rd) has a discrete pure point spectrum with L2-eigenfunctions, localized or bound states, in the low energy regime. In the high energy regime it has absolutely continuous spectrum with bounded but not L2-normalizable generalized eigenfunctions. The latter are also called delocalized or extended states and they correspond to scattering and conductance. If the potential V is a stationary, ergodic random field, then celebrated Anderson localization [9] occurs: at high disorder a dense pure point spectrum with exponentially decaying eigenfunctions appears even in the energy regime that would classically correspond to scattering states. The localized regime has been mathematically well understood for both the high disorder regime and the regime of low density of states (near the spectral edges) since the fundamental works of Fr¨ohlich and Spencer [73] and later by Aizenman and Molchanov [1]. The local spectral statistics is Poisson [107], reflecting the intuition that exponentially decaying eigenfunctions typically have no overlap, so their energies are independent. The low disorder regime, however, is mathematically wide open. In d = 3 or higher dimensions it is con- jectured that −∆+V has absolutely continuous spectrum (away from the spectral edges), the corresponding eigenfunctions are delocalized and the local eigenvalue statistics of the restriction of −∆ + V to a large finite box follow the GOE local statistics. In other words, a phase transition is believed to occur as the strength of the disorder varies; this is called the Anderson metal-insulator transition. Currently even the existence of this extended states regime is not known; this is considered one of the most important open question in . Only the Bethe lattice (regular tree) is understood [2, 86] but this model exhibits Poisson local statistics [3] due to its exponentially growing boundary. The strong link between local eigenvalue statistics and delocalization for random Schr¨odingeroperators naturally raises the question about the structure of the eigenvectors for an N × N Wigner matrix H. 2 Complete delocalization in this context would mean that any ` normalized eigenvector u = (u1, . . . , uN ) of 2 −1 H is substantially supported on each coordinate; ideally |ui| ≈ N for each i. Due to fluctuations, this can hold only in a somewhat weaker sense. The first type of results provide an upper bound on the `∞ norm in very high probability: 2 Theorem 18.9. Let uα, α = 1, 2,...,N, be the ` -normalized eigenvectors of a generalized Wigner matrix H satisfying the moment condition (5.6). Then for any (small) ε > 0 and (large) D > 0 we have  N ε  ∃α : ku k N −D. (18.15) P α ∞ > N 6 Under stronger decay conditions on H, the N ε tolerance factor may be changed to a log-power and the probability estimate improved to sub exponential. This theorem directly follows from the local semicircle law, Theorem 6.7. Indeed, by spectral decomposition

2 X |uα(i)| Im G (z) = η η−1|u (i)|2, z = E + iη, ii (E − λ )2 + η2 > α α α

ε −1 where we chose E in the η-vicinity of λα. Since from (6.31) the left hand side is bounded by N (Nη) with very high probability, we obtain (18.15). Under stronger decay conditions on H, the local semicircle law can be slightly improved and thus the N ε tolerance factor may be changed to a log-power and the probability estimate improved to sub exponential. Although this bound prevents concentration of eigenvectors onto a set of size less than N 1−ε, it does not −1/2 imply the “complete flatness” of eigenvectors in the sense that |uα(i)| ≈ N . Note that by a complete U(N) or O(N) symmetry, the eigenfunctions of a GUE or GOE matrix are distributed according to the Haar measure on the N-dimensional unit sphere, moreover the distribution of a particular coordinate uα(i) of an fixed eigenvector uα is asymptotically Gaussian. The same result holds for generalized Wigner matrices by Theorem 1.2 of [28] (a similar result was also obtained in [87, 132] under the condition that the first four moments match with those of the Gaussian):

Theorem 18.10 (Asymptotic normality of eigenvectors). Let uα be the eigenvectors of a generalized Wigner matrix with moment condition (5.6). Then there is a δ > 0 such that for any m ∈ N and I ⊂ TN :=

156 1/4 1−δ 1−δ 1/4 N 1√,N ∪ N ,N − N ∪ N − N ,N with |I| = m and for any unit vector q ∈ R , the vector J K J  m K J K N(|hq, uαi|)α∈I ∈ R is asymptotically normal in the sense of finite moments. Moreover, the study of eigenvectors leads to another fundamental problem of mathematical physics: quantum ergodicity. Recall the quantum ergodicity theorem (Shnirel’man [122], Colin de Verdi`ere[35] and Zelditch [140]) asserts that “most” eigenfunctions for the Laplacian on a compact Riemannian manifold with ergodic geodesic flow are completely flat. For d-regular graphs under certain assumptions on the injectivity radius and spectral gap of the adjacency matrices, similar results were proved for eigenvectors of the adjacency matrices [7]. A stronger notion of quantum ergodicity, the quantum unique ergodicity (QUE) proposed by Rudnick-Sarnak [114] demands that all high energy eigenfunctions become completely flat, and it supposedly holds for negatively curved compact Riemannian manifolds. One case for which QUE was rigorously proved concerns arithmetic surfaces, thanks to tools from and ergodic theory on homogeneous spaces [81, 82, 100]. For Wigner matrices, a probabilistic version of QUE was settled in Corollary 1.4 of [28]:

Theorem 18.11 (Quantum unique ergodicity for Wigner matrices). Let uα be the eigenvectors of a generalized Wigner matrix with moment condition (5.6). Then there exists ε > 0 such that for any deterministic 1 6 α 6 N and I ⊂ 1,N , for any δ > 0 we have J K ! X |I| N −ε |u (i)|2 − δ . (18.16) P α N > 6 δ2 i∈I

18.6 General mean field models Wigner matrices and their generalizations in Definition 2.1 are the most prominent mean field random matrix ensembles, but there are several other models of interest. Here we give a partial list. More detailed account of the results and citations can be found later on in the few representative references. The three step strategy opens up a path to investigate local spectral statistics of a large class of matrices. In many examples the limiting density of the eigenvalues is not the semicircle any more, so the Dyson Brownian motion is not in global equilibrium and Step 2 is more complicated. In the spirit of Dyson’s conjecture, however, gap universality already follows from local equilibration of DBM. Indeed, a local version of the DBM was used as a basic tool in [25]. A recent development to compare local and global dynamics of DBM was implemented by a coupling and homogenization argument in [26]. It turns out that as long as the initial condition of the DBM is regular on a certain scale, the local gap statistics reaches its equilibrium, i.e., the Wigner-Dyson-Mehta distribition, on the corresponding time scale [92] (see also [65] for a similar result). This concept reduces the proof of universality to a proof of the corresponding local law. One natural class of mean field ensembles are the deformed Wigner matrices; these are of the form H = V + W , where V is a diagonal matrix and W is a standard Wigner matrix. In fact, after a trivial rescaling, these matrices may be viewed as instances of the DBM with initial condition H0 = V , recalling −t/2 −t 1/2 that the law of the DBM at time t is given by Ht = e V + (1 − e ) W . Assuming that the diagonal elements of V have a limiting density %V , the limiting density of Ht is given by the free convolution of %V with the semicircle density. The corresponding local law was proved in [93] followed by the proof of bulk universality in [96] and the edge universality [95, 96]. The free convolution of two arbitrary densities, %α  %β, naturally appear in random matrix theory as the limiting density of the sum of two asymptotically free matrices A and B whose limiting densities are ∗ given by %α and %β. Moreover, for any deterministic A and B, the matrices A and UBU are asymptotically free [136], where U is a Haar distributed unitary matrix. Thus the eigenvalue density of A + UBU ∗ is given by the free convolution. The corresponding local law on the optimal scale in the bulk was given in [15]. Another way to generate a matrix ensemble with a limiting density different from the semicircle law is to consider Wigner-type matrices that are further generalizations of the universal Wigner matrices from Definition 2.1. These are complex Hermitian or real symmetric matrices with centered independent (up to symmetry) matrix elements but without the condition that the sum of the variances in each row is constant

157 (2.5). The limiting density is determined by the matrix of variances S by solving the corresponding Dyson- Schwinger equation. Depending on S, the density may exhibit square root singularity similar to the edge of the semicircle law or a cubic root singularity [5]. The corresponding optimal local law and bulk universality were proved in [4]. Different complications arise for adjacency matrices of sparse graphs. Each edge of the Erd˝os-R´enyi graph is chosen independently with probability p, so it is a mean field model with a semicircle density, but a typical realization of the matrix contains many zero elements if p  1. Optimal local law was proved in [56], bulk universality for p  N −1/3 in [53] and for p  N −1 in [83]. Another prominent model of random graphs is the uniform measure on d-regular graphs. The elements of the are weakly dependent since their sum in every row is d. For d  1 the limiting density is the semicircle law and optimal local law was obtained in [17]. Bulk universality was shown in the regime 1  d  N 2/3 in [16], Sample covariance matrices (2.9) and especially their deformations with finite rank deterministic matrices play an important role in statistics. The first application of the three step strategy to prove bulk universality was presented in [64]. The edge universality was achieved in [89,94]. For applications in principal component analysis, the main focus is on the extreme eigenvalues and the outliers whose detailed analysis was given in [22]. Here we have listed references for papers using methods closely related to this book. There are many references in this subject which we are unable to include here.

18.7 Beyond mean field models: band matrices Wigner predicted that universality should hold for any large quantum system, described by a Hamiltonian H, of sufficient complexity. After discretization of the underlying physical state space, the Hamilton operator is given by a matrix H whose matrix elements Hxy describe the quantum transition rate from state x to y. Generalized Wigner matrices correspond to a mean-field situation where transition from any state x to y is possible. In contrast, random Schr¨odingeroperators of the form −∆ + V defined on Zd allow transition only to neighboring lattice sites with a random on-site interaction, so they are prototypes of disordered quantum systems with a non-trivial spatial structure. As explained in Section 18.5, from the point of view of Anderson phase transition, generalized Wigner matrices are always in the delocalized regime. Random Schr¨odingeroperators in d = 1 dimension, or more general random matrices H representing one dimensional Hamiltonians with short range random hoppings, are in the localized regime. It is therefore natural to vary the range of interaction to detect the phase transition. One popular model interpolating between the Wigner matrices and random Schr¨odingeroperator is the random , see Example 6.1. In this model the physical state space, that labels the matrix elements, is equipped with a distance. Band matrices are characterized by the property that Hij becomes negligible if dist(i, j) exceeds a certain parameter, W , called the band width. A fundamental conjecture [75] states that the local spectral statistics of a band matrix H are governed by random matrix statistics for large W and by Poisson statistics for small W . The transition is conjectured√ to be sharp [75, 125] for the√ band matrices in one spatial dimension around the critical value W = N. In other words, if W  N, we expect the universality results of [26,57,63,67] to hold.√ Furthermore, the eigenvectors of H are expected to be completely delocalized in this range. For W  N, one expects that the eigenvectors are exponentially localized. This is the analogue of the celebrated√ Anderson metal-insulator transition for random band matrices. The only rigorous work indicating the N threshold concern the second mixed moments of the characteristic polynomial for a special class of Gaussian band matrices [119, 121]. The localization length for band matrices in one spatial dimension was recently investigated in numerous works. For general distribution of the matrix entries, eigenstates were proved to be localized [115] for W  N 1/8, and delocalization of most eigenvectors in a certain averaged sense holds for W  N 6/7 [50,51], improved to W  N 4/5 [54]. The Green’s function (H − z)−1 was controlled down to the scale Im z  W −1 in [69], implying a lower bound of order W for the localization length of all eigenvectors. When the entries are Gaussian with some specific covariance profiles, supersymmetry techniques are applicable to obtain stronger results. This approach has first been developed by physicists (see [49] for an overview); the rigorous analysis was initiated by Spencer (see [125] for an overview), with an accurate estimate on the expected density of

158 states on arbitrarily short scales for a three-dimensional band matrix ensemble in [44]. More recent works include universality for W > cN [120], and the control of the Green’s function down to the optimal scale Im z  N −1, hence delocalization in a strong sense for all eigenvectors, when W  N 6/7 [14] with first four moments matching the Gaussians ones (both results require a block structure and hold in part of the bulk spectrum). While delocalization and Wigner-Dyson-Mehta spectral statistics are expected to occur simultaneously, there is no rigorous argument directly linking them. The Dyson Brownian Motion, the cornerstone of the three step strategy, proves universality for matrices where each entry has a non trivial Gaussian component and the comparison ideas require that second moments match exactly. Therefore this approach cannot be applied to matrices with many zero entries. However, a combination of the quantum unique ergodicity with a mean field reduction strategy in [27] yields Wigner-Dyson-Mehta bulk universality for a large class of band matrices with general distribution in the large band width regime W > cN. In contrast to the bulk, universality at the spectral edge is much better understood: extreme eigenvalues follow the Tracy-Widom law for W  N 5/6, an essentially optimal condition [123].

159 References

[1] M. Aizenman and S. Molchanov. Localization at large disorder and at extreme energies: an elementary derivation. Comm. Math. Phys., 157(2):245–278, 1993.

[2] M. Aizenman, R. Sims, and S. Warzel. Absolutely continuous spectra of quantum tree graphs with weak disorder. Comm. Math. Phys., 264(2):371–389, 2006.

[3] M. Aizenman and S. Warzel. The canopy graph and level statistics for random operators on trees. Math. Phys. Anal. Geom., 9(4):291–333 (2007), 2006.

[4] O. Ajanki, L. Erd˝os, and T. Kr¨uger. Quadratic vector equations on complex upper half-plane. arXiv:1506.05095, June 2015.

[5] O. Ajanki, L. Erd˝os,and T. Kr¨uger.Universality for general Wigner-type matrices. arXiv:1506.05098, June 2015.

[6] G. Akemann, J. Baik, and P. Di Francesco, editors. The Oxford handbook of random matrix theory. Oxford University Press, Oxford, 2011.

[7] N. Anantharaman and E. Le Masson. Quantum ergodicity on large regular graphs. Duke Math. J., 164(4):723–765, 2015.

[8] G. W. Anderson, A. Guionnet, and O. Zeitouni. An introduction to random matrices, volume 118 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2010.

[9] P. W. Anderson. Absence of diffusion in certain random lattices. Phys. Rev., 109:1492–1505, Mar 1958.

[10] Z. D. Bai. Convergence rate of expected spectral distributions of large random matrices. I. Wigner matrices. Ann. Probab., 21(2):625–648, 1993.

[11] Z. D. Bai, B. Miao, and J. Tsay. Convergence rates of the spectral distributions of large Wigner matrices. Int. Math. J., 1(1):65–90, 2002.

[12] Z. D. Bai and Y. Q. Yin. Limit of the smallest eigenvalue of a large-dimensional sample covariance matrix. Ann. Probab., 21(3):1275–1294, 1993.

[13] D. Bakry and M. Emery.´ Diffusions hypercontractives. In S´eminaire de probabilit´es,XIX, 1983/84, volume 1123 of Lecture Notes in Math., pages 177–206. Springer, Berlin, 1985.

[14] Z. Bao and L. Erd˝os.Delocalization for a class of random block band matrices. Probab. Theory Related Fields, pages 1–104, 2016.

[15] Z. Bao, L. Erd˝os,and K. Schnelli. Local law of addition of random matrices on optimal scale. arXiv:1509.07080, September 2015.

[16] R. Bauerschmidt, J. Huang, A. Knowles, and H.-T. Yau. Bulk eigenvalue statistics for random regular graphs. arXiv:1505.06700, May 2015.

[17] R. Bauerschmidt, A. Knowles, and H.-T. Yau. Local semicircle law for random regular graphs. arXiv:1503.08702, March 2015.

[18] F. Bekerman, A. Figalli, and A. Guionnet. Transport maps for β-matrix models and universality. Comm. Math. Phys., 338(2):589–619, 2015.

[19] G. Ben Arous and S. P´ech´e. Universality of local eigenvalue statistics for some sample covariance matrices. Comm. Pure Appl. Math., 58(10):1316–1357, 2005.

160 [20] P. Bleher and A. Its. Semiclassical asymptotics of orthogonal polynomials, Riemann-Hilbert problem, and universality in the matrix model. Ann. of Math. (2), 150(1):185–266, 1999.

[21] A. Bloemendal, L. Erd˝os,A. Knowles, H.-T. Yau, and J. Yin. Isotropic local laws for sample covariance and generalized Wigner matrices. Electron. J. Probab., 19:no. 33, 53, 2014.

[22] A. Bloemendal, A. Knowles, H.-T. Yau, and J. Yin. On the principal components of sample covariance matrices. Probab. Theory Related Fields, 164(1-2):459–552, 2016.

[23] P. Bourgade, L. Erd˝os, and H.-T. Yau. Bulk universality of general β-ensembles with non-convex potential. J. Math. Phys., 53(9):095221, 19, 2012.

[24] P. Bourgade, L. Erd˝os,and H.-T. Yau. Edge universality of beta ensembles. Comm. Math. Phys., 332(1):261–353, 2014.

[25] P. Bourgade, L. Erd˝os, and H.-T. Yau. Universality of general β-ensembles. Duke Math. J., 163(6):1127–1190, 2014.

[26] P. Bourgade, L. Erd˝os,H.-T. Yau, and J. Yin. Fixed energy universality for generalized Wigner matrices. Comm. Pure Appl. Math., pages 1–67, 2015.

[27] P. Bourgade, L. Erd˝os,H.-T. Yau, and J. Yin. Universality for a class of random band matrices. arXiv:1602.02312, February 2016.

[28] P. Bourgade and H.-T. Yau. The Eigenvector Moment Flow and local Quantum Unique Ergodicity. arXiv:1312.1301, December 2013.

[29] H. Brascamp and E. H. Lieb. Best constants in Young’s inequality, its converse, and its generalization to more than three functions. Adv. Math., 20(2):151–173, 1976.

[30] E. Br´ezinand S. Hikami. Correlations of nearby levels induced by a random potential. Nuclear Phys. B, 479(3):697–706, 1996.

[31] E. Br´ezinand S. Hikami. Spectral form factor in a random matrix theory. Phys. Rev. E, 55(4):4067– 4083, 1997.

[32] C. Cacciapuoti, A. Maltsev, and B. Schlein. Bounds for the Stieltjes transform and the density of states of Wigner matrices. Probab. Theory Related Fields, 163(1-2):1–59, 2015.

[33] E. A. Carlen and M. Loss. Optimal smoothing and decay estimates for viscously damped conservation laws, with applications to the 2-D Navier-Stokes equation. Duke Math. J., 81(1):135–157 (1996), 1995. A celebration of John F. Nash, Jr.

[34] T. Chen. Localization lengths and Boltzmann limit for the Anderson model at small disorders in dimension 3. J. Stat. Phys., 120(1-2):279–337, 2005.

[35] Y. Colin de Verdi`ere.Ergodicit´eet fonctions propres du laplacien. Comm. Math. Phys., 102(3):497–502, 1985.

[36] E. B. Davies. The functional calculus. J. London Math. Soc., 52(1):166–176, 1995.

[37] P. Deift. Orthogonal polynomials and random matrices: a Riemann-Hilbert approach, volume 3 of Courant Lecture Notes in Mathematics. New York University, Courant Institute of Mathematical Sciences, New York; American Mathematical Society, Providence, RI, 1999.

[38] P. Deift and D. Gioev. Universality at the edge of the spectrum for unitary, orthogonal, and symplectic ensembles of random matrices. Comm. Pure Appl. Math., 60(6):867–910, 2007.

161 [39] P. Deift and D. Gioev. Universality in random matrix theory for orthogonal and symplectic ensembles. Int. Math. Res. Pap., 2007(2):Art. ID rpm004, 116, 2007.

[40] P. Deift and D. Gioev. Random matrix theory: invariant ensembles and universality, volume 18 of Courant Lecture Notes in Mathematics. Courant Institute of Mathematical Sciences, New York; American Mathematical Society, Providence, RI, 2009.

[41] P. Deift, T. Kriecherbauer, K. T-R McLaughlin, S. Venakides, and X. Zhou. Strong asymptotics of orthogonal polynomials with respect to exponential weights. Comm. Pure Appl. Math., 52(12):1491– 1552, 1999.

[42] P. Deift, T. Kriecherbauer, K. T.-R. McLaughlin, S. Venakides, and X. Zhou. Uniform asymptotics for polynomials orthogonal with respect to varying exponential weights and applications to universality questions in random matrix theory. Comm. Pure Appl. Math., 52(11):1335–1425, 1999.

[43] J.-D. Deuschel and D. W. Stroock. Large deviations, volume 137 of Pure and Applied Mathematics. Academic Press, Inc., Boston, MA, 1989.

[44] M. Disertori, H. Pinson, and T. Spencer. Density of states for random band matrices. Comm. Math. Phys., 232(1):83–124, 2002.

[45] F. J. Dyson. A Brownian-motion model for the eigenvalues of a random matrix. J. Math. Phys., 3:1191–1198, 1962.

[46] F. J. Dyson. Statistical theory of the energy levels of complex systems. i, ii, and iii. J. Math. Phys, 3:140–156, 157–165, 166–175, 1962.

[47] F. J. Dyson. Correlations between eigenvalues of a random matrix. Comm. Math. Phys., 19:235–250, 1970.

[48] D. E. Edmunds and H. Triebel. Sharp Sobolev embeddings and related Hardy inequalities: the critical case. Math. Nachr., 207:79–92, 1999.

[49] K. Efetov. Supersymmetry in disorder and chaos. Cambridge University Press, Cambridge, 1997.

[50] L. Erd˝osand A. Knowles. Quantum diffusion and delocalization for band matrices with general distribution. Ann. Henri Poincar´e, 12(7):1227–1319, 2011.

[51] L. Erd˝osand A. Knowles. Quantum diffusion and eigenfunction delocalization in a random band matrix model. Comm. Math. Phys., 303(2):509–554, 2011.

[52] L. Erd˝os,A. Knowles, and H.-T. Yau. Averaging fluctuations in resolvents of random band matrices. Ann. Henri Poincar´e, 14(8):1837–1926, 2013.

[53] L. Erd˝os,A. Knowles, H.-T. Yau, and J. Yin. Spectral statistics of Erd˝os-R´enyi Graphs II: Eigenvalue spacing and the extreme eigenvalues. Comm. Math. Phys., 314(3):587–640, 2012.

[54] L. Erd˝os,A. Knowles, H.-T. Yau, and J. Yin. Delocalization and diffusion profile for random band matrices. Comm. Math. Phys., 323(1):367–416, 2013.

[55] L. Erd˝os,A. Knowles, H.-T. Yau, and J. Yin. The local semicircle law for a general class of random matrices. Electron. J. Probab., 18:no. 59, 58, 2013.

[56] L. Erd˝os,A. Knowles, H.-T. Yau, and J. Yin. Spectral statistics of Erd˝os-R´enyi graphs I: Local semicircle law. Ann. Probab., 41(3B):2279–2375, 2013.

[57] L. Erd˝os,S. P´ech´e, J. A. Ram´ırez,B. Schlein, and H.-T. Yau. Bulk universality for Wigner matrices. Comm. Pure Appl. Math., 63(7):895–925, 2010.

162 [58] L. Erd˝os,J. Ram´ırez,B. Schlein, T. Tao, V. Vu, and H.-T. Yau. Bulk universality for Wigner Hermitian matrices with subexponential decay. Math. Res. Lett., 17(4):667–674, 2010. [59] L. Erd˝os,J. A. Ram´ırez,B. Schlein, and H.-T. Yau. Universality of sine-kernel for Wigner matrices with a small Gaussian perturbation. Electron. J. Probab., 15:no. 18, 526–603, 2010. [60] L. Erd˝os,B. Schlein, and H.-T. Yau. Local semicircle law and complete delocalization for Wigner random matrices. Comm. Math. Phys., 287(2):641–655, 2009. [61] L. Erd˝os,B. Schlein, and H.-T. Yau. Semicircle law on short scales and delocalization of eigenvectors for Wigner random matrices. Ann. Probab., 37(3):815–852, 2009. [62] L. Erd˝os,B. Schlein, and H.-T. Yau. Wegner estimate and level repulsion for Wigner random matrices. Int. Math. Res. Not., 2010(3):436–479, 2010. [63] L. Erd˝os,B. Schlein, and H.-T. Yau. Universality of random matrices and local relaxation flow. Invent. Math., 185(1):75–119, 2011. [64] L. Erd˝os,B. Schlein, H.-T. Yau, and J. Yin. The local relaxation flow approach to universality of the local statistics for random matrices. Ann. Inst. H. Poincar´eProbab. Statist., 48(1):1–46, 2012. [65] L. Erd˝osand K. Schnelli. Universality for Random Matrix Flows with Time-dependent Density. arXiv:1504.00650, April 2015. [66] L. Erd˝osand H.-T. Yau. A comment on the Wigner-Dyson-Mehta bulk universality conjecture for Wigner matrices. Electron. J. Probab., 17:no. 28, 5, 2012. [67] L. Erd˝osand H.-T. Yau. Gap universality of generalized Wigner and β-ensembles. J. Eur. Math. Soc. (JEMS), 17(8):1927–2036, 2015. [68] L. Erd˝os,H.-T. Yau, and J. Yin. Universality for generalized Wigner matrices with Bernoulli distri- bution. J. Comb., 2(1):15–81, 2011. [69] L. Erd˝os,H.-T. Yau, and J. Yin. Bulk universality for generalized Wigner matrices. Probab. Theory Related Fields, 154(1-2):341–407, 2012. [70] L. Erd˝os,H.-T. Yau, and J. Yin. Rigidity of eigenvalues of generalized Wigner matrices. Adv. Math., 229(3):1435–1515, 2012. [71] A. S. Fokas, A. R. It.s, and A. V. Kitaev. The isomonodromy approach to matrix models in 2D . Comm. Math. Phys., 147(2):395–430, 1992. [72] P. J. Forrester. Log-gases and random matrices, volume 34 of London Mathematical Society Monographs Series. Princeton University Press, Princeton, NJ, 2010. [73] J. Fr¨ohlich and T. Spencer. Absence of diffusion in the Anderson tight binding model for large disorder or low energy. Comm. Math. Phys., 88(2):151–184, 1983. [74] M. Fukushima, Y. Oshima,¯ and M. Takeda. Dirichlet forms and symmetric Markov processes, volume 19 of de Gruyter Studies in Mathematics. Walter de Gruyter & Co., Berlin, 1994. [75] Y. V. Fyodorov and A. D. Mirlin. Scaling properties of localization in random band matrices: a σ-model approach. Phys. Rev. Lett., 67(18):2405–2409, 1991. [76] V. L. Girko. Asymptotics of the distribution of the spectrum of random matrices. Uspekhi Mat. Nauk, 44(4(268)):7–34, 256, 1989. [77] F. G¨otze,A. Naumov, and A. Tikhomirov. Local semicircle law under moment conditions. Part I: The Stieltjes transform. arXiv:1510.07350, October 2015.

163 [78] L. Gross. Logarithmic sobolev inequalities. Amer. J. Math., 5:1061–1083, 1975.

[79] A. Guionnet and O. Zeitouni. Concentration of the spectral measure for large matrices. Electron. Comm. Probab., 5:119–136 (electronic), 2000.

[80] B. Helffer and J. Sj¨ostrand.On the correlation for Kac-like models in the convex case. J. Stat. Phys., 74(1-2):349–409, 1994.

[81] R. Holowinsky. Sieving for mass equidistribution. Ann. of Math. (2), 172(2):1499–1516, 2010.

[82] R. Holowinsky and K. Soundararajan. Mass equidistribution for Hecke eigenforms. Ann. of Math. (2), 172(2):1517–1528, 2010.

[83] J. Huang, B. Landon, and H.-T. Yau. Bulk universality of sparse random matrices. J. Math. Phys., 56(12):123301, 19, 2015.

[84] C. Itzykson and J. B. Zuber. The planar approximation. II. J. Math. Phys., 21(3):411–421, 1980.

[85] K. Johansson. Universality of the local spacing distribution in certain ensembles of Hermitian Wigner matrices. Comm. Math. Phys., 215(3):683–705, 2001.

[86] A. Klein. Absolutely continuous spectrum in the Anderson model on the Bethe lattice. Math. Res. Lett., 1(4):399–407, 1994.

[87] A. Knowles and J. Yin. Eigenvector distribution of Wigner matrices. Probab. Theory Related Fields, 155(3-4):543–582, 2013.

[88] A. Knowles and J. Yin. The isotropic semicircle law and deformation of Wigner matrices. Comm. Pure Appl. Math., 66(11):1663–1750, 2013.

[89] A. Knowles and J. Yin. Anisotropic local laws for random matrices. arXiv:1410.3516, October 2014.

[90] T. Kriecherbauer and M. Shcherbina. Fluctuations of eigenvalues of matrix models and their applica- tions. arXiv:1003.6121, March 2010.

[91] M. Krishnapur, B. Rider, and B. Vir´ag. Universality of the stochastic airy operator. Comm. Pure Appl. Math., 69(1):145–199, 2016.

[92] B. Landon and H.-T. Yau. Convergence of local statistics of Dyson Brownian motion. arXiv:1504.03605, April 2015.

[93] J. O. Lee and K. Schnelli. Local deformed semicircle law and complete delocalization for Wigner matrices with random potential. J. Math. Phys., 54(10):103504, 62, 2013.

[94] J. O. Lee and K. Schnelli. Tracy-Widom Distribution for the Largest Eigenvalue of Real Sample Covariance Matrices with General Population. arXiv:1409.4979, September 2014.

[95] J. O. Lee and K. Schnelli. Edge universality for deformed Wigner matrices. Rev. Math. Phys., 27(8):1550018, 94, 2015.

[96] J. O. Lee, K. Schnelli, B. Stetler, and H.-T. Yau. Bulk universality for deformed Wigner matrices. Ann. Probab., 44(3):2349–2425, 2016.

[97] J. O. Lee and J. Yin. A necessary and sufficient condition for edge universality of Wigner matrices. Duke Math. J., 163(1):117–173, 2014.

[98] E. Levin and D. S. Lubinsky. Universality limits in the bulk for varying measures. Adv. Math., 219(3):743–779, 2008.

164 [99] E. H. Lieb and M. Loss. Analysis, volume 14 of Graduate Studies in Mathematics. American Mathe- matical Society, Providence, RI, second edition, 2001.

[100] E. Lindenstrauss. Invariant measures and arithmetic quantum unique ergodicity. Ann. of Math. (2), 163(1):165–219, 2006.

[101] D. S. Lubinsky. A new approach to universality limits involving orthogonal polynomials. Ann. of Math. (2), 170(2):915–939, 2009.

[102] A. Lytova and L. Pastur. Central limit theorem for linear eigenvalue statistics of random matrices with independent entries. Ann. Probab., 37(5):1778–1840, 2009.

[103] V. A. Marˇcenko and L. A. Pastur. Distribution of eigenvalues in certain sets of random matrices. Mat. Sb. (N.S.), 72 (114):507–536, 1967.

[104] M. L. Mehta. A note on correlations between eigenvalues of a random matrix. Comm. Math. Phys., 20:245–250, 1971.

[105] M. L. Mehta. Random matrices. Academic Press, Inc., Boston, MA, second edition, 1991.

[106] M. L. Mehta and M. Gaudin. On the density of eigenvalues of a random matrix. Nuclear Phys. B, 18:420–427, 1960.

[107] N. Minami. Local fluctuation of the spectrum of a multidimensional Anderson tight binding model. Comm. Math. Phys., 177(3):709–725, 1996.

[108] A. Naddaf and T. Spencer. On homogenization and scaling limit of some gradient perturbations of a massless free field. Comm. Math. Phys., 183(1):55–84, 1997.

[109] L. Pastur. Spectra of random selfadjoint operators. Uspekhi Mat. Nauk, 28(1(169)):3–64, 1973.

[110] L. Pastur and M. Shcherbina. On the edge universality of the local eigenvalue statistics of matrix models. Mat. Fiz. Anal. Geom., 10(3):335–365, 2003.

[111] L. Pastur and M. Shcherbina. Bulk universality and related properties of Hermitian matrix models. J. Stat. Phys., 130(2):205–250, 2008.

[112] L. Pastur and M. Shcherbina. Eigenvalue distribution of large random matrices, volume 171 of Math- ematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2011.

[113] N. S. Pillai and J. Yin. Universality of covariance matrices. Ann. Appl. Probab., 24(3):935–1001, 2014.

[114] Z. Rudnick and P. Sarnak. The behaviour of eigenstates of arithmetic hyperbolic manifolds. Comm. Math. Phys., 161(1):195–213, 1994.

[115] J. Schenker. Eigenvector localization for random band matrices with power law band width. Comm. Math. Phys., 290(3):1065–1097, 2009.

[116] M. Shcherbina. Edge universality for orthogonal ensembles of random matrices. J. Stat. Phys., 136(1):35–50, 2009.

[117] M. Shcherbina. Central limit theorem for linear eigenvalue statistics of the Wigner and sample covari- ance random matrices. Zh. Mat. Fiz. Anal. Geom., 7(2):176–192, 2011.

[118] M. Shcherbina. Change of variables as a method to study general β-models: bulk universality. J. Math. Phys., 55(4):043504, 23, 2014.

[119] T. Shcherbina. On the second mixed moment of the characteristic polynomials of 1D band matrices. Comm. Math. Phys., 328(1):45–82, 2014.

165 [120] T. Shcherbina. Universality of the local regime for the block band matrices with a finite number of blocks. J. Stat. Phys., 155(3):466–499, 2014. [121] T. Shcherbina. Universality of the second mixed moment of the characteristic polynomials of the 1D band matrices: real symmetric case. J. Math. Phys., 56(6):063303, 23, 2015. [122] A. I. Snirelˇ 0man. Ergodic properties of eigenfunctions. Uspekhi Mat. Nauk, 29(6(180)):181–182, 1974. [123] S. Sodin. The spectral edge of some random band matrices. Ann. of Math. (2), 172(3):2223–2251, 2010. [124] A. Soshnikov. Universality at the edge of the spectrum in Wigner random matrices. Comm. Math. Phys., 207(3):697–733, 1999. [125] T. Spencer. Random banded and sparse matrices. In The Oxford handbook of random matrix theory, pages 471–488. Oxford Univ. Press, Oxford, 2011. [126] D. W. Stroock. , an analytic view. Cambridge University Press, Cambridge, 1993. [127] T. Tao. Topics in random matrix theory, volume 132 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2012. [128] T. Tao. The asymptotic distribution of a single eigenvalue gap of a Wigner matrix. Probab. Theory Related Fields, 157(1-2):81–106, 2013. [129] T. Tao and V. Vu. Random matrices: localization of the eigenvalues and the necessity of four moments. Acta Math. Vietnam., 36(2):431–449, 2011. [130] T. Tao and V. Vu. Random matrices: universality of local eigenvalue statistics. Acta Math., 206(1):127– 204, 2011. [131] T. Tao and V. Vu. The Wigner-Dyson-Mehta bulk universality conjecture for Wigner matrices. Elec- tron. J. Probab., 16:no. 77, 2104–2121, 2011. [132] T. Tao and V. Vu. Random matrices: universal properties of eigenvectors. Random Matrices Theory Appl., 1(1):1150001, 27, 2012. [133] C. A. Tracy and H. Widom. Level-spacing distributions and the Airy kernel. Comm. Math. Phys., 159(1):151–174, 1994. [134] C. A. Tracy and H. Widom. On orthogonal and ensembles. Comm. Math. Phys., 177(3):727–754, 1996. [135] B. Valk´oand B. Vir´ag. Continuum limits of random matrices and the Brownian carousel. Invent. Math., 177(3):463–508, 2009. [136] D. Voiculescu. Limit laws for random matrices and free products. Invent. Math., 104(1):201–220, 1991. [137] E. P. Wigner. Characteristic vectors of bordered matrices with infinite dimensions. Ann. of Math. (2), 62:548–564, 1955. [138] J. Wishart. The generalised product moment distribution in samples from a normal multivariate population. Biometrika, 20A(1/2):32–52, 1928. [139] H.-T. Yau. Relative entropy and hydrodynamics of Ginzburg-Landau models. Lett. Math. Phys., 22(1):63–80, 1991. [140] S. Zelditch. Uniform distribution of eigenfunctions on compact hyperbolic surfaces. Duke Math. J., 55(4):919–941, 1987.

166 Index β-ensemble, 17 Green function continuity, 125 Gross’ hypercontractivity bound, 100 Admissible control parameter, 50 Airy function, 21 Haar measure, 16, 155 Airy kernel, 21 Harish-Chandra-Itzykson-Zuber formula, 6 Anderson localization, 154 Helffer-Sj¨ostrandformula, 75 Anderson transition, 154, 156 Herbst bound, 97 Hermite polynomial, 18 Backtracking sequence, 11 Bakry-Emery´ argument, 95 Integrated density of states, 77 Band width, 156 Interlacing of eigenvalues, 36 Brascamp-Lieb inequality, 102, 107 Invariant ensemble, 6, 15, 151 Invariant measure, 25 Catalan numbers, 12 Isotropic local law, 31, 31 Christoffel-Darboux formula, 19 Classical exponent, 17 Level repulsion, 81 Classical location of eigenvalues, 79 Local equilibrium, 7, 25 Concentration inequality, 97 Local relaxation flow, 113 Correlation function, 18 Local relaxation measure, 113 Local semicircle law, 6, 24, 30 Deformed Wigner matrix, 155 Log gas, 17, 151 Delocalization, 24, 154 Logarithmic Sobolev inequality (LSI), 94 Determinantal correlation, 19 Dirichlet form, 83, 94, 111 Marchenko-Pastur law, 11 Dirichlet form inequality, 114, 119 Marcinkiewicz-Zygmund inequality, 44 Dyson Brownian motion (DBM), 7, 25, 80 Mean field model, 10 Dyson conjecture, 7, 25, 84, 112 Moment matching, 138 Moment method, 11 Edge universality, 140, 153 Empirical distribution function, 77 Ornstein-Uhlenbeck process, 6, 24, 80, 125 Empirical eigenvalue distribution, 77 Empirical eigenvalue measure, 13 Partial expectation, 35, 46 Energy parameter, 13 Pinsker inequality, 91 Energy universality; averaged, fixed, 23, 148 Plancherel-Rotach asymptotics, 20, 21 Entropy inequality, 91 Quadratic large deviation, 44 Fluctuation averaging, 31, 46, 56, 63 Quantum (unique) ergodicity, 155, 157 Free convolution, 155 Quaternion determinant, 21

Gap universality; averaged, fixed, 23, 148 Random band matrix, 26, 156 Gaudin distribution, 5 Random Schr¨odingeroperator, 26, 154 Gaussian β-ensemble, 17, 25 Regular graph, 156 Gaussian divisible ensemble, 6, 24, 25 Regular potential, 151 Gaussian Orthogonal Ensemble (GOE), 5, 10 Relative entropy, 90 Gaussian Unitary Ensemble (GUE), 5, 10 Resolvent, 13, 27 Generalized Wigner matrix, 6, 9 Resolvent decoupling identities, 47 Generator, 84 Riemann-Hilbert method, 6 Gibbs inequality, 90 Rigidity of eigenvalues, 24, 79, 151 Gibbs measure, 16 Green function, 27 Sample covariance matrix, 10, 156 Green function comparison, 134, 140, 142 Schur formula, 34

167 Self-consistent equation, 31, 37, 46 Sine-kernel, 20 Single entry distribution, 10 Sparse graph, 156 Spectral domain, 29, 50 Spectral edge, 26 Spectral gap, 98 Spectral parameter, 13 Stieltjes transform, 13, 26, 74 Stochastic domination, 29

Three-step strategy, 6, 24, 149 Tracy-Widom law, 140

Universal Wigner matrix, 9

Vandermonde determinant, 15

Ward identity, 47 Weak local law, 31, 34 Wigner ensemble, 5, 9 Wigner semicircle law, 10, 26 Wigner surmise, 5 Wigner-Dyson-Mehta (WDM) conjecture, 5, 157 Wigner-type matrix, 155 Wishart matrix, 10

168