
Topology Distance: A Topology-Based Approach for Evaluating Generative Adversarial Networks

Danijela Horak¹, Simiao Yu¹, Gholamreza Salimi-Khorshidi¹

¹Investments AI, AIG, London, United Kingdom. Correspondence to: Danijela Horak.

Abstract

Automatic evaluation of the goodness of Generative Adversarial Networks (GANs) has been a challenge for the field of machine learning. In this paper, we propose a distance complementary to existing measures: Topology Distance (TD), the main idea behind which is to compare the geometric and topological features of the latent manifold of real data with those of generated data. More specifically, we build a Vietoris-Rips complex on image features, and define TD based on the differences in persistent-homology groups of the two complexes. We compare TD with the most commonly-used and relevant measures in the field, including Inception Score (IS), Fréchet Inception Distance (FID), Kernel Inception Distance (KID) and Geometry Score (GS), in a range of experiments on various datasets. We demonstrate the unique advantage and superiority of our proposed approach over the aforementioned metrics. A combination of our empirical results and the theoretical argument we propose in favour of TD strongly supports the claim that TD is a powerful candidate metric that researchers can employ when aiming to automatically evaluate the goodness of GANs' learning.

1. Introduction

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a class of deep generative models that have achieved unprecedented performance in generating high-quality and diverse images (Brock et al., 2019). They have also been successfully applied to a variety of image-generation tasks, e.g. super resolution (Ledig et al., 2017), image-to-image translation (Zhu et al., 2017), and text-to-image synthesis (Reed et al., 2016), to name a few. The GAN framework consists of a generator G and a discriminator D, where G generates images X_g that are expected to resemble real images X_r, while D discriminates between X_g and X_r. G and D are trained by playing a two-player minimax game in a competing manner. Such a novel adversarial training process is a key factor in GANs' success: it implicitly defines a learnable objective that is flexibly adaptive to various complicated image-generation tasks, in which it would be difficult or impossible to explicitly define such an objective.

One of the biggest challenges in the field of generative models – including for GANs – is the automatic evaluation of the goodness of such models (e.g., whether or not the data generated by such models are similar to the data they were trained on). Unlike supervised learning, where the goodness of the models can be assessed by comparing their predictions with the actual labels, or some other deep-learning models where the goodness of the model can be assessed using the likelihood of the validation data under the distribution that the real data comes from, in most state-of-the-art generative models we do not know this distribution explicitly or cannot rely on labels for such evaluations.

Given that the data (or their corresponding features) in such situations can be assumed to lie on a manifold embedded in a high-dimensional space (Goodfellow et al., 2016), tools from topology and geometry come as a natural choice when studying differences between two data samples. Hence, we propose topology distance (TD) for the evaluation of GANs; it compares the topological structures of two manifolds, and calculates a distance between them to evaluate their (dis)similarities. We compare TD with widely-used and relevant metrics, and demonstrate that it is more robust to noise than competing distance measures on GANs, and better suited to distinguish among the various shapes that the data might come in. TD is able to evaluate GANs with new insights different from other existing metrics. It can therefore be used either as an alternative to, or in conjunction with, other metrics.

1.1. Related work

There have been multiple metrics proposed to automatically evaluate the performance of GANs. In this paper we focus on the most commonly-used and relevant approaches (as follows); for a more comprehensive review of such measurements, please refer to (Borji, 2018).

Inception Score (IS) The main idea behind IS (Salimans et al., 2016) is that generated images of high quality are expected to meet two requirements: they should contain easily classifiable objects (i.e. the conditional label distribution p(y|x) should have low entropy) and they should be diverse (i.e. the marginal distribution p(y) should have high entropy). IS measures the average KL divergence between these two distributions:

    IS = exp( E_{x∼p_g} [ KL( p(y|x) || p(y) ) ] ),    (1)

where p_g is the generative distribution. IS relies on a pretrained Inception model (Szegedy et al., 2016) for the classification of the generated images. Therefore, a key limitation of IS is that it is unable to evaluate image types that are distinct from those that the Inception model was trained on.
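As a concrete illustration of equation (1), the following minimal sketch computes IS from an array of Inception softmax outputs; the function name and the `probs` input are ours, not the paper's implementation.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Equation (1): IS = exp(E_{x~p_g}[KL(p(y|x) || p(y))]).

    probs: (N, K) array of Inception softmax outputs p(y|x),
    one row per generated image.
    """
    p_y = probs.mean(axis=0, keepdims=True)  # marginal label distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

In practice IS is usually reported as a mean and standard deviation over several splits of the generated set; the sketch omits that detail.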
Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) Proposed by (Heusel et al., 2017), FID relies on a pretrained Inception model, which maps each image to a vector representation (or, features). Given two groups of data in this feature space (one from the real and the other from the generated images), FID measures their similarity, assuming that the features are distributed as a multivariate Gaussian; the distance is then the Fréchet distance (also known as Wasserstein-2 distance) between the two Gaussians:

    FID(p_r, p_g) = ||µ_r − µ_g||_2^2 + Tr( Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2} ),    (2)

where p_r and p_g denote the feature distributions of real and generated data, and (µ_r, Σ_r) and (µ_g, Σ_g) denote the means and covariances of the corresponding feature distributions, respectively. It has been shown that FID is more robust to noise (of certain types) than IS (Heusel et al., 2017), but its assumption of features following a multivariate Gaussian distribution might be an oversimplification.

A similar metric to FID is KID (Bińkowski et al., 2018), which computes the squared maximum mean discrepancy (MMD) between the features (learned also from a pretrained Inception model) of real and generated images:

    KID(p_r, p_g) = E_{x_r, x'_r ∼ p_r}[ k(x_r, x'_r) ] + E_{x_g, x'_g ∼ p_g}[ k(x_g, x'_g) ] − 2 E_{x_r ∼ p_r, x_g ∼ p_g}[ k(x_r, x_g) ],    (3)

where k denotes the polynomial kernel k(x, x') = (x^T x' / d + 1)^3 with feature dimension d. Compared with FID, KID does not assume any parametric form for the feature distribution, and it has an unbiased estimator.

Our proposed TD is closely related to FID and KID in that it also measures the distance between latent features of real and generated data. However, the key distinction of TD is that the target distance is computed by considering the geometric and topological properties of those latent features.
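The sketch below implements equations (2) and (3) over precomputed feature matrices. It is a minimal illustration rather than the authors' code: for brevity, KID is written as the biased V-statistic, whereas the estimator discussed above is unbiased.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_r, feat_g):
    """Equation (2): Frechet distance between Gaussians fitted to
    real/generated feature matrices of shape (N, d)."""
    mu_r, mu_g = feat_r.mean(axis=0), feat_g.mean(axis=0)
    cov_r = np.cov(feat_r, rowvar=False)
    cov_g = np.cov(feat_g, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # tiny imaginary parts are numerical noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

def kid_biased(feat_r, feat_g):
    """Equation (3) with the polynomial kernel k(x, x') = (x.x'/d + 1)^3,
    estimated here as a biased V-statistic for brevity."""
    d = feat_r.shape[1]
    k = lambda a, b: (a @ b.T / d + 1.0) ** 3
    return float(k(feat_r, feat_r).mean()
                 + k(feat_g, feat_g).mean()
                 - 2.0 * k(feat_r, feat_g).mean())
```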
Geometry Score (GS) GS (Khrulkov & Oseledets, 2018) is defined as the l_2 distance between the means of the relative living-times (RLT) vectors associated with the two sets of images. The RLT of a point cloud (e.g., a group of images in the feature space) is an infinite vector (u_1, u_2, ...) whose i-th entry measures the persistent intervals over which the rank of the 1-dimensional persistent homology group equals i. That is,

    u_i = (1 / d_{n(n−1)/2}) Σ_{j=1}^{n(n−1)/2} I_j (d_{j+1} − d_j),

where I_j equals 1 if the rank of the persistent homology group of dimension 1 in the interval [d_j, d_{j+1}] is i, and zero otherwise. The persistent homology parameters d_i, i ∈ [0 ... n(n − 1)/2], are the sorted distances in the observed point cloud.

Geometry score exploits a similar idea to the topology distance, with the differences being in the underlying point cloud data used, the dimensionality of the homology group, and the distance measure between the persistent diagrams. We claim that our method better aligns with the existing theory in the field of computational algebraic topology and has superior experimental results.
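To make the RLT definition concrete, here is a simplified sketch that turns a dimension-1 persistence diagram into a truncated RLT vector. It follows the formula above but cuts the infinite vector off at `max_rank`; it is our illustration, not the Geometry Score reference implementation.

```python
import numpy as np

def relative_living_times(h1_intervals, filtration_values, max_rank=32):
    """Truncated RLT vector: u[i] is the (normalised) total length of
    filtration intervals on which rank(H1) == i.

    h1_intervals: (m, 2) array of (birth, death) pairs in dimension 1.
    filtration_values: sorted pairwise distances d_0 <= ... <= d_max.
    """
    d = np.asarray(filtration_values)
    u = np.zeros(max_rank + 1)
    for lo, hi in zip(d[:-1], d[1:]):
        mid = 0.5 * (lo + hi)
        # rank of H1 on [lo, hi): number of intervals containing the midpoint
        rank = int(((h1_intervals[:, 0] <= mid) & (mid < h1_intervals[:, 1])).sum())
        u[min(rank, max_rank)] += hi - lo
    return u / d[-1]
```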
2. Main idea

According to the manifold hypothesis (Goodfellow et al., 2016), real-world high-dimensional data (and their features) lie on a low-dimensional manifold embedded in a high-dimensional space. The main idea of this paper is to compare the latent manifold of the real data with that of the generated data, based entirely on the topological properties of the data samples from these two manifolds. Let M_r and M_g be the latent manifolds of the real and generated data, respectively. We aim to compare these two manifolds using the finite samples of points V_r from M_r and V_g from M_g.

Most mainstream methods compare the samples V_r and V_g using their lower-order moments (e.g. (Heusel et al., 2017)) – similar to the way we compare two functions using their Taylor expansions, for instance. However, this would only be valid if the underlying manifold were flat (zero curvature), as all moments of the samples are calculated using Euclidean distances. For a Riemannian manifold with nonzero curvature, this type of approach, at least in theory, would not work, and using geodesic distances instead of Euclidean distances would agree more with the manifold hypothesis.

Here we propose the comparison of the two manifolds on the basis of their topology and/or geometry. The ideal way to compare two manifolds would be to infer whether they are geometrically equivalent, i.e. isometric. This, unfortunately, is not attainable. However, we could compare two manifolds by means of the eigenvalues of the Laplace-Beltrami operator (defined only for Riemannian manifolds) on them.

The Laplace-Beltrami spectrum can be regarded as the set of squared frequencies associated with the eigenmodes of an oscillating membrane defined on the manifold. The spectrum, then, is an infinite sequence of eigenvalues, and it satisfies some nice stability properties, whereby a small perturbation in the metric of the underlying Riemannian manifold results in a small perturbation of the spectrum (Donnelly, 2010; Birman, 1963). Furthermore, the Laplace-Beltrami spectrum is widely considered a "fingerprint" of a manifold. In 1966, in the famous paper "Can one hear the shape of a drum?" (Kac, 1966), M. Kac asked whether the eigenvalues of the Laplace-Beltrami operator alone are sufficient to uniquely (up to an isometry) identify a manifold. The answer is unfortunately no, but isospectral manifolds are rare, and when they exist, they share multiple topological and geometric features.

Furthermore, it is possible to translate this methodology to a discrete setting, such that the spectrum calculated on the discrete set relates closely to the spectrum on the manifold itself.

Theorem 1 (Mantuano, 2005) Given a discretisation G = G(V, ε) of a compact Riemannian manifold M which has non-negative sectional curvature κ and non-negative injectivity radius, and for which Ricci(M, g) ≥ −(n − 1)κg, where n is the dimension of the manifold and g is a Riemannian metric, it is possible to associate the eigenvalues of the Laplace operator on the graph G with those of the Laplace-Beltrami operator on M: c_1 λ_k(G) ≤ λ_k(M) ≤ c_2 λ_k(G), for all k < |V|.

The discretisation G(V, ε) of a manifold M is a set of points in M whose pairwise distances are at least ε and for which the union of the balls of radius ε centred at the points of V forms an open cover of M, denoted by U. A version of this theorem also holds for the eigenvalues of the higher-dimensional version of the Laplace-Beltrami operator, called the Laplace-de Rham operator, which reflects higher-dimensional topological and geometric properties of a manifold (Mantuano, 2008). This effectively means that for a sufficiently good sample V from M, calculating the eigenvalues on V would effectively be the same as calculating them on M (see (Dey et al., 2010) for more results).

In our case the manifold M is unknown, and all we know is a sample of points from it: V.
In order to calculate the Laplace-Beltrami spectrum, we need a graph structure on V, which comes through the Čech complex on its cover U. To obtain the Čech complex, one needs the radii of the balls in the cover, i.e., ε. This in itself poses a problem, because it is difficult to determine the right value of ε; there is thus little hope of recovering the spectral properties of M from the point sample V, since we are unable to determine the right value of ε.

A theorem similar to Theorem 1 applies to the homology type of a manifold and its sample.

Theorem 2 Given a Riemannian manifold M, and a sample of points from it, V, which is sufficiently dense, a Vietoris-Rips complex of V at scale ε is homologically equivalent to M.

This theorem is a direct consequence of the famous nerve theorem (Alexandroff, 1928), but it can also be seen as a consequence of Theorem 1, due to the fact that the multiplicity of the eigenvalue zero on the discretised space is exactly the rank of the homology group of dimension zero on the same space.

In practice, as before, one does not know how to choose the scale ε; but unlike before, in this setting we have available a tool that can, and is specifically designed to, deal with the uncertainty of scale: persistent homology. We chose to utilise persistent homology to extract information about the geometry and topology of M, because persistent homology, measured on the sample V, is a reliable shape quantifier of M.

3. Preliminaries

Intuitively speaking, a topological space is any space on which the notion of neighbourhood can be defined. Hence, all metric spaces (and consequently all examples considered in this work) are topological spaces; the opposite is not true (i.e., not all topological spaces can be endowed with a metric).

It is very difficult to directly assess whether two topological spaces are equivalent (homeomorphic); instead, topologists use proxies to measure their similarity. One such proxy is the homology groups, denoted by H_k, k ∈ N_0, which, loosely speaking, encode information on the different types of loops (of different dimensions) that can be observed in the topological space. Here, the following implication holds: if two topological spaces are equivalent, then their homology groups are isomorphic; the opposite is not true.

In this work we will only be concerned with a special class of topological spaces called simplicial complexes. A simplicial complex, commonly denoted by K, is a topological space consisting of vertices in a set V and a set of faces chosen from the power set of V, P(V), with the requirement that if W ∈ K, then all subsets of W are also in K. One way to visualise a simplicial complex is to consider vertices as points in R^n and m-dimensional faces as convex hulls of m + 1 vertices, i.e. edges, triangles, tetrahedra, etc.

Homology groups are algebraic constructions defined by

    H_k(K, R) = Z_k(K, R) / B_k(K, R),    (4)

where k is a non-negative integer, Z_k(K, R) is the vector space of k-dimensional cycles and B_k(K, R) that of k-dimensional boundaries, obtained as pre-images and images of the boundary mapping on a chain complex; for a more detailed account see (Hatcher, 2009). Typically, the rank of the homology group of dimension zero is the number of connected components of K, the rank of H_1 is the number of one-dimensional holes in K, the rank of H_2 is the number of cavities, and so on.

Persistent homology, loosely speaking, is the homology of a topological space measured at different resolutions. More precisely, we study a nested sequence of topological spaces (i.e., a filtration K : K_0 ⊂ K_1 ⊂ ... ⊂ K_N) and measure (calculate) homology at every step. As an example, let us observe the sequence of Vietoris-Rips complexes on a point set V: a Vietoris-Rips complex on a vertex set V with diameter ε is a simplicial complex in which {v_0, ..., v_k} is a simplex iff d(v_i, v_j) < ε for every i, j ≤ k.

An example of a filtration would be VR : VR(V, ε_0) ⊆ VR(V, ε_1) ⊆ ... ⊆ VR(V, ε_n), where 0 = ε_0 ≤ ε_1 ≤ ... ≤ ε_n (see Figure 1 (top)). In other words, persistent homology quantifies the change of topological invariants in VR as the parameter ε changes. Formally,

    H_k^{i,j}(VR) = Z_k^i / (B_k^j ∩ Z_k^i),    (5)

where the j-th persistent homology group of dimension k of the i-th filtration complex VR(V, ε_i) is denoted by H_k^{i,j}(VR). Intuitively, the persistent homology group records the "cycles" at filtration step i which have not become "boundaries" (i.e. which have not effectively disappeared) by filtration step j. For a detailed account see (Edelsbrunner & Harer, 2008).

The main insight when it comes to persistent homology is that the evolution of topological invariants over an increase of the parameter ε can be encoded compactly in the form of a persistent diagram PD and a barcode:

    PD_k(VR) = {(b_i, d_i) | i ∈ N, b_i, d_i ∈ {ε_0, ..., ε_n}},    (6)

where b_i in the pair (b_i, d_i) records the appearance (or "birth") of a k-dimensional homology group and d_i records its disappearance (also referred to as "death"). In the event that a homology group persists, i.e. it has not disappeared by the end of the filtration, we set d_i = ∞. This set of points is represented in the upper triangle of the first quadrant of R^2 (see Figure 1 (bottom)). Another representation is a barcode, where each point (b_i, d_i) is mapped to a bar with starting point b_i and ending point d_i.

Figure 1. (top) A 4-step filtration of the Vietoris-Rips complex on a set of 7 points with increasing radius 0 = ε_1 ≤ ... ≤ ε_4. (bottom) The persistent diagram corresponding to the filtration in the figure on top: the b-axis denotes the "birth" or appearance of a persistent homology group, and the d-axis denotes its "death" or disappearance. Red points represent persistent homology groups of dimension 0, and green ones of dimension 1.

There is a natural measure of distance defined on persistent diagrams, the ∞-Wasserstein distance, also known in the community as the bottleneck distance, which has desirable stability properties with respect to small perturbations (Chazal et al., 2016), but which is sensitive to outliers and mostly unsuitable for use in practice. On the other hand, the p-Wasserstein distance

    W_q(L_p)(PD_k^0, PD_k^1) = inf_{η: PD_k^0 → PD_k^1} [ Σ_{u ∈ PD_k^0} ||u − η(u)||_q^p ]^{1/p},    (7)

where 1 ≤ p, q ≤ ∞ and η ranges over all bijections between the sets of persistent intervals in the diagrams PD^0 and PD^1, shows more potential, as presented in (Chazal et al., 2018), but is computationally demanding.

In practice, many applications of persistent homology have used neither of the two distances, but have relied on ad-hoc distances between persistent diagrams, which do not have a strong backing in theory (e.g. (Bendich et al., 2016; Khrulkov & Oseledets, 2018)).

To conclude, endowed with any distance measure described above, the space of persistent diagrams is a metric space.
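The following minimal sketch shows how such a persistent diagram can be computed in practice with the GUDHI library (the package used for the experiments in Section 5.1); the toy point cloud is ours.

```python
import numpy as np
import gudhi

# A toy point cloud: two well-separated noisy clusters in the plane.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                    rng.normal(3.0, 0.1, (20, 2))])

# Vietoris-Rips filtration and its persistence diagram.
rips = gudhi.RipsComplex(points=points, max_edge_length=10.0)
st = rips.create_simplex_tree(max_dimension=1)  # dimension 1 suffices for H0
diagram = st.persistence()                      # list of (dim, (birth, death))

# Dimension-0 pairs: all 40 classes are born at 0; deaths are the radii at
# which connected components merge (one class never dies, death = inf).
h0 = [(b, d) for dim, (b, d) in diagram if dim == 0]
print(len(h0), sorted(d for _, d in h0)[-3:])
```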

4. Method

The method we propose for evaluating the performance of generative models rests on measuring the differences between the set of images generated by a GAN and the set of original images. We measure the distance on point cloud data in feature space. Let F_r be the set of features of the real images, and F_g the set of features of the generated images, represented in the feature space R^m.

Seen as point cloud data in R^m, one can calculate the distances between the points in F_r. It is worth noting here again that, even though we calculate all the distances using the Euclidean metric, the algorithm will effectively use only "small" distances, and this is in agreement with a potentially non-zero curvature of the manifold (refer to Theorem 1 and Theorem 2 for the full statement of this fact).

Algorithm 1 This algorithm computes the longevity vector l for a set of images. l is of length n, and represents the living times of all n homology classes throughout the filtration (see Section 4 for more details).
  Require: f*_θ: a pretrained feature extractor with parameters θ.
  Require: RC(p): a function for computing the Vietoris-Rips filtration over the given input points p.
  Require: PD(c): a function for computing persistent homology in dimension 0 of the filtration c.
  Input: X ∈ R^{N×H×W×C}: a set of images of size H × W with C channels.
  Compute: F ← f*_θ(X)
  Compute: C ← RC(F)
  Compute: l ← PD(C)

Algorithm 2 This algorithm computes the Topology Distance (TD) between real and generated images.
  Input: X_r ∈ R^{N×H×W×C}: a set of real images of size H × W with C channels.
  Input: X_g ∈ R^{N×H×W×C}: a set of generated images of size H × W with C channels.
  Compute: l_r from X_r using Algorithm 1.
  Compute: l_g from X_g using Algorithm 1.
  Compute: TD(X_r, X_g) ← ||l_r − l_g||_2

Assume that there are n data points in each of F_r and F_g, and let 0 = d_0^r ≤ d_1^r ≤ ... ≤ d_{n(n−1)/2}^r and 0 = d_0^g ≤ d_1^g ≤ ... ≤ d_{n(n−1)/2}^g be the arrays of sorted distances among the vectors in F_r and F_g, respectively. Then we observe the following filtrations: VR(F_r) : VR(F_r, d_0^r) ⊆ ... ⊆ VR(F_r, d_k^r) and VR(F_g) : VR(F_g, d_0^g) ⊆ ... ⊆ VR(F_g, d_l^g), where the distance d_k^r is the minimal distance d for which the corresponding Vietoris-Rips complex becomes fully connected; the same holds for d_l^g. The 0-dimensional persistent homology groups are calculated on the aforementioned filtrations. One consequence of studying only the 0-dimensional persistent homology group is that the rank of the persistent homology group at time 0 will be exactly n, and the persistent diagram will consist of n pairs (b_i, d_i), where b_i denotes the point in the filtration where the observed homology group appeared for the first time (in our case b_i = 0 for every i, due to the choice of filtration), and d_i denotes the point in the filtration where the observed homology group (connected component) merged with another one, or equals ∞ otherwise. This observation holds for both F_r and F_g.

As mentioned in Section 3, a commonly used distance between persistence diagrams in the field of topological data analysis is the bottleneck distance, as it has shown desirable stability properties. However, we have found that this distance is too sensitive to outliers, and its practical applications are limited. Hence, we have chosen to define the distance between the persistence diagrams differently; in fact, we use the inherent properties of our filtration method. We assign to the persistent diagram an n-dimensional vector l(F_r) = (d_0 − b_0, ..., d_n − b_n), called the longevity vector, which represents the sorted living times of each homology group for the point set F_r; the same is done for F_g.

Then, we define the Topology Distance (TD) between two persistent diagrams, and consequently between two corpuses of images, to be the l_2 distance between their longevity vectors, i.e.
TD(F_r, F_g) = ||l(F_r) − l(F_g)||_2, where l(F_r) and l(F_g) are the longevity vectors of the persistent diagrams of the filtrations of the original and generated image features, respectively.

Furthermore, as some persistent pairs may contain ∞, we assume that the difference between two infinite coordinates is 0, and that the difference between ∞ and a non-infinite coordinate is some fixed value larger than the maximum finite longevity.

Our method of translating persistent diagrams into feature vectors is indeed ad-hoc, but the justification comes from the fact that neither the bottleneck distance nor the Wasserstein distance showed satisfactory experimental results. The TD algorithm is summarised in Algorithm 1 and Algorithm 2.
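Putting Algorithms 1 and 2 together, a compact sketch of TD on precomputed feature matrices might look as follows. GUDHI does the persistent-homology work; the shared cap for infinite longevities (here twice the largest finite one) is our choice — the text above only requires some fixed value larger than the maximum finite longevity.

```python
import numpy as np
import gudhi

def lifetimes(features):
    """Algorithm 1 (sketch), minus the feature extractor: death - birth for
    every 0-dimensional persistence pair of the Vietoris-Rips filtration
    built on a feature matrix of shape (n, m)."""
    st = gudhi.RipsComplex(points=features).create_simplex_tree(max_dimension=1)
    st.persistence()
    pairs = st.persistence_intervals_in_dimension(0)  # rows of (birth, death)
    return pairs[:, 1] - pairs[:, 0]                  # births are all 0 here

def topology_distance(feat_r, feat_g):
    """Algorithm 2 (sketch): l2 distance between sorted longevity vectors.
    A shared finite cap replaces infinity, so two infinite coordinates
    differ by 0, as described above."""
    l_r, l_g = lifetimes(feat_r), lifetimes(feat_g)
    cap = 2.0 * max(l_r[np.isfinite(l_r)].max(), l_g[np.isfinite(l_g)].max())
    l_r[~np.isfinite(l_r)] = cap
    l_g[~np.isfinite(l_g)] = cap
    return float(np.linalg.norm(np.sort(l_r) - np.sort(l_g)))
```

Note that both sets must contain the same number of points n, so that the two longevity vectors have equal length.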

5. Experiments

5.1. Datasets and experimental setup

We compared our proposed TD (lower is better) with IS (higher is better), FID (lower is better), KID (lower is better) and GS (lower is better), as introduced in Section 1.1. In addition to some simulated data, which we will introduce in the next section, our experiments were carried out on the following four datasets: Fashion-MNIST (Xiao et al., 2017), CIFAR10 (Krizhevsky, 2009), corrupted CIFAR100 (CIFAR100-C) (Hendrycks & Dietterich, 2019) and CelebA (Liu et al., 2015). Wherever features were required for computing a metric, we used a ResNet18 model (He et al., 2016) trained from scratch for Fashion-MNIST images, and the Inception model (Szegedy et al., 2016) pretrained on ImageNet (Deng et al., 2009) for all other datasets.

We implemented our algorithm using the Python version of GUDHI (http://gudhi.gforge.inria.fr/) for the topology-related computation and PyTorch (Paszke et al., 2019) for building and training the neural network models.
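As an illustration of the feature-extraction step (the f*_θ of Algorithm 1), the sketch below strips the classifier head off a pretrained Inception v3 from a recent torchvision release. The paper does not specify which layer its features are taken from, so the 2048-dimensional pooled features used here are an assumption.

```python
import torch
from torchvision import models, transforms

# ImageNet-pretrained Inception v3 with its final classifier removed,
# exposing the 2048-d pooled features.
model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(pil_images):
    """List of PIL images -> (N, 2048) numpy feature matrix."""
    batch = torch.stack([preprocess(im) for im in pil_images])
    return model(batch).cpu().numpy()
```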

Figure 3. Comparison of FID, KID (×100) and TD on manipulated images (with pixel noise, patch mask and patch exchange). Results are averaged over 10 groups, each consisting of 500 real images (randomly sampled from the CelebA dataset) and the corresponding manipulated counterparts.

5.2. Comparison with FID and KID

The idea of basing the distance measure entirely on the first two moments (e.g., à la FID) can at times be an oversimplification of the underlying distributions, as describing certain distributions requires the use of higher-order moments (e.g., third or fourth moments). Furthermore, if two distributions have identical moments of all orders, it is still possible for them to be different distributions (Romano & Siegel, 1986). This leads to the conclusion that any distance metric based entirely on moments cannot successfully distinguish between all probability distributions.

In order to assess how such theoretical considerations affect FID's performance, we first compared TD and FID on a synthetic dataset. As shown in Figure 2, we aim to calculate the distance between a single Gaussian distribution and a mixture of two Gaussian distributions (the mixture has the same mean and variance as the single Gaussian). Given the identical first and second moments of the two point clouds in this case, as expected, FID cannot discriminate between the two, whereas the difference is very obvious when using TD. KID has similar limitations to FID, as demonstrated in Figure 2.

Figure 2. Heat maps of distance matrices between 5 sample sets (each of which has 600 samples randomly drawn from a single Gaussian distribution, denoted by s_i^1), and another 5 sample sets (each containing 600 samples from a mixture of two Gaussian distributions, denoted by s_i^2). All s_i^j have the same first and second moments. (a) FID. (b) KID. (c) GS. (d) TD.
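A construction such as the following — our illustration, with arbitrary parameters, since the paper does not state the exact ones — produces two point clouds with identical first and second moments, as used for Figure 2; any metric built on those two moments sees no difference between them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, a, s = 600, 64, 3.0, 0.5
e = np.zeros(dim)
e[0] = 1.0                              # the mixture separates along e

# Balanced mixture of N(+a*e, s^2 I) and N(-a*e, s^2 I):
signs = rng.choice([-1.0, 1.0], size=(n, 1))
mixture = signs * a * e + rng.normal(0.0, s, (n, dim))

# Single Gaussian with the same mean (0) and the same covariance
# (s^2 I + a^2 e e^T), i.e. identical first and second moments:
cov = s**2 * np.eye(dim) + a**2 * np.outer(e, e)
single = rng.multivariate_normal(np.zeros(dim), cov, size=n)
```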

[Figure 4: sixteen panels, each plotting TD and GS (×1000) against perturbation severity (1–5) for one CIFAR100-C perturbation type: gaussian noise, shot noise, impulse noise, speckle noise; defocus blur, glass blur, motion blur, zoom blur; snow, frost, fog, brightness; contrast, elastic transform, pixelate, jpeg compression.]

Figure 4. Comparison of perturbation consistency between TD and GS (×1000) on the CIFAR100-C dataset. Each row represents one group of perturbations (i.e. noise, blur, weather and digital), and there are four specific perturbation types in each group. Results are averaged over 10 groups, each of which consists of 500 real images (randomly sampled from the CIFAR100-C dataset) and the corresponding perturbed images.

Next, we compared TD with FID and KID on real images randomly sampled from the CelebA dataset; the goal was to compare the actual images with their manipulated counterparts. More specifically, we performed the three types of manipulations designed by (Liu et al., 2018), which resulted in three new image datasets; we then computed the distance between each of these manipulated image datasets and the original image dataset, using TD, FID and KID.

The three image manipulations are: 1) pixel noise (adding random noise to each pixel, where the noise is uniformly sampled from the interval ±0.13 times the maximum intensity of the image), 2) patch mask (7 out of 64 evenly-divided regions of each image were masked by a random pixel from the image), and 3) patch exchange (2 out of 16 evenly-divided regions of each image were randomly exchanged with each other, performed twice). Some example images after manipulation are shown in Figure 3.

It is clear that the image quality increasingly worsens as we go from pixel noise to patch mask to patch exchange; we expect to see this trend in the metrics.

However, as presented in Figure 3, FID and KID show a decreasing trend (indicating increasingly better quality) over pixel noise, patch mask and patch exchange, which is the opposite of human judgement. In other words, they fail to capture the worsening of image quality expected from the qualitative assessments presented in (Liu et al., 2018) – unlike TD, which captures the change in the quality of the manipulated images very well. Since all three metrics are based on the features extracted by the same Inception model, this experiment demonstrates that the superiority of TD over FID and KID is due to its effective assessment of the topological properties of the point clouds (rather than their lower-order statistics).

5.3. Comparison with GS

So far, we have attempted to demonstrate the effectiveness of topology in assessing the (dis)similarities of two point clouds. As noted earlier, both the topology distance and the geometry score exploit the idea of using topology to quantify dissimilarities between the latent manifolds of data. There are, however, two major differences between TD and GS. The first is in the core method and the way topology is used to construct the distance (for more details see Section 4). The second is that TD measures distances between point cloud data in the feature space, whereas the geometry score is defined on raw pixels.

Figure 2c shows the heatmap of the distances calculated between a single Gaussian distribution and a mixture of two Gaussian distributions using GS. It is clear that TD better discriminates between the samples from the two aforementioned distributions (see Figure 2d).

We then performed a perturbation-consistency comparison between TD and GS using the CIFAR100-C dataset, in which 16 different types of perturbations (grouped in four, namely noise, blur, weather and digital) are applied to the original CIFAR100 images; for each type of perturbation there are five levels of severity. 5,000 images were randomly sampled from the real dataset and split into 10 groups (each with 500 images); for each group, scores were calculated comparing the perturbed images with the original ones.

For every perturbation, as severity increases, the average score (across the 10 groups) should increase monotonically with it. As can be seen in Figure 4, TD captures the levels of perturbation severity much better and more consistently than GS for many types of perturbations (e.g. Gaussian noise, frost, and elastic transform). This shows that the TD trend is more consistent with the perturbation trend than GS, which further demonstrates the advantages of using features over pixels when computing topological properties.
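One simple way to quantify this monotonicity requirement — our addition; the paper inspects the severity curves of Figure 4 directly — is a rank correlation between severity level and average score:

```python
import numpy as np
from scipy.stats import spearmanr

def perturbation_consistency(avg_scores):
    """avg_scores: (5,) average metric values, one per severity level.
    Returns Spearman's rho; a perfectly consistent metric gives 1.0."""
    severities = np.arange(1, 6)
    rho, _ = spearmanr(severities, avg_scores)
    return rho

print(perturbation_consistency(np.array([10.0, 12.5, 13.1, 17.8, 22.4])))  # 1.0
```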

5.4. Comparison with IS

In our next experiment, we compare TD with IS on the CelebA dataset, which contains only face images, so no distinct classes exist. We trained a GAN model (WGAN-GP (Gulrajani et al., 2017)) on the training set of CelebA; the original images were cropped to a size of 64×64, and the model was then trained on them for 200 epochs with a batch size of 64. We recorded TD and IS along the training process: every 4 epochs we fed a randomly sampled noise vector (which remained fixed across epochs) to the model, and then computed TD and IS on the generated and real images.

Figure 5. Comparison of TD and IS on images generated by WGAN-GP along the training process on the CelebA dataset. Results are averaged over 10 groups, each of which consists of 500 real images (randomly sampled from the original dataset) and generated images.

Figure 5 shows the comparison results of TD and IS on the CelebA dataset. TD shows a close correspondence to the quality of the generated images (i.e., a decreasing trend with improving image quality). By contrast, IS fails to do so: at the early stages of training (before 20 epochs) it decreases as the quality of the generated images increases – in contrast to what is expected – and it eventually loses its discrimination over the remaining epochs. In summary, TD shows superiority over IS for evaluating the quality of images from datasets such as CelebA.

5.5. Pixels vs. features

Finally, we performed an ablation study to compare the usage of pixels vs. features when computing TD. We trained two WGAN-GP models, respectively, on the Fashion-MNIST (trained for 100 epochs) and CIFAR10 (trained for 200 epochs) datasets. We then computed pixel-based and feature-based TD between images generated by WGAN-GP trained for different numbers of epochs and real images randomly sampled from each dataset. As can be seen clearly from Figure 6, for both datasets, feature-based TD demonstrates better performance in terms of discrimination and consistency. We attribute this to the better generalisation of learned features compared to raw pixels, which is one of the most significant advances of deep neural networks (Bengio et al., 2013).

Figure 6. Comparison of TD based on pixels and features on different datasets. Results are averaged over 10 groups, each of which consists of 500 real images (randomly sampled from the original dataset) and generated images. (a) Fashion-MNIST. (b) CIFAR10.

6. Conclusion

In this work, we introduced Topology Distance (TD), a novel metric for evaluating GANs that considers the topological structures of the latent manifolds of real and generated images. In a range of experiments we have compared TD with Inception Score (IS), Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Geometry Score (GS), and have demonstrated its advantages and superiority over them in terms of consistency with human judgement, as well as with other quantitative measures of change in image quality. TD is capable of providing new insights for the evaluation of GANs, and it can thus be used in conjunction with other metrics when evaluating GANs.

References

Alexandroff, P. Über den allgemeinen Dimensionsbegriff und seine Beziehungen zur elementaren geometrischen Anschauung. Mathematische Annalen, 98:617–635, 1928.

Bendich, P., Marron, J. S., Miller, E., Pieloch, A., and Skwerer, S. Persistent homology analysis of brain artery trees. Ann. Appl. Stat., 10(1):198–218, 2016.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.

Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In ICLR, 2018.

Birman, M. S. On existence conditions for wave operators. Izv. Akad. Nauk SSSR, 27(4):883–906, 1963.

Borji, A. Pros and cons of GAN evaluation measures. arXiv preprint arXiv:1802.03446, 2018.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.

Chazal, F., De Silva, V., Glisse, M., and Oudot, S. The Structure and Stability of Persistence Modules. SpringerBriefs in Mathematics, 2016.

Chazal, F., Fasy, B., Lecci, F., Michel, B., Rinaldo, A., and Wasserman, L. Robust topological inference: Distance to a measure and kernel distance. J. Mach. Learn. Res., 18(159):1–40, 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Dey, T. K., Ranjan, P., and Wang, Y. Convergence, stability, and discrete approximation of Laplace spectra. In SODA, 2010.

Donnelly, H. Spectral theory of complete Riemannian manifolds. Pure Appl. Math. Q., 6(2):439–456, 2010.

Edelsbrunner, H. and Harer, J. Persistent homology – a survey. Contemp. Math., 453:257–282, 2008.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In NeurIPS, 2017.

Hatcher, A. Algebraic Topology. Cambridge University Press, 2009.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.

Kac, M. Can one hear the shape of a drum? Amer. Math. Monthly, 73(4):1–23, 1966.

Khrulkov, V. and Oseledets, I. V. Geometry score: a method for comparing generative adversarial networks. In ICML, 2018.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.

Ledig, C., Theis, L., Huszár, F., Caballero, J., Aitken, A. P., Tejani, A., Totz, J., Wang, Z., and Shi, W. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.

Liu, S., Wei, Y., Lu, J., and Zhou, J. An improved evaluation framework for generative adversarial networks. arXiv preprint arXiv:1803.07474, 2018.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In ICCV, 2015.

Mantuano, T. Discretization of compact Riemannian manifolds applied to the spectrum of Laplacian. Ann. Glob. Anal. Geom., 27:33–46, 2005.

Mantuano, T. Discretization of Riemannian manifolds applied to the Hodge Laplacian. Am. J. Math., 130(6):1477–1508, 2008.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. Generative adversarial text to image synthesis. In ICML, 2016.

Romano, J. P. and Siegel, A. Counterexamples in Probability and Statistics. Chapman and Hall/CRC, 1986.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In NeurIPS, 2016.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception architecture for computer vision. In CVPR, 2016.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

Supplementary Material for Topology Distance

A. Comparison of Sample Quality Correlation

An essential property of a metric for evaluating GANs is a good correlation with the quality of generated samples. We thus performed a comprehensive comparison of image-quality correlation among Fréchet Inception Distance (FID), Kernel Inception Distance (KID), Geometry Score (GS), Inception Score (IS) and our proposed Topology Distance (TD), on three datasets (i.e. Fashion-MNIST, CIFAR10 and CelebA). Specifically, for each dataset, we first trained a WGAN-GP model for 200 epochs with a batch size of 64. We then computed the scores of each metric between real and generated images every 4 epochs. Results were averaged over 10 groups, each of which consisted of 500 real images (randomly sampled from each original dataset) and the corresponding generated images. The results are shown in Figure A. Compared with the other metrics, TD demonstrates better (vs. GS and IS) or comparable (vs. FID and KID) sample quality correlation. It is worth noting that the advantages of our proposed TD over other metrics (as shown in our other experiments) make TD stand out as a robust and strong alternative for evaluating GANs.

[Figure A: a 5×3 grid of panels — one row per metric (FID, KID, GS, IS, TD) and one column per dataset (Fashion-MNIST, CIFAR10, CelebA) — each plotting the metric against training epochs (0–200).]

Figure A. Comparison of sample quality correlation among FID, KID, GS, IS and our proposed TD, on Fashion-MNIST, CIFAR10, and CelebA. Results are averaged over 10 groups, each of which consists of 500 real images (randomly sampled from the original dataset) and generated images.