Topology Distance: A Topology-Based Approach For Evaluating Generative Adversarial Networks
Danijela Horak¹, Simiao Yu¹, Gholamreza Salimi-Khorshidi¹
arXiv:2002.12054v1 [cs.LG] 27 Feb 2020

Abstract

Automatic evaluation of the goodness of Generative Adversarial Networks (GANs) has been a challenge for the field of machine learning. In this work, we propose a distance complementary to existing measures: Topology Distance (TD), the main idea behind which is to compare the geometric and topological features of the latent manifold of real data with those of generated data. More specifically, we build a Vietoris-Rips complex on image features, and define TD based on the differences in persistent-homology groups of the two manifolds. We compare TD with the most commonly-used and relevant measures in the field, including Inception Score (IS), Fréchet Inception Distance (FID), Kernel Inception Distance (KID) and Geometry Score (GS), in a range of experiments on various datasets. We demonstrate the unique advantages and superiority of our proposed approach over the aforementioned metrics. A combination of our empirical results and the theoretical argument we propose in favour of TD strongly supports the claim that TD is a powerful candidate metric that researchers can employ when aiming to automatically evaluate the goodness of GANs' learning.

1. Introduction

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a class of deep generative models that have achieved unprecedented performance in generating high-quality and diverse images (Brock et al., 2019). They have also been successfully applied to a variety of image-generation tasks, e.g. super resolution (Ledig et al., 2017), image-to-image translation (Zhu et al., 2017), and text-to-image synthesis (Reed et al., 2016), to name a few.

The GAN framework consists of a generator G and a discriminator D, where G generates images X_g that are expected to resemble real images X_r, while D discriminates between X_g and X_r. G and D are trained by playing a two-player minimax game in a competing manner. Such an adversarial training process is a key factor in GANs' success: it implicitly defines a learnable objective that is flexibly adaptive to various complicated image-generation tasks, for which it would be difficult or impossible to explicitly define such an objective.

One of the biggest challenges in the field of generative models, including GANs, is the automatic evaluation of the goodness of such models (e.g., whether or not the data generated by such models are similar to the data they were trained on). Unlike supervised learning, where the goodness of a model can be assessed by comparing its predictions with the actual labels, or some other deep-learning models, where the goodness of the model can be assessed using the likelihood of the validation data under the distribution that the real data comes from, in most state-of-the-art generative models we do not know this distribution explicitly and cannot rely on labels for such evaluations.

Given that the data (or their corresponding features) in such situations can be assumed to lie on a manifold embedded in a high-dimensional space (Goodfellow et al., 2016), tools from topology and geometry come as a natural choice when studying differences between two datasets. Hence, we propose Topology Distance (TD) for the evaluation of GANs; it compares the topological structures of two manifolds, and calculates a distance between them to evaluate their (dis)similarities. We compare TD with widely-used and relevant metrics, and demonstrate that it is more robust to noise than competing distance measures on GANs, and that it is better suited to distinguish among the various shapes that the data might come in. TD is able to evaluate GANs with new insights different from those of other existing measurements. It can therefore be used either as an alternative to, or in conjunction with, other metrics.

1.1. Related work

There have been multiple metrics proposed to automatically evaluate the performance of GANs. In this paper we focus

¹Investments AI, AIG, London, United Kingdom. Correspondence to: Danijela Horak
Homology groups are algebraic constructions defined by

H_k(K, R) = Z_k(K, R) / B_k(K, R),   (4)

where k is a non-negative integer, Z_k(K, R) is the vector space of k-dimensional cycles and B_k(K, R) that of k-dimensional boundaries, obtained as preimages and images of the boundary mapping on a chain complex; for a more detailed account see (Hatcher, 2009). Typically, the rank of the homology group of dimension zero is the number of connected components of K, the rank of H_1 is the number of one-dimensional holes in K, the rank of H_2 is the number of cavities, and so on.

Persistent homology, loosely speaking, is the homology of a topological space measured at different resolutions. More precisely, we study a nested sequence of topological spaces (i.e., a filtration K : K_0 ⊂ K_1 ⊂ ... ⊂ K_N) and measure (calculate) homology at every step. As an example, consider the sequence of Vietoris-Rips complexes on a point set V: a Vietoris-Rips complex on a vertex set V with diameter ε is a simplicial complex in which {v_0, ..., v_k} is a simplex iff d(v_i, v_j) < ε for every i, j ≤ k.

Figure 1. (top) The 4-step filtration of the Vietoris-Rips complex on a set of 7 points with increasing radius 0 = ε_1 ≤ ... ≤ ε_4. (bottom) The persistence diagram corresponding to the filtration in the figure on top: the b-axis denotes the "birth" (appearance) of a persistent homology group, and the d-axis denotes its "death" (disappearance). Red points represent persistent homology groups of dimension 0, and the green ones of dimension 1.

An example of a filtration would be VR : VR(V, ε_0) ⊆ VR(V, ε_1) ⊆ ... ⊆ VR(V, ε_n), where 0 = ε_0 ≤ ε_1 ≤ ... ≤ ε_n (see Figure 1 (top)). In other words, persistent homology quantifies the change of topological invariants in VR as the parameter ε changes. Formally,

H_k^{i,j}(VR) = Z_k^i / (B_k^j ∩ Z_k^i),   (5)

where H_k^{i,j}(VR) denotes the j-th persistent homology group of dimension k of the i-th filtration complex VR(V, ε_i). Intuitively, the persistent homology group records the "cycles" at filtration step i which have not become "boundaries" (i.e., which have not effectively disappeared) at filtration step j. For a detailed account see (Edelsbrunner & Harer, 2008).

The main insight when it comes to persistent homology is that the evolution of topological invariants over an increase in the parameter ε can be encoded compactly in the form of a persistence diagram PD and a barcode:

PD_k(VR) = {(b_i, d_i) | i ∈ N, b_i, d_i ∈ {0, ..., n}},   (6)

where b_i in the pair (b_i, d_i) records the appearance (or "birth") of a k-dimensional homology group and d_i records its disappearance (also referred to as "death"). In the event that a homology group persists, i.e. it has not disappeared by the end of the filtration, we set d_i = ∞. This set of points is represented in the upper triangle of the first quadrant of R² (see Figure 1 (bottom)). Another representation is a barcode, where each point (b_i, d_i) is mapped to a bar with starting point b_i and ending point d_i.

There is a natural measure of distance defined on persistence diagrams, the ∞-Wasserstein distance, also known in the community as the bottleneck distance, which has desirable stability properties with respect to small perturbations (Chazal et al., 2016), but is sensitive to outliers and mostly unsuitable for use in practice. On the other hand, the p-Wasserstein distance

W_q(L_p)(PD_k^0, PD_k^1) = inf_{η: PD_k^0 → PD_k^1} [ Σ_{u ∈ PD_k^0} ||u − η(u)||_q^p ]^{1/p},   (7)

where 1 ≤ p, q ≤ ∞ and η ranges over all bijections between the sets of persistent intervals in diagrams PD^0 and PD^1, shows more potential, as presented in (Chazal et al., 2018), but is computationally demanding.

In practice, many applications of persistent homology have used neither of the two distances, but have relied on ad-hoc distances between persistence diagrams, which do not have a strong backing in theory (e.g. (Bendich et al., 2016; Khrulkov & Oseledets, 2018)).

To conclude, endowed with any of the distance measures described above, the space of persistence diagrams is a metric space.
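For tiny diagrams of equal size, the distance in Eq. (7) can be computed by brute force over all bijections η. The following sketch is our illustration (the function names are ours, not a standard library API); setting p = q = ∞ recovers the bottleneck distance:

```python
import itertools
import math

def lq_norm(u, v, q):
    """L_q distance between two diagram points u = (birth, death)."""
    if math.isinf(q):
        return max(abs(a - b) for a, b in zip(u, v))
    return sum(abs(a - b) ** q for a, b in zip(u, v)) ** (1.0 / q)

def wasserstein(pd0, pd1, p=2, q=2):
    """Eq. (7): infimum over bijections eta of (sum ||u - eta(u)||_q^p)^(1/p).

    Brute force over all bijections, so only practical for tiny diagrams
    of equal size; p = inf gives the bottleneck distance.
    """
    assert len(pd0) == len(pd1)
    best = math.inf
    for perm in itertools.permutations(pd1):
        costs = [lq_norm(u, v, q) for u, v in zip(pd0, perm)]
        if math.isinf(p):
            total = max(costs)  # bottleneck: cost of the worst matched pair
        else:
            total = sum(c ** p for c in costs) ** (1.0 / p)
        best = min(best, total)
    return best

pd0 = [(0.0, 1.0), (0.0, 3.0)]
pd1 = [(0.0, 1.5), (0.0, 3.0)]
print(wasserstein(pd0, pd1, p=2, q=2))                  # 0.5
print(wasserstein(pd0, pd1, p=math.inf, q=math.inf))    # bottleneck: 0.5
```

The factorial cost over bijections is exactly why, in practice, one resorts to combinatorial optimisation (e.g. Hungarian matching) or, as in this paper, to a cheaper ad-hoc distance.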
4. Method

The method we propose for evaluating the performance of generative models rests on measuring the differences between the set of images generated by a GAN and the set of original images. We measure the distance on the point cloud data in feature space. Let F_r be the set of features of the real images and F_g the set of features of the generated images, represented in the feature space R^m. Seen as point cloud data in R^m, one can calculate the distances between the points in F_r. It is worth noting here again that, even though we calculate all the distances using the Euclidean metric, the algorithm will effectively use only "small" distances, and this is in agreement with the potential non-zero curvature of the manifold (refer to Theorem 1 and Theorem 2 for the full statement of this fact).

Assume that there are n data points in F_r and F_g, and let 0 = d_0^r ≤ d_1^r ≤ ... ≤ d_{n(n−1)/2}^r and 0 = d_0^g ≤ d_1^g ≤ ... ≤ d_{n(n−1)/2}^g be the arrays of sorted distances among the vectors in F_r and F_g, respectively. Then we observe the following filtrations: VR(F_r) : VR(F_r, d_0^r) ⊆ ... ⊆ VR(F_r, d_k^r) and VR(F_g) : VR(F_g, d_0^g) ⊆ ... ⊆ VR(F_g, d_l^g), where the distance d_k^r is the minimal distance for which the corresponding Vietoris-Rips complex becomes fully connected, and likewise for d_l^g. The 0-dimensional persistent homology groups are calculated on the aforementioned filtrations. One consequence of studying only the 0th-dimensional persistent homology group is that the rank of the persistent homology group at time 0 will be exactly n, and the persistence diagram will consist of n pairs (b_i, d_i), where b_i denotes the point in the filtration where the observed homology group appeared for the first time (in our case b_i = 0 for every i, due to the choice of filtration), and d_i denotes the point in the filtration where the observed homology group (connected component) merged with another one, or is equal to ∞ otherwise. This observation holds for both F_r and F_g.

Algorithm 1 Compute the longevity vector l for a set of images; l is of length n and represents the living times of all n homology classes throughout the filtration (see Section 4 for more details).
  Require: f_θ: a pretrained feature extractor with parameters θ.
  Require: RC(p): a function for computing the Vietoris-Rips filtration over the given input points p.
  Require: PD(c): a function for computing persistent homology in dimension 0 of a filtration c.
  Input: X ∈ R^{N×H×W×C}: a set of images of size H × W with C channels.
  Compute: F ← f_θ(X)
  Compute: C ← RC(F)
  Compute: l ← PD(C)

Algorithm 2 Compute the Topology Distance (TD) between real and generated images.
  Input: X_r ∈ R^{N×H×W×C}: a set of real images of size H × W with C channels.
  Input: X_g ∈ R^{N×H×W×C}: a set of generated images of size H × W with C channels.
  Compute: l_r from X_r using Algorithm 1.
  Compute: l_g from X_g using Algorithm 1.
  Compute: TD(X_r, X_g) ← ||l_r − l_g||_2

As mentioned in Section 3, a commonly used distance between persistence diagrams in the field of topological data analysis is the bottleneck distance, as it has shown desirable stability properties. However, we have found that this distance is too sensitive to outliers, and its practical applications are limited. Hence, we have chosen to define the distance between persistence diagrams differently; in fact, we have used the inherent properties of our filtration method. We assign to the persistence diagram an n-dimensional vector l(F_r) = (d_1 − b_1, ..., d_n − b_n), called the longevity vector, which represents the sorted living times of each homology group for the point set F_r; the same holds for F_g.

Then, we define the Topology Distance (TD) between two persistence diagrams, and consequently between two corpuses of images, to be the l_2 distance between their longevity vectors, i.e. TD(F_r, F_g) = ||l(F_r) − l(F_g)||_2, where l(F_r) and l(F_g) are the longevity vectors of the persistence diagrams of the filtrations of the original and generated image features, respectively.

Furthermore, as some persistent pairs may contain ∞, we assume that the difference between two infinite coordinates is 0, and that the difference between ∞ and a non-infinite coordinate is some fixed value larger than the maximum finite longevity.

Our method of translating persistence diagrams into feature vectors is indeed ad-hoc, but the justification comes from the fact that neither the bottleneck distance nor the Wasserstein distance showed satisfactory experimental results. The TD algorithm is summarised in Algorithm 1 and Algorithm 2.

5. Experiments

5.1. Datasets and experimental setup

We compared our proposed TD (lower is better) with IS (higher is better), FID (lower is better), KID (lower is better) and GS (lower is better), as introduced in Section 1.1. In addition to some simulated data, which we introduce in the next section, our experiments were carried out on the following four datasets: Fashion-MNIST (Xiao et al., 2017), CIFAR10 (Krizhevsky, 2009), corrupted CIFAR100 (CIFAR100-C) (Hendrycks & Dietterich, 2019) and CelebA (Liu et al., 2015). Wherever features were required for computing a metric, we used a ResNet18 model (He et al., 2016) trained from scratch for Fashion-MNIST images, and the Inception model (Szegedy et al., 2016) pretrained on ImageNet (Deng et al., 2009) for all other datasets. We implemented our algorithm using the Python version of GUDHI¹ for topology-related computation and PyTorch (Paszke et al., 2019) for building and training neural network models.
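Because only 0-dimensional persistent homology with all births at 0 is used, the longevity vector of Algorithm 1 reduces to the sorted merge distances of the point cloud, i.e. the edge weights of its minimum spanning tree (single-linkage clustering), plus one class that never dies. A minimal sketch under that observation, using plain Euclidean points in place of extracted features (function names are ours, not the paper's code):

```python
import math
from itertools import combinations

def longevity_vector(points):
    """Sorted living times of the n 0-dimensional homology classes of the
    Vietoris-Rips filtration on `points` (our sketch of Algorithm 1's PD step,
    without the pretrained feature extractor).

    All classes are born at 0; a class dies when its connected component
    merges with another, i.e. at the corresponding minimum-spanning-tree
    edge weight (Kruskal with union-find). One class never dies (inf).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # two components merge: one class dies at d
            parent[ri] = rj
            deaths.append(d)
    return sorted(deaths) + [math.inf]   # lifetimes (birth = 0) plus the persistent class

def topology_distance(l_r, l_g):
    """l2 distance between longevity vectors (Algorithm 2), with the paper's
    convention that the difference of two infinite coordinates counts as 0."""
    diffs = (0.0 if math.isinf(a) and math.isinf(b) else a - b
             for a, b in zip(l_r, l_g))
    return math.sqrt(sum(x * x for x in diffs))

real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
fake = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
print(topology_distance(longevity_vector(real), longevity_vector(fake)))  # 1.0
```

In the paper's pipeline the inputs would be feature vectors f_θ(X) rather than raw points, and the Rips filtration would be computed with GUDHI; the union-find shortcut above is mathematically equivalent for dimension 0.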
Figure 3. Comparison of FID, KID (×100) and TD on manipu- lated images (with pixel noise, patch mask and patch exchange). Results are averaged over 10 groups, each consisting of 500 real images (randomly sampled from the CelebA dataset) and the cor- responding manipulated counterparts.
5.2. Comparison with FID and KID

The idea of basing a distance measure entirely on the first two moments (e.g., à la FID) can at times be an oversimplification of the underlying distributions, as describing certain distributions requires the use of higher-order statistics (e.g., third or fourth moments). Furthermore, if two distributions have identical moments of all orders, it is still possible for them to be different distributions (Romano & Siegel, 1986). This leads to the conclusion that any distance metric based entirely on moments cannot successfully distinguish between all probability distributions.

In order to assess how such theoretical considerations affect the FID score's performance, we first compared TD and FID on a synthetic dataset. As shown in Figure 2, we aim to calculate the distance between a single Gaussian distribution and a mixture of two Gaussian distributions (the mixture has the same mean and variance as the single Gaussian). Given the identical first and second moments of the two point clouds in this case, as expected, FID cannot discriminate between the two, whereas the difference is very obvious when using TD. KID has similar limitations as FID, as demonstrated in Figure 2.

Figure 2. Heat maps of distance matrices between 5 sample sets (each of which has 600 samples randomly sampled from a single Gaussian distribution, denoted by s_i^1), and another 5 sample sets (each containing 600 samples from a mixture of two Gaussian distributions, denoted by s_i^2). All s_i^j have the same first and second moments. (a) FID. (b) KID. (c) GS. (d) TD.

Next, we compared TD with FID and KID on real images, randomly sampled from the CelebA dataset; the goal was to compare the actual images with their manipulated counterparts. More specifically, we performed three types of manipulations designed by (Liu et al., 2018), which resulted in three new image datasets; we then computed the distance between each one of these manipulated image datasets and the original image dataset, using TD, FID and KID.

The three image manipulations are: 1) pixel noise (a random noise is added to each pixel, uniformly sampled from the interval ±0.13 times the maximum intensity of the image), 2) patch mask (7 out of 64 evenly-divided regions of each image are masked by a random pixel from the image), and 3) patch exchange (2 out of 16 evenly-divided regions of each image are randomly exchanged with each other, performed twice). Some example images after manipulation are shown in Figure 3.

¹http://gudhi.gforge.inria.fr/
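The single-Gaussian-vs-mixture construction is easy to reproduce: a balanced mixture of N(−a, s²) and N(a, s²) has mean 0 and variance a² + s², matching N(0, a² + s²), while the fourth moments differ (3(a² + s²)² vs. a⁴ + 6a²s² + 3s⁴). A small sampling check (the parameter values are illustrative, not those of the paper's experiment):

```python
import random
import statistics

random.seed(0)
a, s, n = 2.0, 0.5, 200_000
sigma = (a * a + s * s) ** 0.5          # std of the single Gaussian

# Single Gaussian N(0, sigma^2) vs. balanced mixture of N(-a, s^2), N(a, s^2)
single = [random.gauss(0.0, sigma) for _ in range(n)]
mixture = [random.choice((-a, a)) + random.gauss(0.0, s) for _ in range(n)]

def moment(xs, k):
    """Empirical k-th raw moment."""
    return statistics.fmean(x ** k for x in xs)

for k in (1, 2, 4):
    print(k, round(moment(single, k), 2), round(moment(mixture, k), 2))
```

The first and second moments agree up to sampling error, so FID (a function of mean and covariance) sees the two clouds as identical, while the fourth moments differ substantially, which is exactly the structure TD can pick up.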
[Figure 4 panels: TD and GS (×1000) vs. severity 1–5 for each of the 16 perturbations: gaussian noise, shot noise, impulse noise, speckle noise; defocus blur, glass blur, motion blur, zoom blur; snow, frost, fog, brightness; contrast, elastic transform, pixelate, jpeg compression.]
Figure 4. Comparison of perturbation consistency between TD and GS (×1000) on CIFAR100-C dataset. Each row represents one group of perturbation (i.e. noise, blur, weather and digital), and there are four specific perturbation types in each group. Results are averaged over 10 groups, each of which consists of 500 real images (randomly sampled from the CIFAR100-C dataset) and the corresponding images with perturbation.
It is clear that the image quality increasingly worsens as we go from pixel noise to patch mask and patch exchange; we expect to see this trend in the metrics. However, as presented in Figure 3, FID and KID show a decreasing trend (indicating increasingly better quality) over pixel noise, patch mask and patch exchange, which is the opposite of human judgement. In other words, they fail to capture the worsening of image quality expected from the qualitative assessments presented in (Liu et al., 2018), unlike TD, which captures the change in the quality of the manipulated images very well. Since all three metrics are based on the features extracted by the same Inception model, this experiment demonstrates that the superiority of TD over FID and KID is due to its effective assessment of the topological properties of the point clouds (rather than their lower-order statistics).

5.3. Comparison with GS

So far, we have attempted to demonstrate the effectiveness of topology in assessing the (dis)similarities of two point clouds. As noted earlier, both Topology Distance and Geometry Score exploit the idea of using topology to quantify dissimilarities between the latent manifolds of data. There are, however, two major differences between TD and GS. The first is in the core method and the way topology is used to construct the distance (for more details see Section 4). The second is that TD measures distances between point cloud data in the feature space, whereas Geometry Score is defined on raw pixels.

Figure 2c shows the heatmap of the distance matrix calculated between a single Gaussian distribution and a mixture of two Gaussian distributions using GS. It is clear that TD better discriminates between the samples from the two aforementioned distributions (see Figure 2d).

We then performed a perturbation consistency comparison between TD and GS using the CIFAR100-C dataset, in which 16 different types of perturbations (grouped in four, namely noise, blur, weather and digital) are applied to the original CIFAR100 images; for each type of perturbation there are five levels of severity. 5,000 images were randomly sampled from the real dataset and split into 10 groups (each with 500 images); for each group, scores are calculated comparing the perturbed images and the originals. For every perturbation, as severity increases, the average score (across the 10 groups) should increase monotonically with it. As can be seen in Figure 4, TD captures the levels of perturbation severity much better and more consistently than GS for many types of perturbations (e.g. Gaussian noise, frost, and elastic transform). This demonstrates that the TD trend is more consistent with the perturbation trend than GS, which further demonstrates the advantages of using features over pixels when computing topological properties.
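The consistency criterion (the average score should increase monotonically with severity) can be quantified with a simple check, e.g. the fraction of consecutive severity steps on which the averaged score rises. The helper below and the score values in it are hypothetical illustrations, not the paper's data:

```python
def monotonicity(scores):
    """Fraction of consecutive severity steps where the averaged score
    increases (1.0 = perfectly monotone). `scores` holds one averaged
    score per severity level 1..5."""
    steps = list(zip(scores, scores[1:]))
    return sum(b > a for a, b in steps) / len(steps)

# Hypothetical averaged scores over severities 1..5 for one perturbation:
td_scores = [10.1, 14.3, 19.8, 26.0, 33.5]   # strictly increasing
gs_scores = [0.40, 0.80, 0.65, 1.10, 0.95]   # oscillating

print(monotonicity(td_scores))  # 1.0
print(monotonicity(gs_scores))  # 0.5
```

A rank correlation against the severity levels (e.g. Spearman's ρ) would serve the same purpose; this step-counting variant keeps the example dependency-free.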
5.4. Comparison with IS

In our next experiment, we compare TD with IS on the CelebA dataset, which contains only face images and thus no distinct classes. We trained a GAN model (WGAN-GP (Gulrajani et al., 2017)) on the training set of CelebA; original images were cropped to be of size 64×64, and the model was then trained on them for 200 epochs with a batch size of 64. We recorded TD and IS along the training process: every 4 epochs we fed the randomly sampled noise vector (which remained fixed across epochs) to the model at that stage; we then computed TD and IS on the generated and real images.

Figure 5. Comparison of TD and IS on images generated by WGAN-GP along the training process on the CelebA dataset. Results are averaged over 10 groups, each of which consists of 500 real images (randomly sampled from the original dataset) and generated images.

Figure 5 shows the comparison of TD and IS on the CelebA dataset. TD corresponds well to the quality of the generated images (i.e., a decreasing trend with improving image quality). By contrast, IS fails to do so: at the early stages of training (before 20 epochs) it decreases as the quality of the generated images increases, in contrast to what is expected, and it eventually loses its discrimination power over the remaining epochs. In summary, TD shows superiority over IS for evaluating the quality of images from datasets such as CelebA.

5.5. Pixels vs. features

Finally, we performed an ablation study to compare the usage of pixels vs. features when computing TD. We trained two WGAN-GP models, respectively, on the Fashion-MNIST (trained for 100 epochs) and CIFAR10 (trained for 200 epochs) datasets. We then computed pixel-based and feature-based TD between images generated by WGAN-GP trained for different numbers of epochs and real images randomly sampled from each dataset. As can be seen clearly from Figure 6, for both datasets feature-based TD demonstrates better performance in terms of discrimination and consistency. This is attributed to the better generalisation of learned features compared with raw pixels, which is one of the most significant advances of deep neural networks (Bengio et al., 2013).

Figure 6. Comparison of TD based on pixels and features on different datasets. Results are averaged over 10 groups, each of which consists of 500 real images (randomly sampled from the original dataset) and generated images. (a) Fashion-MNIST. (b) CIFAR10.

6. Conclusion

In this work, we introduced Topology Distance (TD), a novel metric to evaluate GANs by considering the topological structures of the latent manifolds of real and generated images. In a range of experiments we have compared TD with Inception Score (IS), Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Geometry Score (GS), and have demonstrated its advantages and superiority over them in terms of consistency with human judgement, as well as other quantitative measures of change in image quality. TD is capable of providing new insights for the evaluation of GANs, and it can thus be used in conjunction with other metrics when evaluating GANs.

References

Alexandroff, P. Über den allgemeinen Dimensionsbegriff und seine Beziehungen zur elementaren geometrischen Anschauung. Mathematische Annalen, 98:617–635, 1928.

Bendich, P., Marron, J. S., Miller, E., Pieloch, A., and Skwerer, S. Persistent homology analysis of brain artery trees. Ann. Appl. Stat., 10(1):198–218, 2016.

Bengio, Y., Courville, A., and Vincent, P. Representation
learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.

Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In ICLR, 2018.

Birman, M. S. On existence conditions for wave operators. Izv. Akad. Nauk SSSR, 27(4):883–906, 1963.

Borji, A. Pros and cons of GAN evaluation measures. arXiv preprint arXiv:1802.03446, 2018.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.

Chazal, F., De Silva, V., Glisse, M., and Oudot, S. The structure and stability of persistence modules. In SpringerBriefs in Mathematics, 2016.

Chazal, F., Fasy, B., Lecci, F., Michel, B., Rinaldo, A., and Wasserman, L. Robust topological inference: Distance to a measure and kernel distance. J. Mach. Learn. Res., 18(159):1–40, 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Dey, T. K., Ranjan, P., and Wang, Y. Convergence, stability, and discrete approximation of Laplace spectra. In SODA, 2010.

Donnelly, H. Spectral theory of complete Riemannian manifolds. Pure Appl. Math. Q., 6(2):439–456, 2010.

Edelsbrunner, H. and Harer, J. Persistent homology - a survey. Discrete Comput. Geom., 453:257, 2008.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In NeurIPS, 2017.

Hatcher, A. Algebraic Topology. Cambridge University Press, 2009.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.

Kac, M. Can one hear the shape of a drum? Amer. Math. Monthly, 73(4):1–23, 1966.

Khrulkov, V. and Oseledets, I. V. Geometry score: a method for comparing generative adversarial networks. In ICML, 2018.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.

Ledig, C., Theis, L., Huszar, F., Caballero, J., Aitken, A. P., Tejani, A., Totz, J., Wang, Z., and Shi, W. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.

Liu, S., Wei, Y., Lu, J., and Zhou, J. An improved evaluation framework for generative adversarial networks. arXiv preprint arXiv:1803.07474, 2018.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In ICCV, 2015.

Mantuano, T. Discretization of compact Riemannian manifolds applied to the spectrum of Laplacian. Ann. Glob. Anal. Geom., 27:33–46, 2005.

Mantuano, T. Discretization of Riemannian manifolds applied to the Hodge Laplacian. Am. J. Math., 130(6):1477–1508, 2008.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. Generative adversarial text to image synthesis. In ICML, 2016.

Romano, J. P. and Siegel, A. Counterexamples in Probability and Statistics. Chapman and Hall/CRC, 1986.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In NeurIPS, 2016.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In CVPR, 2016.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

Supplementary Material for Topology Distance
A. Comparison of Sample Quality Correlation

An essential property of a metric for evaluating GANs is a good correlation with the quality of generated samples. We thus performed a comprehensive comparison of image-quality correlation among Fréchet Inception Distance (FID), Kernel Inception Distance (KID), Geometry Score (GS), Inception Score (IS) and our proposed Topology Distance (TD), on three datasets (Fashion-MNIST, CIFAR10 and CelebA). Specifically, for each dataset we first trained a WGAN-GP model for 200 epochs with a batch size of 64. We then computed the scores of each metric between real images and generated images every 4 epochs. Results were averaged over 10 groups, each of which consisted of 500 real images (randomly sampled from each original dataset) and corresponding generated images. The results are shown in Figure A. Compared with the other metrics, TD demonstrates better (vs. GS and IS) or comparable (vs. FID and KID) sample-quality correlation. It is worth noting that the advantages of our proposed TD over other metrics (as shown in our other experiments) make TD stand out as a robust and strong alternative for evaluating GANs.
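The group-averaged evaluation protocol used here (and throughout the experiments) can be sketched as follows; the function name, signature and dummy metric are our own illustration, not the authors' code:

```python
import random

def group_averaged_score(real, generated, metric, groups=10, group_size=500, seed=0):
    """Shuffle the real and generated image pools, split each into `groups`
    groups of `group_size`, score each pair of groups with `metric`
    (any callable taking two image lists, e.g. TD, FID or KID), and
    report the average score."""
    rng = random.Random(seed)
    real, generated = list(real), list(generated)
    rng.shuffle(real)
    rng.shuffle(generated)
    scores = [
        metric(real[g * group_size:(g + 1) * group_size],
               generated[g * group_size:(g + 1) * group_size])
        for g in range(groups)
    ]
    return sum(scores) / len(scores)

# Toy usage with a dummy metric on dummy scalar "images":
dummy_metric = lambda r, g: abs(sum(r) - sum(g)) / len(r)
print(group_averaged_score(range(5000), range(5000, 10000), dummy_metric))
```

Averaging over disjoint groups of a fixed size keeps the sample size per evaluation constant, so scores are comparable across epochs and metrics.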
[Figure A panels: one row per metric (FID, KID, GS, IS, TD) and one column per dataset (Fashion-MNIST, CIFAR10, CelebA), each plotting the metric against training epochs 0–200.]
Figure A. Comparison of sample quality correlation among FID, KID, GS, IS and our proposed TD, on Fashion-MNIST, CIFAR10, and CelebA. Results are averaged over 10 groups, each of which consists of 500 real images (randomly sampled from the original dataset) and generated images.