Fair Generative Modeling via Weak Supervision

Kristy Choi 1 * Aditya Grover 1 * Trisha Singh 2 Rui Shu 1 Stefano Ermon 1

*Equal contribution. 1Department of Computer Science, Stanford University. 2Department of Statistics, Stanford University. Correspondence to: Kristy Choi.

Abstract

Real-world datasets are often biased with respect to key demographic factors such as race and gender. Due to the latent nature of the underlying factors, detecting and mitigating bias is especially challenging for unsupervised machine learning. We present a weakly supervised algorithm for overcoming dataset bias for deep generative models. Our approach requires access to an additional small, unlabeled reference dataset as the supervision signal, thus sidestepping the need for explicit labels on the underlying bias factors. Using this supplementary dataset, we detect the bias in existing datasets via a density ratio technique and learn generative models which efficiently achieve the twin goals of: 1) data efficiency by using training examples from both biased and reference datasets for learning; and 2) data generation close in distribution to the reference dataset at test time. Empirically, we demonstrate the efficacy of our approach which reduces bias w.r.t. latent factors by an average of up to 34.6% over baselines for comparable image generation using generative adversarial networks.

1. Introduction

Increasingly, many applications of machine learning (ML) involve data generation. Examples of such production-level systems include Transformer-based models such as BERT and GPT-3 for natural language generation (Vaswani et al., 2017; Devlin et al., 2018; Radford et al., 2019; Brown et al., 2020), WaveNet for text-to-speech synthesis (Oord et al., 2017), and a large number of creative applications such as Coconet used for designing the "first AI-powered Doodle" (Huang et al., 2017). As these generative applications become more prevalent, it becomes increasingly important to consider questions with regards to the potential discriminatory nature of such systems and ways to mitigate it (Podesta et al., 2014). For example, some natural language generation systems trained on internet-scale datasets have been shown to produce generations that are biased towards certain demographics (Sheng et al., 2019).

A variety of socio-technical factors contribute to the discriminatory nature of ML systems (Barocas et al., 2018). A major factor is the existence of biases in the training data itself (Torralba et al., 2011; Tommasi et al., 2017). Since data is the fuel of ML, any existing bias in the dataset can be propagated to the learned model (Barocas & Selbst, 2016). This is a particularly pressing concern for generative models, which can easily amplify the bias by generating more of the biased data at test time. Further, learning a generative model is fundamentally an unsupervised learning problem and hence, the bias factors of interest are typically latent. For example, while learning a generative model of human faces, we often do not have access to attributes such as gender, race, and age. Any existing bias in the dataset with respect to these attributes is easily picked up by deep generative models. See Figure 1 for an illustration.

In this work, we present a weakly-supervised approach to learning fair generative models in the presence of dataset bias. Our source of weak supervision is motivated by the observation that obtaining multiple unlabelled (biased) datasets is relatively cheap for many domains in the big data era. Among these data sources, we may wish to generate samples that are close in distribution to a particular target (reference) dataset.[1] As a concrete example of such a reference, organizations such as the World Bank and biotech firms (23&me, 2016; Hong, 2016) typically follow several good practices to ensure representativeness in the datasets that they collect, though such methods are unscalable to large sizes. We note that neither of our datasets needs to be labeled w.r.t. the latent bias attributes, and the size of the reference dataset can be much smaller than the biased dataset. Hence, the level of supervision we require is weak.

[1] We note that while there may not be a concept of a dataset devoid of bias, carefully designed representative data collection practices may be more accurately reflected in some data sources (Gebru et al., 2018) and can be considered as reference datasets.

Using a reference dataset to augment a biased dataset, our goal is to learn a generative model that best approximates the desired, reference data distribution. Simply using the reference dataset alone for learning is an option, but this may not suffice since this dataset can be too small to learn an expressive model that accurately captures the underlying reference data distribution. Our approach to learning a fair generative model that is robust to biases in the larger training set is based on importance reweighting. In particular, we learn a generative model which reweighs the data points in the biased dataset based on the ratio of densities assigned by the biased data distribution as compared to the reference data distribution. Since we do not have access to explicit densities assigned by either of the two distributions, we estimate the weights by using a probabilistic classifier (Sugiyama et al., 2012; Mohamed & Lakshminarayanan, 2016).

We test our weakly-supervised approach on learning generative adversarial networks on the CelebA dataset (Ziwei Liu & Tang, 2015). The dataset consists of attributes such as gender and hair color, which we use for designing biased and reference data splits and subsequent evaluation. We empirically demonstrate how the reweighting approach can offset dataset bias on a wide range of settings. In particular, we obtain improvements of up to 36.6% (49.3% for bias=0.9 and 23.9% for bias=0.8) for single-attribute dataset bias and 32.5% for multi-attribute dataset bias on average over baselines in reducing the bias with respect to the latent factors for comparable sample quality.

Figure 1. Samples from a baseline BigGAN that reflect the gender bias underlying the true data distribution in CelebA. All faces above the orange line (67%) are classified as female, while the rest are labeled as male (33%).

2. Problem Setup

2.1. Background

We assume there exists a true (unknown) data distribution pdata : X → R≥0 over a set of d observed variables x ∈ R^d. In generative modeling, our goal is to learn the parameters θ ∈ Θ of a distribution pθ : X → R≥0 over the observed variables x, such that the model distribution pθ is close to pdata. Depending on the choice of learning algorithm, different approaches have been previously considered. Broadly, these include adversarial training, e.g., GANs (Goodfellow et al., 2014), and maximum likelihood estimation (MLE), e.g., variational autoencoders (VAE) (Kingma & Welling, 2013; Rezende et al., 2014) and normalizing flows (Dinh et al., 2014), or hybrids (Grover et al., 2018). Our bias mitigation framework is agnostic to the above training approaches.

For generality, we consider expectation-based learning objectives, where ℓ(·) is a per-example loss that depends on both examples x drawn from a dataset D and the model parameters θ:

    E_{x∼pdata}[ℓ(x, θ)] ≈ (1/T) Σ_{i=1}^{T} ℓ(x_i, θ) := L(θ; D)    (1)

The above expression encompasses a broad class of MLE and adversarial objectives. For example, if ℓ(·) denotes the negative log-likelihood assigned to the point x as per pθ, then we recover the MLE training objective.

2.2. Dataset Bias

The standard assumption for learning a generative model is that we have access to a sufficiently large dataset Dref of training examples, where each x ∈ Dref is assumed to be sampled independently from a reference distribution pdata = pref. In practice however, collecting large datasets that are i.i.d. w.r.t. pref is difficult due to a variety of socio-technical factors. The sample complexity for learning high dimensional distributions can even be doubly-exponential in the dimensions in many cases (Arora et al., 2018), surpassing the size of the largest available datasets.

We can partially offset this difficulty by considering data from alternate sources related to the target distribution, e.g., images scraped from the Internet. However, these additional datapoints are not expected to be i.i.d. w.r.t. pref. We characterize this phenomenon as dataset bias, where we assume the availability of a dataset Dbias, such that the examples x ∈ Dbias are sampled independently from a biased (unknown) distribution pbias that is different from pref, but shares the same support.

2.3. Evaluation

Evaluating generative models and fairness in machine learning are both open areas of research. Our work is at the intersection of these two fields, and we propose the following metrics for measuring bias mitigation for data generation.

Sample Quality: We employ sample quality metrics, e.g., Frechet Inception Distance (FID) (Heusel et al., 2017) and Kernel Inception Distance (KID) (Li et al., 2017). These metrics match empirical expectations w.r.t. a reference data distribution p and a model distribution pθ in a predefined feature space, e.g., the prefinal layer of activations of the Inception Network (Szegedy et al., 2016). A lower score indicates that the learned model can better approximate pdata. For the fairness context in particular, we are interested in measuring the discrepancy w.r.t. pref even if the model has been trained to use both Dref and Dbias. We refer the reader to Supplement B.2 for more details on evaluation with FID.

Fairness: Alternatively, we can evaluate the bias of generative models specifically in the context of some sensitive latent variables, say u ∈ R^k. For example, u may correspond to the age and gender of an individual depicted via an image x. We emphasize that such attributes are unknown during training and used only for evaluation at test time.

If we have access to a highly accurate predictor p(u|x) for the distribution of the sensitive attributes u conditioned on the observed x, we can evaluate the extent of bias mitigation via the discrepancies in the expected marginal likelihoods of u as per pref and pθ. Formally, we define the fairness discrepancy f for a generative model pθ w.r.t. pref and sensitive attributes u:

    f(pref, pθ) = |E_{pref}[p(u|x)] − E_{pθ}[p(u|x)]|_2.    (2)

In practice, the expectations in Eq. 2 can be computed via Monte Carlo averaging. The lower the discrepancy between the above two expectations, the better the learned model's ability to mitigate dataset bias w.r.t. the sensitive attributes u. We refer the reader to Supplement E for more details on the fairness discrepancy metric.
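To make the evaluation protocol concrete, the following is a minimal sketch of the Monte Carlo estimate of Eq. 2 in PyTorch. The function name, the pretrained attribute classifier `attr_clf`, and the data-handling conventions are our own assumptions rather than the released implementation.

```python
import torch

def fairness_discrepancy(attr_clf, ref_batches, sample_fn, n_gen=10000, batch=100):
    """Monte Carlo estimate of Eq. 2: || E_pref[p(u|x)] - E_ptheta[p(u|x)] ||_2.

    attr_clf    : pretrained attribute classifier returning logits over u given x
    ref_batches : iterable of image batches from the reference dataset (p_ref)
    sample_fn   : callable returning `batch` generated images (p_theta)
    """
    with torch.no_grad():
        # Average attribute probabilities under the reference distribution.
        ref_probs = torch.cat([attr_clf(x).softmax(dim=-1) for x in ref_batches]).mean(dim=0)
        # Average attribute probabilities under the model distribution.
        gen_probs = torch.cat([attr_clf(sample_fn(batch)).softmax(dim=-1)
                               for _ in range(n_gen // batch)]).mean(dim=0)
    return torch.norm(ref_probs - gen_probs, p=2).item()
```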
3. Bias Mitigation

We assume a learning setting where we are given access to a data source Dbias in addition to a dataset of training examples Dref. Our goal is to capitalize on both data sources Dbias and Dref for learning a model pθ that best approximates the target distribution pref.

3.1. Baselines

We begin by discussing two baseline approaches at the extreme ends of the spectrum. First, one could completely ignore Dbias and consider learning pθ based on Dref alone. Since we only consider proper losses w.r.t. pref, global optimization of the objective in Eq. 1 in a well-specified model family will recover the true data distribution as |Dref| → ∞. However, since Dref is finite in practice, this is likely to give poor sample quality even though the fairness discrepancy would be low.

On the other extreme, we can consider learning pθ based on the full dataset consisting of both Dref and Dbias. This procedure will be data efficient and could lead to high sample quality, but it comes at the cost of fairness since the learned distribution will be heavily biased w.r.t. pref.

3.2. Solution 1: Conditional Modeling

Our first proposal is to learn a generative model conditioned on the identity of the dataset used during training. Formally, we learn a generative model pθ(x|y) where y ∈ {0, 1} is a binary random variable indicating whether the model distribution was learned to approximate the data distribution corresponding to Dref (i.e., pref) or Dbias (i.e., pbias). By sharing model parameters θ across the two values of y, we hope to leverage both data sources. At test time, conditioning on y for Dref should result in fair generations.

As we demonstrate in Section 4 however, this simple approach does not achieve the intended effect in practice. The likely cause is that the conditioning information is too weak for the model to infer the bias factors and effectively distinguish between the two distributions. Next, we present an alternate two-phased approach based on density ratio estimation which effectively overcomes the dataset bias in a data-efficient manner.

3.3. Solution 2: Importance Reweighting

Recall a trivial baseline in Section 3.1 which learns a generative model on the union of Dbias and Dref. This method is problematic because it assigns equal weight to the loss contributions from each individual datapoint in our dataset in Eq. 1, regardless of whether the datapoint comes from Dbias or Dref. For example, in situations where the dataset bias causes a minority group to be underrepresented, this objective will encourage the model to focus on the majority group such that the overall value of the loss is minimized on average with respect to a biased empirical distribution, i.e., a weighted mixture of pbias and pref with weights proportional to |Dbias| and |Dref|.

Our key idea is to reweight the datapoints from Dbias during training such that the model learns to downweight over-represented data points from Dbias while simultaneously upweighting the under-represented points from Dref. The challenge in the unsupervised context is that we do not have direct supervision on which points are over- or under-represented and by how much. To resolve this issue, we consider importance sampling (Horvitz & Thompson, 1952). Whenever we are given data from two distributions, w.l.o.g. say p and q, and wish to evaluate a sample average w.r.t. p given samples from q, we can do so by reweighting the samples from q by the ratio of densities assigned to the sampled points by p and q. In our setting, the distributions of interest are pref and pbias, respectively. Hence, an importance weighted objective for learning from Dbias is:

    E_{x∼pref}[ℓ(x, θ)] = E_{x∼pbias}[(pref(x)/pbias(x)) ℓ(x, θ)]    (3)
                        ≈ (1/T) Σ_{i=1}^{T} w(x_i) ℓ(x_i, θ) := L(θ; Dbias)    (4)

where w(x_i) := pref(x_i)/pbias(x_i) is defined to be the importance weight for x_i ∼ pbias.
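As a small illustration of Eq. 4, the weighted empirical loss is just the per-example loss scaled by the estimated density ratios; a minimal sketch (names are ours):

```python
import torch

def importance_weighted_loss(per_example_loss, weights):
    # Eq. 4: (1/T) * sum_i w(x_i) * loss(x_i, theta), for x_i drawn from D_bias.
    # `weights` holds estimates of w(x_i) = p_ref(x_i) / p_bias(x_i).
    return (weights * per_example_loss).mean()
```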

Estimating density ratios via binary classification. To estimate the importance weights, we use a binary classifier as described below (Sugiyama et al., 2012). Consider a binary classification problem with classes Y ∈ {0, 1} with training data generated as follows. First, we fix a prior probability for p(Y = 1). Then, we repeatedly sample y ∼ p(Y). If y = 1, we independently sample a datapoint x ∼ pref, else we sample x ∼ pbias. Then, as shown in Friedman et al. (2001), the ratio of densities pref and pbias assigned to an arbitrary point x can be recovered via a Bayes optimal (probabilistic) classifier c* : X → [0, 1]:

    w(x) = pref(x)/pbias(x) = γ c*(Y = 1|x) / (1 − c*(Y = 1|x))    (5)

where c(Y = 1|x) is the probability assigned by the classifier to the point x belonging to class Y = 1. Here, γ = p(Y = 0)/p(Y = 1) is the ratio of marginals of the labels for the two classes.

In practice, we do not have direct access to either pbias or pref and hence, our training data consists of points sampled from the empirical data distributions defined uniformly over Dref and Dbias. Further, we may not be able to learn a Bayes optimal classifier and denote the importance weights estimated by the learned classifier c for a point x as ŵ(x).

Our overall procedure is summarized in Algorithm 1. We use deep neural networks for parameterizing the binary classifier and the generative model. Given a biased and reference dataset along with the network architectures and other standard hyperparameters (e.g., learning rate, optimizer etc.), we first learn a probabilistic binary classifier (Line 2). The learned classifier can provide importance weights for the datapoints from Dbias via estimates of the density ratios (Line 3). For the datapoints from Dref, we do not need to perform any reweighting and set the importance weights to 1 (Line 4). Using the combined dataset Dbias ∪ Dref, we then learn the generative model pθ where the minibatch loss for every gradient update weights the contributions from each datapoint (Lines 6-12).

Algorithm 1 Learning Fair Generative Models
  Input: Dbias, Dref, Classifier and Generative Model Architectures & Hyperparameters
  Output: Generative Model Parameters θ
  1:  # Phase 1: Estimate importance weights
  2:  Learn binary classifier c for distinguishing (Dbias, Y = 0) vs. (Dref, Y = 1)
  3:  Estimate importance weight ŵ(x) ← c(Y = 1|x)/c(Y = 0|x) for all x ∈ Dbias (using Eq. 5)
  4:  Set importance weight ŵ(x) ← 1 for all x ∈ Dref
  5:  # Phase 2: Minibatch gradient descent on θ based on weighted loss
  6:  Initialize model parameters θ at random
  7:  Set full dataset D ← Dbias ∪ Dref
  8:  while training do
  9:      Sample a batch of points B from D at random
  10:     Set loss L(θ; D) ← (1/|B|) Σ_{x_i ∈ B} ŵ(x_i) ℓ(x_i, θ)
  11:     Estimate gradients ∇θ L(θ; D) and update parameters θ based on optimizer update rule
  12: end while
  13: return θ

For a practical implementation, it is best to account for some diagnostics and best practices while executing Algorithm 1. For density ratio estimation, we test that the classifier is calibrated on a held out set. This is a necessary (but insufficient) check for the estimated density ratios to be meaningful. If the classifier is miscalibrated, we can apply standard recalibration techniques such as Platt scaling before estimating the importance weights. Furthermore, while optimizing the model using a weighted objective, there can be an increased variance across the loss contributions from each example in a minibatch due to importance weighting. We did not observe this in our experiments, but techniques such as normalization of weights within a batch can potentially help control the unintended variance introduced within a batch (Sugiyama et al., 2012).
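The Phase 1 computation in Lines 3-4 is straightforward to express in code. The sketch below assumes a classifier with a two-way softmax head over {Y = 0: biased, Y = 1: reference} and, as in our implementation notes (Supplement C.2), that γ is absorbed by training the classifier on class-balanced minibatches; function names are illustrative, not the released API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def estimate_importance_weights(clf, x_bias):
    # Algorithm 1, Line 3: w_hat(x) = c(Y=1|x) / c(Y=0|x) for x in D_bias.
    probs = F.softmax(clf(x_bias), dim=-1)
    return probs[:, 1] / probs[:, 0].clamp_min(1e-8)

# Algorithm 1, Line 4: reference datapoints keep unit weight, e.g.
#   weights = torch.cat([estimate_importance_weights(clf, x_bias),
#                        torch.ones(len(x_ref))])
```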

Figure 2. Distribution of importance weights for different latent subgroups under (a) single, bias=0.9, (b) single, bias=0.8, and (c) multi settings. On average, the underrepresented subgroups are upweighted while the overrepresented subgroups are downweighted.

Theoretical Analysis. The performance of Algorithm 1 critically depends on the quality of estimated density ratios, which in turn is dictated by the training of the binary classifier. We define the expected negative cross-entropy (NCE) objective for a classifier c as:

    NCE(c) := (1/(γ + 1)) E_{pref(x)}[log c(Y = 1|x)] + (γ/(γ + 1)) E_{pbias(x)}[log c(Y = 0|x)].    (6)

In the following result, we characterize the NCE loss for the Bayes optimal classifier.

Theorem 1. Let Z denote a set of unobserved bias variables. Suppose there exist two joint distributions pbias(x, z) and pref(x, z) over x ∈ X and z ∈ Z. Let pbias(x) and pbias(z) denote the marginals over x and z for the joint pbias(x, z), and similarly for the joint pref(x, z):

    pbias(x|z = k) = pref(x|z = k)  ∀k    (7)

and pbias(x|z = k), pbias(x|z = k') have disjoint supports for k ≠ k'. Then, the negative cross-entropy of the Bayes optimal classifier c* is given as:

    NCE(c*) = (1/(γ + 1)) E_{pref(z)}[log (1/(γ b(z) + 1))] + (γ/(γ + 1)) E_{pbias(z)}[log (γ b(z)/(γ b(z) + 1))]    (8)

where b(z) = pbias(z)/pref(z).

Proof. See Supplement A.

For example, as we shall see in our experiments in the following section, the inputs x can correspond to face images, whereas the unobserved z represents sensitive bias factors for a subgroup such as gender or ethnicity. The proportion of examples x belonging to a subgroup can differ across the biased and reference datasets, with the relative proportions given by b(z). Note that the above result only requires knowing these relative proportions and not the true z for each x. The practical implication is that under the assumptions of Theorem 1, we can check the quality of density ratios estimated by an arbitrary learned classifier c by comparing its empirical NCE with the theoretical NCE of the Bayes optimal classifier in Eq. 8 (see Section 4.1).
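As a rough illustration of that check, the Bayes optimal value in Eq. 8 can be computed directly from the (known) subgroup proportions of the two splits. The helper below is our own sketch for discrete bias variables, not code from the paper.

```python
import math

def bayes_optimal_nce(p_ref_z, p_bias_z, gamma=1.0):
    """Eq. 8 for discrete bias variables z.

    p_ref_z, p_bias_z : dicts mapping each subgroup z to its probability under
                        p_ref and p_bias respectively
    gamma             : p(Y=0)/p(Y=1), i.e. the ratio of biased to reference data
    """
    term_ref, term_bias = 0.0, 0.0
    for z, p_r in p_ref_z.items():
        b = p_bias_z[z] / p_r                      # b(z) = p_bias(z) / p_ref(z)
        term_ref += p_r * math.log(1.0 / (gamma * b + 1.0))
        term_bias += p_bias_z[z] * math.log(gamma * b / (gamma * b + 1.0))
    return term_ref / (gamma + 1.0) + gamma * term_bias / (gamma + 1.0)

# e.g. a single-attribute split with bias=0.9 and a 50-50 reference set:
# bayes_optimal_nce({"female": 0.5, "male": 0.5}, {"female": 0.9, "male": 0.1}, gamma)
```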

4. Empirical Evaluation

In this section, we are interested in investigating two broad questions empirically:

1. How well can we estimate density ratios for the proposed weak supervision setting?
2. How effective is the reweighting technique for learning fair generative models on the fairness discrepancy metric proposed in Section 2.3?

We further demonstrate the usefulness of our generated data in downstream applications, such as learning a fair classifier, in Supplement F.3.

Dataset. We consider the CelebA (Ziwei Liu & Tang, 2015) dataset, which is commonly used for benchmarking deep generative models and comprises images of faces with 40 labeled binary attributes. We use this attribute information to construct 3 different settings for partitioning the full dataset into Dbias and Dref.

• Setting 1 (single, bias=0.9): We set z to be a single bias variable corresponding to "gender" with values 0 (female) and 1 (male) and b(z = 0) = 0.9. Specifically, this means that Dref contains the same fraction of male and female images whereas Dbias contains a 0.9 fraction of females and the rest as males.
• Setting 2 (single, bias=0.8): We use the same bias variable (gender) as Setting 1 with b(z = 0) = 0.8.
• Setting 3 (multi): We set z as two bias variables corresponding to "gender" and "black hair". In total, we have 4 subgroups: females without black hair (00), females with black hair (01), males without black hair (10), and males with black hair (11). We set b(z = 00) = 0.437, b(z = 01) = 0.063, b(z = 10) = 0.415, b(z = 11) = 0.085.

We emphasize that the attribute information is used only for designing controlled biased and reference datasets and faithful evaluation. Our algorithm does not explicitly require such labeled information. Additional information on constructing the dataset splits can be found in Supplement B.1.

Figure 3. Single-Attribute Dataset Bias Mitigation for bias=0.9. (a) Samples generated via importance reweighting with subgroups separated by the orange line. For the 100 samples above, the classifier concludes 52 females and 48 males. (b) Fairness discrepancy and (c) FID. Lower discrepancy and FID is better. Standard error in (b) and (c) over 10 independent evaluation sets of 10,000 samples each drawn from the models. We find that on average, imp-weight outperforms the equi-weight baseline by 49.3% and the conditional baseline by 25.0% across all reference dataset sizes for bias mitigation.

Models. We train two classifiers for our experiments: (1) the attribute (e.g. gender) classifier, which we use to assess the level of bias present in our final samples; and (2) the density ratio classifier. For both models, we use a variant of ResNet-18 (He et al., 2016) on the standard train and validation splits of CelebA. For the generative model, we used a BigGAN (Brock et al., 2018) trained to minimize the hinge loss (Lim & Ye, 2017; Tran et al., 2017) objective. Additional details regarding the architectural design and hyperparameters are in Supplement C.

4.1. Density Ratio Estimation via Classifier

For each of the three experimental settings, we can evaluate the quality of the estimated density ratios by comparing empirical estimates of the cross-entropy loss of the density ratio classifier with the cross-entropy loss of the Bayes optimal classifier derived in Eq. 8. We show the results in Table 1 for perc=1.0, where we find that the two losses are very close, suggesting that we obtain high-quality density ratio estimates that we can use for subsequently training fair generative models. In Supplement D, we show a more fine-grained analysis of the 0-1 accuracies and calibration of the learned models.

Model               Bayes optimal    Empirical
single, bias=0.9    0.591            0.605
single, bias=0.8    0.642            0.650
multi               0.619            0.654

Table 1. Comparison between the cross-entropy loss of the Bayes optimal classifier and the learned density ratio classifier.

In Figure 2, we show the distribution of our importance weights for the various latent subgroups. We find that across all the considered settings, the underrepresented subgroups (e.g., males in Figures 2(a) and 2(b), females with black hair in 2(c)) are upweighted on average (mean density ratio estimate > 1), while the overrepresented subgroups are downweighted on average (mean density ratio estimate < 1). Also, as expected, the density ratio estimates are closer to 1 when the bias is low (see Figure 2(a) vs. 2(b)).

4.2. Fair Data Generation

We compare our importance weighted approach against three baselines: (1) equi-weight: a BigGAN trained on the full dataset Dref ∪ Dbias that weighs every point equally; (2) reference-only: a BigGAN trained on the reference dataset Dref; and (3) conditional: a conditional BigGAN where the conditioning label indicates whether a data point x is from Dref (y = 1) or Dbias (y = 0). In all our experiments, the reference-only variant, which only uses the reference dataset Dref for learning, failed to give any recognizable samples. For a clean presentation of the results due to the other methods, we hence ignore this baseline in the results below and defer the reader to the supplementary material for further results.

Figure 4. Multi-Attribute Dataset Bias Mitigation. (a) Samples generated via importance reweighting. For the 100 samples above, the classifier concludes 37 females and 20 males without black hair, and 22 females and 21 males with black hair. (b) Fairness discrepancy and (c) FID as a function of the ratio of unbiased to biased dataset sizes (0.1, 0.25, 0.5, 1.0) for imp-weight, equi-weight, and conditional. Standard error in (b) and (c) over 10 independent evaluation sets of 10,000 samples each drawn from the models. Lower discrepancy and FID is better. We find that on average, imp-weight outperforms the equi-weight baseline by 32.5% and the conditional baseline by 4.4% across all reference dataset sizes for bias mitigation.

We also vary the size of the balanced dataset Dref relative to the unbalanced dataset size |Dbias|: perc = {0.1, 0.25, 0.5, 1.0}. Here, perc = 0.1 denotes |Dref| = 10% of |Dbias| and perc = 1.0 denotes |Dref| = |Dbias|.

4.2.1. SINGLE ATTRIBUTE SPLITS

We train our attribute (gender) classifier for evaluation on the entire CelebA training set, and achieve a level of 98% accuracy on the held-out set. For each experimental setting, we evaluate bias mitigation based on the fairness discrepancy metric (Eq. 2) and also report sample quality based on FID (Heusel et al., 2017).

For the bias = 0.9 split, we show the samples generated via imp-weight in Figure 3a and the resulting fairness discrepancies in Figure 3b. Our framework generates samples that are slightly lower quality than the equi-weight baseline samples shown in Figure 1, but it is able to produce an almost identical proportion of samples across the two genders. Similar observations hold for bias = 0.8, as shown in Figure 8 in the supplement. We refer the reader to Supplement F.4 for corresponding results and analysis, as well as for additional results on the Shapes3D dataset (Burgess & Kim, 2018).

4.2.2. MULTI-ATTRIBUTE SPLIT

We conduct a similar experiment with a multi-attribute split based on gender and the presence of black hair. The attribute classifier for the purpose of evaluation is now trained with a 4-way classification task instead of 2, and achieves an accuracy of roughly 88% on the test set.

Our model produces samples as shown in Figure 4a with the discrepancy metrics shown in Figures 4b and 4c respectively. Even in this challenging setup involving two latent bias factors, we find that the importance weighted approach again outperforms the baselines in almost all cases in mitigating bias in the generated data while admitting only a slight deterioration in image quality overall.

5. Related Work

Fairness & generative modeling. There is a rich body of work in fair ML, which focuses on different notions of fairness (e.g. demographic parity, equality of odds and opportunity) and studies methods by which models can perform tasks such as classification in a non-discriminatory way (Barocas et al., 2018; Dwork et al., 2012; Heidari et al., 2018; du Pin Calmon et al., 2018). Our focus is in the context of fair generative modeling. The vast majority of related work in this area is centered around fair and/or privacy preserving representation learning, which exploits tools from adversarial learning and information theory among others (Zemel et al., 2013; Edwards & Storkey, 2015; Louizos et al., 2015; Beutel et al., 2017; Song et al., 2018; Adel et al., 2019). A unifying principle among these methods is that a discriminator is trained to perform poorly in predicting an outcome based on a protected attribute. Ryu et al. (2017) consider race and gender identities as a form of weak supervision for predicting other attributes on datasets of faces. While the end goal for the above works is classification, our focus is on data generation in the presence of dataset bias and we do not require explicit supervision for the protected attributes.

The most relevant prior works in data generation are FairGAN (Xu et al., 2018) and FairnessGAN (Sattigeri et al., 2019). The goal of both methods is to generate fair datapoints and their labels as a preprocessing technique. This allows for learning a useful downstream classifier and obscures information about protected attributes. Again, these works are not directly comparable to ours as we do not assume explicit supervision regarding the protected attributes during training, and our goal is fair generation given unlabelled biased datasets where the bias factors are latent. Another relevant work is DB-VAE (Amini et al., 2019), which utilizes a VAE to learn the latent structure of sensitive attributes, and in turn employs importance weighting based on this structure to mitigate bias in downstream classifiers. Contrary to our work, these importance weights are used to directly sample (rare) data points with higher frequencies with the goal of training a classifier (e.g. as in a facial detection system), as opposed to fair generation.

Importance reweighting. Reweighting datapoints is a common algorithmic technique for problems such as dataset bias and class imbalance (Byrd & Lipton, 2018). It has often been used in the context of fair classification (Calders et al., 2009); for example, Kamiran & Calders (2012) detail reweighting as a way to remove discrimination without relabeling instances. For reinforcement learning, Doroudi et al. (2017) used an importance sampling approach for selecting fair policies. There is also a body of work on fair clustering (Chierichetti et al., 2017; Backurs et al., 2019; Bera et al., 2019; Schmidt et al., 2018) which ensures that the clustering assignments are balanced with respect to some sensitive attribute.

Density ratio estimation using classifiers. The use of classifiers for estimating density ratios has a rich history of prior works across ML (Sugiyama et al., 2012). For deep generative modeling, density ratios estimated by classifiers have been used for expanding the class of various learning objectives (Nowozin et al., 2016; Mohamed & Lakshminarayanan, 2016; Grover & Ermon, 2018), evaluation metrics based on two-sample tests (Gretton et al., 2007; Bowman et al., 2015; Lopez-Paz & Oquab, 2016; Danihelka et al., 2017; Rosca et al., 2017; Im et al., 2018; Gulrajani et al., 2018), or improved Monte Carlo inference via these models (Grover et al., 2019; Azadi et al., 2018; Turner et al., 2018; Tao et al., 2018). Grover et al. (2019) use importance reweighting for mitigating model bias between pdata and pθ. Closest related is the proposal of Diesendruck et al. (2018) to use importance reweighting for learning generative models where training and test distributions differ, but explicit importance weights are provided for at least a subset of the training examples. We consider a more realistic, weakly-supervised setting where we estimate the importance weights using a small reference dataset. Finally, another related line of work in domain translation via generation considers learning via multiple datasets (Zhu et al., 2017; Choi & Jang, 2018; Grover et al., 2020), and it would be interesting to consider issues due to dataset bias in those settings in future work.

6. Discussion

Our work presents an initial foray into the field of fair image generation with weak supervision, and we stress the need for caution in using our techniques and interpreting the empirical findings. For scaling our evaluation, we proposed metrics that relied on a pretrained attribute classifier for inferring the bias in the generated data samples. The classifiers we considered are highly accurate on all subgroups, but can have blind spots especially when evaluated on generated data. For future work, we would like to investigate conducting human evaluations to mitigate such issues during evaluation (Grgic-Hlaca et al., 2018).

As another case in point, our work calls for rethinking sample quality metrics for generative models in the presence of dataset bias (Mitchell et al., 2019). On one hand, our approach increases the diversity of generated samples in the sense that the different subgroups are more balanced; at the same time, however, variation across other image features decreases because the newly generated underrepresented samples are learned from a smaller dataset of underrepresented subgroups. Moreover, standard metrics such as FID, even when evaluated with respect to a reference dataset, could exhibit a relative preference for models trained on larger datasets with little or no bias correction to avoid even slight compromises on perceptual sample quality.

More broadly, this work is yet another reminder that we must be mindful of the decisions made at each stage in the development and deployment of ML systems (Abebe et al., 2020). Factors such as the dataset used for training (Gebru et al., 2018; Sheng et al., 2019; Jo & Gebru, 2020) or algorithmic decisions such as the loss function or evaluation metric (Hardt et al., 2016; Buolamwini & Gebru, 2018; Kim et al., 2018; Liu et al., 2018; Hashimoto et al., 2018), among others, may have undesirable consequences. Becoming more aware of these downstream impacts will help to mitigate the potentially discriminatory nature of our present-day systems (Kaeser-Chen et al., 2020).

7. Conclusion

We considered the task of fair data generation given access to a (potentially small) reference dataset and a large biased dataset. For data-efficient learning, we proposed an importance weighted objective that corrects bias by reweighting the biased datapoints. These weights are estimated by a binary classifier. Empirically, we showed that our technique outperforms baselines by up to 34.6% on average in reducing dataset bias on CelebA without incurring a significant reduction in sample quality. We provide reference implementations in PyTorch (Paszke et al., 2017), and the codebase for this work is open-sourced at https://github.com/ermongroup/fairgen.

In the future, it would be interesting to explore whether even weaker forms of supervision would be possible for this task, e.g., when the biased dataset has a somewhat disjoint but related support from the small, reference dataset – this would be highly reflective of the diverse data sources used for training many current and upcoming large-scale ML systems (Ratner et al., 2017).

Acknowledgements

We are thankful to Hima Lakkaraju, Daniel Levy, Mike Wu, Chris Cundy, and Jiaming Song for insightful discussions and feedback. KC is supported by the NSF GRFP, Qualcomm Innovation Fellowship, and Stanford Graduate Fellowship, and AG is supported by the MSR Ph.D. fellowship, Stanford Data Science scholarship, and Lieberman fellowship. This research was funded by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024), and Amazon AWS.

References

23&me. The real issue: Diversity in genetics research. Retrieved from https://blog.23andme.com/ancestry/the-real-issue-diversity-in-genetics-research/, 2016.

Abebe, R., Barocas, S., Kleinberg, J., Levy, K., Raghavan, M., and Robinson, D. G. Roles for computing in social change. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 252–260, 2020.

Adel, T., Valera, I., Ghahramani, Z., and Weller, A. One-network adversarial fairness. 2019.

Amini, A., Soleimany, A. P., Schwarting, W., Bhatia, S. N., and Rus, D. Uncovering and mitigating algorithmic bias through learned latent structure. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 289–295, 2019.

Arora, S., Risteski, A., and Zhang, Y. Do gans learn the distribution? some theory and empirics. In International Conference on Learning Representations, 2018.

Azadi, S., Olsson, C., Darrell, T., Goodfellow, I., and Odena, A. Discriminator rejection sampling. arXiv preprint arXiv:1810.06758, 2018.

Backurs, A., Indyk, P., Onak, K., Schieber, B., Vakilian, A., and Wagner, T. Scalable fair clustering. arXiv preprint arXiv:1902.03519, 2019.

Barocas, S. and Selbst, A. D. Big data's disparate impact. Calif. L. Rev., 104:671, 2016.

Barocas, S., Hardt, M., and Narayanan, A. Fairness and Machine Learning. fairmlbook.org, 2018. http://www.fairmlbook.org.

Bera, S. K., Chakrabarty, D., and Negahbani, M. Fair algorithms for clustering. arXiv preprint arXiv:1901.02393, 2019.

Beutel, A., Chen, J., Zhao, Z., and Chi, E. H. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075, 2017.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. 2020.

Buolamwini, J. and Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pp. 77–91, 2018.

Burgess, C. and Kim, H. 3d shapes dataset. https://github.com/deepmind/3dshapes-dataset/, 2018.

du Pin Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N., and Varshney, K. R. Data pre-processing for discrimination prevention: Information-theoretic optimization and analysis. IEEE Journal of Selected Topics in Signal Processing, 12(5):1106–1119, 2018.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214–226. ACM, 2012.

Edwards, H. and Storkey, A. Censoring representations with an adversary. arXiv preprint arXiv:1511.05897, 2015.

Friedman, J., Hastie, T., and Tibshirani, R. The elements of statistical learning, volume 1. Springer series in statistics, New York, NY, USA, 2001.

Byrd, J. and Lipton, Z. C. What is the effect of importance weighting in deep learning? arXiv preprint arXiv:1812.03372, 2018.

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.

Calders, T., Kamiran, F., and Pechenizkiy, M. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pp. 13–18. IEEE, 2009.

Chierichetti, F., Kumar, R., Lattanzi, S., and Vassilvitskii, S. Fair clustering through fairlets. In Advances in Neural Information Processing Systems, pp. 5029–5037, 2017.

Choi, H. and Jang, E. Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392, 2018.

Danihelka, I., Lakshminarayanan, B., Uria, B., Wierstra, D., and Dayan, P. Comparison of maximum likelihood and gan-based training of real nvps. arXiv preprint arXiv:1705.05263, 2017.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Diesendruck, M., Elenberg, E. R., Sen, R., Cole, G. W., Shakkottai, S., and Williamson, S. A. Importance weighted generative networks. arXiv preprint arXiv:1806.02512, 2018.

Dinh, L., Krueger, D., and Bengio, Y. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Doroudi, S., Thomas, P. S., and Brunskill, E. Importance sampling for fair policy selection. Grantee Submission, 2017.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. J. A kernel method for the two-sample-problem. In Advances in neural information processing systems, pp. 513–520, 2007.

Grgic-Hlaca, N., Redmiles, E. M., Gummadi, K. P., and Weller, A. Human perceptions of fairness in algorithmic decision making: A case study of criminal risk prediction. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pp. 903–912, 2018.

Grover, A. and Ermon, S. Boosted generative models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Grover, A., Dhar, M., and Ermon, S. Flow-gan: Combining maximum likelihood and adversarial learning in generative models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Grover, A., Song, J., Agarwal, A., Tran, K., Kapoor, A., Horvitz, E., and Ermon, S. Bias correction of learned generative models using likelihood-free importance weighting. In NeurIPS, 2019.

Grover, A., Chute, C., Shu, R., Cao, Z., and Ermon, S. Alignflow: Cycle consistent learning from multiple domains via normalizing flows. In AAAI, pp. 4028–4035, 2020.

Gulrajani, I., Raffel, C., and Metz, L. Towards gan benchmarks which require generalization. 2018.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323, 2016.

Hashimoto, T. B., Srivastava, M., Namkoong, H., and Liang, P. Fairness without demographics in repeated loss minimization. arXiv preprint arXiv:1806.08010, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

Heidari, H., Ferrari, C., Gummadi, K., and Krause, A. Fairness behind a veil of ignorance: A welfare analysis for automated decision making. In Advances in Neural Information Processing Systems, pp. 1265–1276, 2018.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.

Hong, E. 23andme has a problem when it comes to ancestry reports for people of color. Quartz. Retrieved from https://qz.com/765879/23andme-has-a-race-problem-when-it-comes-to-ancestryreports-for-non-whites, 2016.

Horvitz, D. G. and Thompson, D. J. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 1952.

Huang, C.-Z. A., Cooijmans, T., Roberts, A., Courville, A., and Eck, D. Counterpoint by convolution. ISMIR, 2017.

Im, D. J., Ma, H., Taylor, G., and Branson, K. Quantitatively evaluating gans with divergences proposed for training. arXiv preprint arXiv:1803.01045, 2018.

Jo, E. S. and Gebru, T. Lessons from archives: strategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 306–316, 2020.

Kaeser-Chen, C., Dubois, E., Schüür, F., and Moss, E. Positionality-aware machine learning: translation tutorial. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 704–704, 2020.

Kamiran, F. and Calders, T. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.

Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pp. 2668–2677, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Poczos, B. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pp. 2203–2213, 2017.

Lim, J. H. and Ye, J. C. Geometric gan. arXiv preprint arXiv:1705.02894, 2017.

Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383, 2018.

Lopez-Paz, D. and Oquab, M. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.

Louizos, C., Swersky, K., Li, Y., Welling, M., and Zemel, R. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.

Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., and Gebru, T. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pp. 220–229, 2019.

Mohamed, S. and Lakshminarayanan, B. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

Nowozin, S., Cseke, B., and Tomioka, R. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.

Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2642–2651. JMLR.org, 2017.

Oord, A. v. d., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G. v. d., Lockhart, E., Cobo, L. C., Stimberg, F., et al. Parallel wavenet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.

Podesta, J., Pritzker, P., Moniz, E., Holdren, J., and Zients, J. Big data: seizing opportunities, preserving values. Executive Office of the President, The White House, 2014.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., and Ré, C. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269–282, 2017.

Tran, D., Ranganath, R., and Blei, D. Hierarchical implicit models and likelihood-free variational inference. In Advances in Neural Information Processing Systems, pp. 5523–5533, 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Turner, R., Hung, J., Saatci, Y., and Yosinski, J. Metropolis-hastings generative adversarial networks. arXiv preprint arXiv:1811.11357, 2018.

Rosca, M., Lakshminarayanan, B., Warde-Farley, D., and Mohamed, S. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.

Ryu, H. J., Adam, H., and Mitchell, M. Inclusivefacenet: Improving face attribute detection with race and gender diversity. arXiv preprint arXiv:1712.00193, 2017.

Sattigeri, P., Hoffman, S. C., Chenthamarakshan, V., and Varshney, K. R. Fairness gan: Generating datasets with fairness properties using a generative adversarial network. In Proc. ICLR Workshop Safe Mach. Learn, volume 2, 2019.

Schmidt, M., Schwiegelshohn, C., and Sohler, C. Fair coresets and streaming algorithms for fair k-means clustering. arXiv preprint arXiv:1812.10854, 2018.

Sheng, E., Chang, K.-W., Natarajan, P., and Peng, N. The woman worked as a babysitter: On biases in language generation. arXiv preprint arXiv:1909.01326, 2019.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.

Xu, D., Yuan, S., Zhang, L., and Wu, X. Fairgan: Fairness-aware generative adversarial networks. In 2018 IEEE International Conference on Big Data (Big Data), pp. 570–575. IEEE, 2018.

Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In International Conference on Machine Learning, pp. 325–333, 2013.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.

Ziwei Liu, Ping Luo, X. W. and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

Song, J., Kalluri, P., Grover, A., Zhao, S., and Ermon, S. Learning controllable fair representations. arXiv preprint arXiv:1812.04218, 2018.

Sugiyama, M., Suzuki, T., and Kanamori, T. Density ratio estimation in machine learning. Cambridge University Press, 2012.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vi- sion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.

Tao, C., Chen, L., Henao, R., Feng, J., and Duke, L. C. Chi- square generative adversarial network. In International Conference on Machine Learning, pp. 4894–4903, 2018.

Tommasi, T., Patricia, N., Caputo, B., and Tuytelaars, T. A deeper look at dataset bias. In Domain Adaptation in Computer Vision Applications, pp. 37–55. Springer, 2017.

Torralba, A., Efros, A. A., et al. Unbiased look at dataset bias. In CVPR, volume 1, pp. 7. Citeseer, 2011.

Fair Generative Modeling via Weak Supervision: Supplementary Material

A. Proof of Theorem 1

Proof. Since pbias(x|z = k) and pbias(x|z = k') have disjoint supports for k ≠ k', we know that for all x, there exists a deterministic mapping f : X → Z such that pbias(x|z = f(x)) > 0. Further, for all x̃ ∉ f⁻¹(z):

    pbias(x̃|z = f(x)) = 0;    (9)
    pref(x̃|z = f(x)) = 0.     (10)

Combining Eqs. 9, 10 above with the assumption in Eq. 7, we can simplify the density ratio as:

    pbias(x)/pref(x) = ∫_z pbias(x|z) pbias(z) dz / ∫_z pref(x|z) pref(z) dz    (11)
                     = pbias(x|f(x)) pbias(f(x)) / (pref(x|f(x)) pref(f(x)))    (using Eqs. 9, 10)    (12)
                     = pbias(f(x)) / pref(f(x))    (using Eq. 7)    (13)
                     = b(f(x)).    (14)

From Eq. 5 and Eq. 11, the Bayes optimal classifier c* can hence be expressed as:

    c*(Y = 1|x) = 1 / (γ b(f(x)) + 1).    (15)

The optimal cross-entropy loss of a binary classifier c for density ratio estimation (DRE) can then be expressed as:

    NCE(c*) = (1/(γ + 1)) E_{pref(x)}[log c*(Y = 1|x)] + (γ/(γ + 1)) E_{pbias(x)}[log c*(Y = 0|x)]    (16)
            = (1/(γ + 1)) E_{pref(x)}[log (1/(γ b(f(x)) + 1))] + (γ/(γ + 1)) E_{pbias(x)}[log (γ b(f(x))/(γ b(f(x)) + 1))]    (using Eq. 15)    (17)
            = (1/(γ + 1)) E_{pref(z)} E_{pref(x|z)}[log (1/(γ b(f(x)) + 1))] + (γ/(γ + 1)) E_{pbias(z)} E_{pbias(x|z)}[log (γ b(f(x))/(γ b(f(x)) + 1))]    (18)
            = (1/(γ + 1)) E_{pref(z)}[log (1/(γ b(z) + 1))] + (γ/(γ + 1)) E_{pbias(z)}[log (γ b(z)/(γ b(z) + 1))]    (using Eqs. 9, 10).    (19)

B. Dataset Details

B.1. Dataset Construction Procedure

We construct such dataset splits from the full CelebA training set using the following procedure. We initially fix our dataset size to be roughly 135K out of the total 162K based on the total number of females present in the data. Then for each level of bias, we partition 1/4 of males and 1/4 of females into Dref to achieve the 50-50 ratio. The remaining examples are used for Dbias, where the numbers of males and females are adjusted to match the desired level of bias (e.g. 0.9). Finally, at each level of reference dataset size perc, we discard the appropriate fraction of datapoints from both the male and female categories in Dref. For example, for perc = 0.5, we discard half the number of females and half the number of males from Dref.
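For readers who want to build a comparable split, the following is a rough sketch of this single-attribute recipe; the helper name, the omission of the initial ~135K cap, and the per-gender bookkeeping are our own simplifications rather than the released construction code.

```python
import numpy as np

def single_attribute_split(gender, bias=0.9, perc=1.0, seed=0):
    # `gender` is a 0/1 array (0 = female, 1 = male) over the CelebA training set.
    rng = np.random.default_rng(seed)
    females = rng.permutation(np.flatnonzero(gender == 0))
    males = rng.permutation(np.flatnonzero(gender == 1))

    # D_ref: equal numbers of females and males (roughly 1/4 of each gender),
    # subsampled per gender to a fraction `perc` of its full size.
    n_ref = min(len(females), len(males)) // 4
    keep = int(n_ref * perc)
    d_ref = np.concatenate([females[:keep], males[:keep]])

    # D_bias: remaining examples, with the female fraction set to `bias`.
    rest_f, rest_m = females[n_ref:], males[n_ref:]
    n_bias_m = int(len(rest_f) * (1.0 - bias) / bias)
    d_bias = np.concatenate([rest_f, rest_m[:n_bias_m]])
    return d_ref, d_bias
```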

B.2. FID Calculation

As noted in Sections 2.3 and 6, the FID metric may exhibit a relative preference for models trained on larger datasets in order to maximize perceptual sample quality, at the expense of propagating or amplifying existing dataset bias. In order to obtain an estimate of sample quality that would also incorporate a notion of fairness across sensitive attribute classes, we pre-computed the relevant FID statistics on a "balanced" construction of the CelebA dataset that matches our reference dataset pref. That is, we used all train/validation/test splits of the data such that: (1) for single-attribute, there were 50-50 portions of males and females; and (2) for multi-attribute, there were even proportions of examples across all 4 classes (females with black hair, females without black hair, males with black hair, males without black hair). We report "balanced" FID numbers on these pre-computed statistics throughout the paper.

C. Architecture and Hyperparameter Configurations

We used PyTorch (Paszke et al., 2017) for all our experiments. Our overall experimental framework involved three different kinds of models, which we describe below.

C.1. Attribute Classifier

We use the same architecture and hyperparameters for both the single- and multi-attribute classifiers. Both are variants of ResNet-18 where the output number of classes corresponds to the dataset split (e.g. 2 classes for single-attribute, 4 classes for the multi-attribute experiment).

Architecture. We provide the architectural details in Table 2 below:

Name               Component
conv1              7 × 7 conv, 64 filters, stride 2
Residual Block 1   3 × 3 max pool, stride 2
Residual Block 2   [3 × 3 conv, 128 filters; 3 × 3 conv, 128 filters] × 2
Residual Block 3   [3 × 3 conv, 256 filters; 3 × 3 conv, 256 filters] × 2
Residual Block 4   [3 × 3 conv, 512 filters; 3 × 3 conv, 512 filters] × 2
Output Layer       7 × 7 average pool stride 1, fully-connected, softmax

Table 2. ResNet-18 architecture adapted for attribute classifier.

Hyperparameters. During training, we use a batch size of 64 and the Adam optimizer with learning rate = 0.001. The classifiers learn relatively quickly for both scenarios and we only needed to train for 10 epochs. We used early stopping with the validation set in CelebA to determine the best model to use for downstream evaluation.
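A minimal way to instantiate this configuration, using torchvision's ResNet-18 as a stand-in for the variant in Table 2, is sketched below; treat it as illustrative rather than the released training script.

```python
import torch
from torchvision.models import resnet18

def build_attribute_classifier(num_classes, lr=0.001):
    # Output layer sized to the split: 2 classes (single-attribute) or 4 (multi-attribute).
    model = resnet18(num_classes=num_classes)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    return model, optimizer

# Training then proceeds with batch size 64 for ~10 epochs, with early stopping
# selected on the CelebA validation split.
```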

C.2. Density Ratio Classifier

Architecture. We provide the architectural details in Table 3.

Name               Component
conv1              7 × 7 conv, 64 filters, stride 2
Residual Block 1   3 × 3 max pool, stride 2
Residual Block 2   [3 × 3 conv, 128 filters; 3 × 3 conv, 128 filters] × 2
Residual Block 3   [3 × 3 conv, 256 filters; 3 × 3 conv, 256 filters] × 2
Residual Block 4   [3 × 3 conv, 512 filters; 3 × 3 conv, 512 filters] × 2
Output Layer       7 × 7 average pool stride 1, fully-connected, softmax

Table 3. ResNet-18 architecture adapted for the density ratio classifier.

Hyperparameters. We also use a batch size of 64, the Adam optimizer with learning rate = 0.0001, and a total of 15 epochs to train the density ratio estimate classifier.

Experimental Details. We note a few steps we had to take during the training and validation procedure. Because of the imbalance in both (a) unbalanced/balanced dataset sizes and (b) gender ratios, we found that a naive training procedure encouraged the classifier to predict all data points as belonging to the biased, unbalanced dataset. To prevent this phenomenon from occurring, two minor modifications were necessary:

1. We balance the distribution between the two datasets in each minibatch: that is, we ensure that the classifier sees equal numbers of data points from the balanced (y = 1) and unbalanced (y = 0) datasets in each batch. This provides enough signal for the classifier to learn meaningful density ratios, as opposed to a trivial mapping of all points to the larger dataset.

2. We apply a similar balancing technique when testing against the validation set. However, instead of balancing the minibatch, we weight the contribution of the losses from the balanced and unbalanced datasets. Specifically, the loss is computed as

L = (1/2) (acc_pos / n_pos + acc_neg / n_neg),

where the subscript pos denotes examples from the balanced dataset (y = 1) and neg denotes examples from the unbalanced dataset (y = 0).

A code sketch of both modifications is given below.
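The sketch below illustrates both modifications under some assumptions: `clf` is the classifier from Table 3, each dataset yields (image, label) pairs, and the loader setup and helper names are illustrative rather than our exact implementation.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, RandomSampler

def train_density_ratio_classifier(clf, ref_dataset, bias_dataset,
                                   batch_size=64, lr=1e-4, epochs=15):
    """Train y=1 (reference) vs. y=0 (biased) with balanced minibatches."""
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    half = batch_size // 2
    # Sample the (smaller) reference set with replacement so that every minibatch
    # contains equal numbers of reference and biased examples (modification 1).
    ref_loader = DataLoader(ref_dataset, batch_size=half,
                            sampler=RandomSampler(ref_dataset, replacement=True))
    bias_loader = DataLoader(bias_dataset, batch_size=half, shuffle=True)
    for _ in range(epochs):
        for (x_ref, _), (x_bias, _) in zip(ref_loader, bias_loader):
            x = torch.cat([x_ref, x_bias], dim=0)
            y = torch.cat([torch.ones(len(x_ref)), torch.zeros(len(x_bias))]).long()
            loss = F.cross_entropy(clf(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf

def balanced_val_score(acc_pos, n_pos, acc_neg, n_neg):
    """Validation criterion weighting both datasets equally (modification 2)."""
    return 0.5 * (acc_pos / n_pos + acc_neg / n_neg)
```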

C.3. BigGAN

Architecture. The architectural details for the BigGAN are provided in Table 4.

Generator                           | Discriminator
1 × 1 × 2ch Noise                   | 64 × 64 × 3 Image
Linear 1 × 1 × 16ch → 1 × 1 × 16ch  | ResBlock down 1ch → 2ch
ResBlock up 16ch → 16ch             | Non-Local Block (64 × 64)
ResBlock up 16ch → 8ch              | ResBlock down 2ch → 4ch
ResBlock up 8ch → 4ch               | ResBlock down 4ch → 8ch
ResBlock up 4ch → 2ch               | ResBlock down 8ch → 16ch
Non-Local Block (64 × 64)           | ResBlock down 16ch → 16ch
ResBlock up 4ch → 2ch               | ResBlock 16ch → 16ch
BatchNorm, ReLU, 3 × 3 Conv 1ch → 3 | ReLU, Global sum pooling
Tanh                                | Linear → 1

Table 4. Architecture for the generator and discriminator. Notation: ch refers to the channel width multiplier, which is 64 for 64 × 64 CelebA images. ResBlock up refers to a Generator Residual Block in which the input is passed through a ReLU activation followed by two 3 × 3 convolutional layers with a ReLU activation in between. ResBlock down refers to a Discriminator Residual Block in which the input is passed through two 3 × 3 convolution layers with a ReLU activation in between, and then downsampled. Upsampling is performed via nearest neighbor interpolation, whereas downsampling is performed via mean pooling. “ResBlock up/down n → m” indicates a ResBlock with n input channels and m output channels.

Hyperparameters. We sweep over batch sizes in {16, 32, 64, 128} and use the Adam optimizer with learning rate = 0.0002, β1 = 0, and β2 = 0.99. We train the model by taking 4 discriminator gradient steps per generator step. Because the BigGAN was originally designed for scaling up class-conditional image generation, we fix all conditioning labels for the unconditional baselines (imp-weight, equi-weight) to the zero vector. Additionally, we investigate the role of flattening in the density ratios used to train the generative model. As in (Grover et al., 2019), flattening the density ratios via a power-scaling parameter α ≥ 0 is defined as:

E_{x ∼ p_ref}[ℓ(x, θ)] ≈ (1/T) ∑_{i=1}^{T} w(x_i)^α ℓ(x_i, θ),  where x_i ∼ p_bias.

We perform a hyperparameter sweep over α ∈ {0.5, 1.0, 1.5}, while noting that α = 0 is equivalent to the equi-weight baseline (no reweighting).
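For reference, a minimal sketch of how such flattened weights might enter a per-example loss is given below; `clf_logits` are outputs of the density ratio classifier from Section C.2, `per_example_loss` stands for whichever loss term ℓ(x_i, θ) is being reweighted, and any dataset-size correction factor is omitted for simplicity. The helper names are illustrative.

```python
import torch

def flattened_weights(clf_logits, alpha=1.0, eps=1e-6):
    """w(x) = c(x) / (1 - c(x)), where c(x) = p(y=1|x) is the (calibrated)
    probability that x came from D_ref, flattened by the exponent alpha."""
    c = torch.softmax(clf_logits, dim=1)[:, 1]
    w = c / (1.0 - c).clamp_min(eps)
    return w.pow(alpha)

def reweighted_loss(per_example_loss, clf_logits, alpha=1.0):
    """Monte Carlo estimate of E_{x ~ p_ref}[loss(x)] using samples x_i ~ p_bias."""
    w = flattened_weights(clf_logits, alpha).detach()
    return (w * per_example_loss).mean()
```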

D. Density Ratio Classifier Analysis

Figure 5. Calibration curves of the density ratio classifiers. Panels: (a) bias=0.9, perc=1.0; (b) bias=0.8, perc=1.0; (c) multi, perc=1.0; (d) bias=0.9, perc=0.5; (e) bias=0.8, perc=0.5; (f) multi, perc=0.5; (g) bias=0.9, perc=0.25; (h) bias=0.8, perc=0.25; (i) multi, perc=0.25; (j) bias=0.9, perc=0.1; (k) bias=0.8, perc=0.1; (l) multi, perc=0.1.

In Figure 5, we show the calibration curves for the density ratio classifiers for each of the Dref dataset sizes across all levels of bias. As evident from the plots, most classifiers are already well calibrated and did not require any post-training recalibration.
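The curves in Figure 5 are standard reliability diagrams. A sketch of how such a curve can be computed is below; it uses scikit-learn purely for illustration (not necessarily what we used), with `y_true` denoting the dataset labels (1 for Dref) and `p_ref` the classifier's predicted probabilities.

```python
import numpy as np
from sklearn.calibration import calibration_curve

def reliability_diagram(y_true, p_ref, n_bins=10):
    """Compare predicted P(y=1|x) against empirical frequencies per bin;
    a well-calibrated classifier lies close to the diagonal."""
    frac_pos, mean_pred = calibration_curve(y_true, p_ref, n_bins=n_bins)
    max_gap = float(np.max(np.abs(frac_pos - mean_pred)))  # crude miscalibration summary
    return frac_pos, mean_pred, max_gap
```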

E. Fairness Discrepancy Metric

In this section, we motivate the fairness discrepancy metric and elaborate upon its construction. Recall from Equation 2 that the metric is as follows for the sensitive attributes u:

f(p_ref, p_θ) = | E_{p_ref}[p(u|x)] − E_{p_θ}[p(u|x)] |_2.

To gain further insight into what the metric is capturing, we rewrite the joint distribution of the sensitive attributes u and our data x: (1) p_ref(u, x) = p(u|x) p_ref(x) and (2) p_θ(u, x) = p(u|x) p_θ(x). Then, marginalizing out x and looking only at the distribution of u, we get p(u) = ∫ p(u, x) dx = ∫ p(u|x) p(x) dx = E_{p(x)}[p(u|x)]. Thus the fairness discrepancy metric is |p_ref(u) − p_θ(u)|_2.

Figure 6. (a) The biased (non-uniform weighted mixture, shown in blue) and reference (equi-weighted Gaussian mixture, shown in red) distributions. (b) After the optimal density ratios are estimated using a two-layer MLP, the estimated density ratios are extremely similar to the ratios output by the Bayes optimal classifier, as desired.

This derivation is informative because it relates the fairness discrepancy metric to the behavior of the (oracle) attribute classifier. Suppose we use a deterministic classifier p(u|x) as in the paper: that is, we threshold at 0.5 and label all examples with p(u|x) > 0.5 as u = 1 (e.g. male) and those with p(u|x) ≤ 0.5 as u = 0 (e.g. female). In this setting, the fairness discrepancy metric is simply the ℓ2 distance between the proportions of the different populations in the true (reference) dataset and in the generated examples. If we instead use a probabilistic classifier (without thresholding), we can obtain analogous distributional discrepancies between the true (reference) data distribution and the distribution learned by p_θ, such as the empirical KL divergence.
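A minimal estimator of this quantity is sketched below, assuming `attr_clf` is the trained attribute classifier from Section C.1 applied to a batch of reference examples and a batch of generated samples; any normalization used in the main text is ignored, and the names are illustrative.

```python
import torch

@torch.no_grad()
def fairness_discrepancy(attr_clf, x_ref, x_gen):
    """|| E_{p_ref}[p(u|x)] - E_{p_theta}[p(u|x)] ||_2, estimated from samples."""
    p_ref = torch.softmax(attr_clf(x_ref), dim=1).mean(dim=0)  # average class probs on real data
    p_gen = torch.softmax(attr_clf(x_gen), dim=1).mean(dim=0)  # average class probs on generations
    return torch.norm(p_ref - p_gen, p=2)
```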

F. Additional Results

F.1. Toy Example with Gaussian Mixture Models

We demonstrate the benefits of our reweighting technique on a toy Gaussian mixture model example. In Figure 6(a), the reference distribution is shown in blue and the biased distribution in red: the blue distribution is an equi-weighted mixture of 2 Gaussians (reference), while the red distribution is a non-uniform weighted mixture of 2 Gaussians (biased), with weights 0.9 and 0.1 on the two components. We trained a two-layer multi-layer perceptron (MLP) with tanh activations to estimate density ratios from 1000 samples drawn from each of the two distributions. We then compare the Bayes optimal and estimated density ratios in Figure 6(b), and observe that the estimated density ratios closely trace those output by the Bayes optimal classifier.
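A self-contained version of this toy experiment might look as follows. The component means, widths, network sizes, and training schedule below are illustrative assumptions rather than the exact configuration used for Figure 6.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

rng = np.random.default_rng(0)
n = 1000
means, scale = np.array([-2.0, 2.0]), 0.5   # illustrative component locations/width

# Reference: equi-weighted mixture; biased: weights (0.9, 0.1) on the same components.
x_ref = rng.normal(means[rng.integers(0, 2, n)], scale)[:, None]
x_bias = rng.normal(means[(rng.random(n) < 0.1).astype(int)], scale)[:, None]

x = torch.tensor(np.concatenate([x_ref, x_bias]), dtype=torch.float32)
y = torch.cat([torch.ones(n), torch.zeros(n)]).long()   # y=1: reference, y=0: biased

# Small MLP with tanh activations (layer widths are illustrative).
clf = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 2))
opt = torch.optim.Adam(clf.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = F.cross_entropy(clf(x), y)
    loss.backward()
    opt.step()

with torch.no_grad():
    c = torch.softmax(clf(x), dim=1)[:, 1]
est_ratio = c / (1 - c)                      # estimated p_ref(x) / p_bias(x)

def mixture_density(z, weights):
    """Unnormalized Gaussian mixture density; the constant cancels in the ratio."""
    comp = torch.exp(-0.5 * ((z - torch.tensor(means, dtype=torch.float32)) / scale) ** 2)
    return (comp * torch.tensor(weights)).sum(dim=1)

bayes_ratio = mixture_density(x, [0.5, 0.5]) / mixture_density(x, [0.9, 0.1])
```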

F.2. Shapes3D Dataset

For this experiment, we used the Shapes3D dataset (Burgess & Kim, 2018), which is comprised of 480,000 images of shapes with six underlying attributes. We chose a random attribute (floor color), restricted it to two possible instantiations (red vs. blue), and applied Algorithm 1 from the main text with bias=0.9 for this setting. Training on the large biased dataset (containing an excess of red floors) induces an average fairness discrepancy of 0.468, as shown in Figure 7(a). In contrast, applying the importance-weighting correction to the large biased dataset enabled us to train models that yielded an average fairness discrepancy of 0.002, as shown in Figure 7(b).

F.3. Downstream Classification Task

We note that although it is difficult to directly compare our model to supervised baselines such as FairGAN (Xu et al., 2018) and FairnessGAN (Sattigeri et al., 2019) due to the unsupervised nature of our work, we conduct further evaluations on a relevant downstream classification task, adapted to a fairness setting.

Figure 7. Results on the Shapes3D dataset: (a) baseline samples, (b) samples after reweighting. After restricting the possible floor colors to red or blue and using a biased dataset with bias=0.9, we find that the samples obtained after importance reweighting (b) are considerably more balanced than those without reweighting (a), as desired.

In this task, we augment a biased dataset (165K examples) with a "fair" dataset (135K examples) generated by a pre-trained GAN, use the combined data to train a classifier, and then evaluate the classifier's performance on a held-out dataset of real examples. We train a conditional GAN using the AC-GAN objective (Odena et al., 2017), where the conditioning is on an arbitrary downstream attribute of interest (e.g., we consider the attractiveness attribute of CelebA as in (Sattigeri et al., 2019)). Our goal is to learn a classifier that predicts the attribute of interest in a way that is fair with respect to gender, the sensitive attribute.

As an evaluation metric, we use the demographic parity distance (∆_dp), defined as the absolute difference in demographic parity between two classifiers f and g:

∆_dp = |f_dp − g_dp|.
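Under one natural reading of this definition, with f_dp denoting the demographic-parity gap of classifier f with respect to the sensitive attribute, ∆_dp could be computed as in the sketch below; `y_pred` (binary predictions) and `u` (sensitive attribute labels) are hypothetical names introduced for illustration.

```python
import numpy as np

def dp_gap(y_pred, u):
    """Demographic-parity gap of one classifier: absolute difference in
    positive-prediction rates between the two sensitive groups (gender here)."""
    y_pred, u = np.asarray(y_pred), np.asarray(u)
    return abs(y_pred[u == 1].mean() - y_pred[u == 0].mean())

def dp_distance(y_pred_f, y_pred_g, u):
    """Delta_dp = |f_dp - g_dp| between two classifiers on the same data."""
    return abs(dp_gap(y_pred_f, u) - dp_gap(y_pred_g, u))
```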

We consider 2 AC-GAN variants: (1) equi-weight, trained on Dbias ∪ Dref; and (2) imp-weight, which reweights the loss by the density ratio estimates. For both AC-GAN variants, the classifier is trained on both real and generated images, with the labels for the generated images given by the attractiveness values they were conditioned on. The classifier is then asked to predict attractiveness on the CelebA test set. As shown in Table 5, we find that the classifier trained on real data plus synthetic data generated by our imp-weight AC-GAN achieves a much lower ∆_dp than the equi-weight baseline, demonstrating that our method achieves higher demographic parity with respect to the sensitive attribute, despite the fact that we did not explicitly use sensitive attribute labels during training.

Model                                      | Accuracy | NLL    | ∆_dp
Baseline classifier, no data augmentation  | 79%      | 0.7964 | 0.038
equi-weight                                | 79%      | 0.7902 | 0.032
imp-weight (ours)                          | 75%      | 0.7564 | 0.002

Table 5. Classifier accuracy, negative log-likelihood (NLL), and ∆_dp on the downstream classification task for the CelebA dataset with bias=0.9 and perc=1.0. Our importance-weighting method learns a classifier that achieves a lower ∆_dp, as desired, albeit with a slight reduction in accuracy.

F.4. Single-Attribute Experiment

The results for the single-attribute split with bias=0.8 are shown in Figure 8.

(a) Samples generated via importance reweighting. Faces above the orange line are classified as female (55/100), while the rest are classified as male.

(b) Fairness Discrepancy and (c) FID, plotted against the ratio of unbiased to biased dataset sizes (0.1, 0.25, 0.5, 1.0) for the imp-weight, equi-weight, and conditional models.

Figure 8. Single-attribute dataset bias mitigation for bias=0.8. Standard error in (b) and (c) is computed over 10 independent evaluation sets of 10,000 samples each, drawn from the models. Lower fairness discrepancy and FID are better. We find that, on average, imp-weight outperforms the equi-weight baseline by 23.9% and the conditional baseline by 12.2% across all reference dataset sizes for bias mitigation.

G. Additional Generated Samples

Additional samples for the other experimental configurations are displayed in the following pages.

(a) equi-weight

(b) conditional

(c) imp-weight

Figure 9. Additional samples of bias=0.9, across different methods. All samples shown are from the scenario where |Dref| = |Dbias|.

(a) equi-weight

(b) conditional

(c) imp-weight

Figure 10. Additional samples of bias=0.8, across different methods. All samples shown are from the scenario where |Dref| = |Dbias|.

(a) equi-weight

(b) conditional

(c) imp-weight

Figure 11. Additional samples of the multi-attribute experiment, across different methods. All samples shown are from the scenario where |Dref | = |Dbias|.