CNN-Generated Images Are Surprisingly Easy to Spot... for Now

Sheng-Yu Wang¹  Oliver Wang²  Richard Zhang²  Andrew Owens¹,³  Alexei A. Efros¹
¹UC Berkeley  ²Adobe Research  ³University of Michigan

[Figure 1: paired synthetic and real samples from ProGAN [19], StyleGAN [20], BigGAN [7], CycleGAN [48], StarGAN [10], GauGAN [29], CRN [9], IMLE [23], SITD [8], Super-res. [13], and Deepfakes [33].]
Figure 1: Are CNN-generated images hard to distinguish from real images? We show that a classifier trained to detect images generated by only one CNN (ProGAN, far left) can detect those generated by many other models (remaining columns). Our code and models are available at https://peterwang512.github.io/CNNDetection/.

Abstract

In this work we ask whether it is possible to create a "universal" detector for telling apart real images from those generated by a CNN, regardless of architecture or dataset used. To test this, we collect a dataset consisting of fake images generated by 11 different CNN-based image generator models, chosen to span the space of commonly used architectures today (ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, DeepFakes, cascaded refinement networks, implicit maximum likelihood estimation, second-order attention super-resolution, seeing-in-the-dark). We demonstrate that, with careful pre- and post-processing and data augmentation, a standard image classifier trained on only one specific CNN generator (ProGAN) is able to generalize surprisingly well to unseen architectures, datasets, and training methods (including the just-released StyleGAN2 [21]). Our findings suggest the intriguing possibility that today's CNN-generated images share some common systematic flaws, preventing them from achieving realistic image synthesis.

1. Introduction

Recent rapid advances in deep image synthesis techniques, such as Generative Adversarial Networks (GANs), have generated a huge amount of public interest and concern, as people worry that we are entering a world where it will be impossible to tell which images are real and which are fake [14]. This issue has started to play a significant role in global politics; in one case, a video of the president of Gabon that was claimed by the opposition to be fake was one factor leading to a failed coup d'état (https://www.motherjones.com/politics/2019/03/deepfake-gabon-ali-bongo/). Much of this concern has been directed at specific manipulation techniques, such as "deepfake"-style face replacement [2] and photorealistic synthetic humans [20]. However, these methods represent only two instances of a broader set of techniques: image synthesis via convolutional neural networks (CNNs). Our goal in this work is to find a general image forensics approach for detecting CNN-generated imagery.

Detecting whether an image was generated by a specific synthesis technique is relatively straightforward: just train a classifier on a dataset consisting of real images and images synthesized by the technique in question. However, such an approach will likely be tied to the dataset used in image generation (e.g., faces), and, due to dataset bias [35], might not generalize when tested on new data (e.g., cars). Even worse, the technique-specific detector is likely to soon become ineffective as generation methods evolve and the technique it was trained on becomes obsolete.

It is natural, therefore, to ask whether today's CNN-generated images contain common artifacts, e.g., some kind of detectable CNN fingerprints, that would allow a classifier to generalize to an entire family of generation methods, rather than a single one. Unfortunately, prior work has reported generalization to be a significant problem for image forensics approaches. For example, several recent works [44, 12, 37] observe that classifiers trained on images produced by one GAN architecture perform poorly when tested on others, and in many cases they also fail to generalize when only the dataset (and not the architecture or task) is changed [44]. This makes sense, as image generation methods are highly varied: they use different datasets, network architectures, loss functions, and image pre-processing.

In this paper, we show that, contrary to this current understanding, classifiers trained to detect CNN-generated images can exhibit a surprising amount of generalization ability across datasets, architectures, and tasks. We follow convention and train our classifiers in a straightforward manner: we generate a large number of fake images using a single CNN model (ProGAN, a high-performing unconditional GAN model [19]) and train a binary classifier to detect fakes, using the model's real training images as negative examples.
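For concreteness, the following is a minimal PyTorch sketch of this training setup, assuming an ImageNet-pretrained ResNet-50 backbone and a data/train/{real,fake} folder layout; the folder names and hyperparameters are illustrative, not a verbatim reproduction of our training code.

    # Minimal sketch: fine-tune a pretrained ResNet-50 as a binary
    # real-vs-fake classifier on images from a single generator (ProGAN).
    # Folder layout and hyperparameters below are illustrative.
    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.RandomCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    # ImageFolder assigns labels alphabetically: fake -> 0, real -> 1.
    train_set = datasets.ImageFolder("data/train", transform=transform)
    loader = torch.utils.data.DataLoader(train_set, batch_size=64,
                                         shuffle=True, num_workers=4)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, 1)  # single "fake" logit
    model = model.to(device)

    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for epoch in range(5):
        for images, labels in loader:
            images = images.to(device)
            # Treat "fake" (label 0) as the positive class.
            targets = (labels == 0).float().unsqueeze(1).to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()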
To evaluate our model, we create a new dataset of CNN-generated images, the ForenSynths dataset, consisting of synthesized images from 11 models, ranging from unconditional image generation methods, such as StyleGAN [20], to super-resolution methods [13] and deepfakes [33]. Each model is trained on a different image dataset appropriate for its specific task. We have also continued evaluating our detector on models that were released after this paper was originally written, finding that it works out-of-the-box on the very recent unconditional GAN, StyleGAN2 [21].

Underneath the apparent simplicity of this approach, we have found a number of subtle challenges, which we study through a set of experiments on a new dataset of trained image generation models. We find that data augmentation, in the form of common image post-processing operations, is critical for generalization, even when the target images are not post-processed themselves. We also find that the diversity of the training images matters: large datasets sampled from CNN synthesis methods lead to classifiers that outperform those trained on smaller datasets, up to a point. Finally, it is critical to examine the model's robustness to post-processing operations that often occur downstream of image creation (e.g., during storage and distribution). We show that when the correct steps are taken, classifiers are indeed robust to common operations such as JPEG compression, blurring, and resizing.
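Such augmentation can be as simple as randomly blurring and re-compressing each training image. The sketch below shows one way to do this with PIL; the 50% application probabilities, blur-radius range, and JPEG quality range are illustrative values rather than the exact settings from our experiments.

    # Minimal sketch: simulate downstream post-processing during training
    # by randomly applying Gaussian blur and JPEG re-compression.
    # Probabilities and parameter ranges are illustrative.
    import io
    import random
    from PIL import Image, ImageFilter

    def augment(img: Image.Image) -> Image.Image:
        # Gaussian blur with a random radius, applied half the time.
        if random.random() < 0.5:
            radius = random.uniform(0.0, 3.0)
            img = img.filter(ImageFilter.GaussianBlur(radius))
        # JPEG re-compression at a random quality, applied half the time.
        if random.random() < 0.5:
            buf = io.BytesIO()
            img.save(buf, format="JPEG", quality=random.randint(30, 100))
            buf.seek(0)
            img = Image.open(buf).convert("RGB")
        return img

The point is to apply this only at training time: the classifier sees blurred and re-compressed fakes during training, which helps it generalize even to test images that were never post-processed.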
In summary, our main contributions are: 1) we show that forensics models trained on CNN-generated images exhibit a surprising amount of generalization to other CNN synthesis methods; 2) we propose a new dataset and evaluation metric for detecting CNN-generated images; 3) we experimentally analyze the factors that account for cross-model generalization.

2. Related work

Detecting CNN-based manipulations. Several recent works have addressed the problem of detecting images generated by CNNs. Rössler et al. [33] evaluated methods for detecting face manipulation techniques, including CNN-based face and mouth replacement methods. While they showed that simple classifiers could detect fakes generated by the same model, they did not study generalization between models or datasets. Marra et al. [24] likewise showed that simple classifiers can detect images created by an image translation network [17], but did not consider cross-model transfer.

Recently, Cozzolino et al. [12] found that forensics classifiers transferred poorly between models, often obtaining near-chance performance. They propose a new representation learning method, based on autoencoders, to improve transfer performance in zero- and low-shot training regimes for a variety of generation methods. While their ultimate goal is similar to ours, they take an orthogonal approach: they focus on new learning methods for improving transfer learning, and apply them to a diverse assortment of models (including both CNN and non-CNN). In contrast, we empirically study the performance of simple "baseline" classifiers under different training and testing conditions for CNN-based image generation. Zhang et al. [44] find that classifiers generalize poorly between GAN models. They propose a method called AutoGAN for generating images that contain the upsampling artifacts common in GAN architectures, and test it on two types of GANs. Other work has proposed to detect GAN images using hand-crafted co-occurrence features [26], or with anomaly detection models built on pretrained face detectors [37]. Researchers have also proposed methods for identifying which of several known GANs generated a given image [25, 41].

Image forensics. Researchers have proposed a variety of methods for detecting more traditional manipulation techniques, such as those made by image editing tools. Early work focused on hand-crafted cues [14] such as compression artifacts [3], resampling [31], or physical scene constraints [27]. More recently, researchers have applied learning-based methods to these problems [45, 16, 11, 32, 38]. This line of work has found, like us, that simple, supervised classifiers are often effective at detecting manipulations [45, 38].

Artifacts from CNN-based generators. Researchers have recently shown that common CNN designs contain artifacts that reduce their representational power. Much of this work has focused on the way networks perform upsampling and downsampling. A well-known example is the checkerboard artifact produced by deconvolutional layers [28]. Azulay and Weiss [4] showed that convolutional networks ignore the classical sampling theorem and that strided convolutions therefore reduce translation invariance, and Zhang [43] improved translation invariance by reducing aliasing in these layers. Very recently, Bau et al. [5] suggested that GANs have limited generation capacity, and analyzed the image structures that a pretrained GAN is unable to produce.

3. A dataset of CNN-based generation models

To study the transferability of classifiers trained to detect CNN-generated images, we collected a dataset of images created from a variety of CNN models:

Family             Method                 Image Source            # Images
Unconditional GAN  ProGAN [19]            LSUN                    8.0k
                   StyleGAN [20]          LSUN                    12.0k
                   BigGAN [7]             ImageNet                4.0k
Conditional GAN    CycleGAN [48]          Style/object transfer   2.6k
                   StarGAN [10]           CelebA                  4.0k
                   GauGAN [29]            COCO                    10.0k
Perceptual loss    CRN [9]                GTA                     12.8k
                   IMLE [23]              GTA                     12.8k
Low-level vision   SITD [8]               Raw camera              360
                   SAN [13]               Standard SR benchmark   440
Deepfake           FaceForensics++ [33]   Videos of faces         5.4k
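As an illustration of the evaluation protocol, the sketch below scores a trained detector on one generator's real/fake test images using average precision (AP), a natural threshold-free ranking metric for this task. It assumes the model and transform defined in the training sketch above and a data/test/<model>/{real,fake} ImageFolder layout; the helper name and batch size are illustrative.

    # Minimal sketch: evaluate a trained detector on one generator's
    # test set with average precision (AP). Assumes the `model`, `device`,
    # and `transform` defined in the training sketch above.
    import torch
    from sklearn.metrics import average_precision_score
    from torchvision import datasets

    @torch.no_grad()
    def evaluate_ap(model, test_dir, transform, device):
        test_set = datasets.ImageFolder(test_dir, transform=transform)
        loader = torch.utils.data.DataLoader(test_set, batch_size=64)
        model.eval()
        scores, targets = [], []
        for images, labels in loader:
            logits = model(images.to(device)).squeeze(1)
            scores.extend(torch.sigmoid(logits).cpu().tolist())
            # "fake" is label 0 under ImageFolder's alphabetical ordering.
            targets.extend((labels == 0).long().tolist())
        return average_precision_score(targets, scores)

    # e.g., ap = evaluate_ap(model, "data/test/stylegan", transform, device)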
