<<

1

Deep Learning for Creation and Detection: A Survey Thanh Thi Nguyen, Quoc Viet Hung Nguyen, Cuong M. Nguyen, Dung Nguyen, Duc Thanh Nguyen, Saeid Nahavandi, Fellow, IEEE

Abstract— has been successfully applied I.INTRODUCTION to solve various complex problems ranging from big data analytics to and human-level control. Deep In a narrow definition, deepfakes (stemming from learning advances however have also been employed to cre- “deep learning” and “fake”) are created by techniques ate software that can cause threats to privacy, democracy that can superimpose face images of a target person onto and national security. One of those deep learning-powered a video of a source person to make a video of the target applications recently emerged is . Deepfake al- gorithms can create fake images and videos that humans person doing or saying things the source person does. cannot distinguish them from authentic ones. The proposal This constitutes a category of deepfakes, namely face- of technologies that can automatically detect and assess the swap. In a broader definition, deepfakes are artificial integrity of digital visual media is therefore indispensable. intelligence-synthesized content that can also fall into This paper presents a survey of used to create two other categories, i.e., lip-sync and puppet-master. deepfakes and, more importantly, methods proposed to Lip-sync deepfakes refer to videos that are modified to detect deepfakes in the literature to date. We present make the mouth movements consistent with an audio extensive discussions on challenges, research trends and recording. Puppet-master deepfakes include videos of a directions related to deepfake technologies. By reviewing target person (puppet) who is animated following the the background of deepfakes and state-of-the-art deepfake detection methods, this study provides a comprehensive facial expressions, eye and head movements of another overview of deepfake techniques and facilitates the devel- person (master) sitting in front of a camera [1]. opment of new and more robust methods to deal with the While some deepfakes can be created by traditional increasingly challenging deepfakes. visual effects or computer-graphics approaches, the re- Impact Statement—This survey provides a timely cent common underlying mechanism for deepfake cre- overview of deepfake creation and detection methods ation is deep learning models such as and presents a broad discussion of challenges, potential and generative adversarial networks, which have been trends, and future directions. We conduct the survey applied widely in the computer vision domain [2]–[4]. with a different perspective and taxonomy compared to existing survey papers on the same topic. Informative These models are used to examine facial expressions graphics are provided for guiding readers through the and movements of a person and synthesize facial images latest development in deepfake research. The methods of another person making analogous expressions and surveyed are comprehensive and will be valuable to the movements [5]. Deepfake methods normally require a

arXiv:1909.11573v3 [cs.CV] 26 Apr 2021 artificial intelligence community in tackling the current large amount of image and video data to train models challenges of deepfakes. to create photo-realistic images and videos. As public figures such as celebrities and politicians may have a Keywords: deepfakes, face manipulation, artificial intelli- large number of videos and images available online, they gence, deep learning, autoencoders, GAN, forensics, survey. are initial targets of deepfakes. Deepfakes were used to swap faces of celebrities or politicians to bodies in porn T. T. Nguyen and D. T. Nguyen are with the School of Information Technology, Deakin University, Victoria, Australia. images and videos. The first deepfake video emerged in Q. V. H. Nguyen is with School of Information and Communication 2017 where face of a celebrity was swapped to the face Technology, Griffith University, Queensland, Australia. of a porn actor. It is threatening to world security when C. M. Nguyen is with LAMIH UMR CNRS 8201, Universite´ Polytechnique Hauts-de-France, 59313 Valenciennes, France. deepfake methods can be employed to create videos D. Nguyen is with Faculty of Information Technology, Monash of world leaders with fake speeches for falsification University, Victoria, Australia. purposes [6]–[8]. Deepfakes therefore can be abused to S. Nahavandi is with the Institute for Intelligent Systems Research cause political or religion tensions between countries, to and Innovation, Deakin University, Victoria, Australia. Corresponding e-mail: [email protected] (T. T. fool public and affect results in election campaigns, or Nguyen). create chaos in financial markets by creating 2

[9]–[11]. It can be even used to generate fake satellite Deepfake Detection Challenge to catalyse more research images of the Earth to contain objects that do not really and development in detecting and preventing deepfakes exist to confuse military analysts, e.g., creating a fake from being used to mislead viewers [25]. Data obtained bridge across a river although there is no such a bridge from https://app.dimensions.ai at the end of 2020 show in reality. This can mislead a troop who have been guided that the number of deepfake papers has increased signif- to cross the bridge in a battle [12], [13]. icantly in recent years (Fig. 1). Although the obtained As the democratization of creating realistic digital numbers of deepfake papers may be lower than actual humans has positive implications, there is also positive numbers but the research trend of this topic is obviously use of deepfakes such as their applications in visual increasing. effects, digital avatars, snapchat filters, creating voices of those who have lost theirs or updating episodes of movies without reshooting them [14]. However, the number of malicious uses of deepfakes largely dominates that of the positive ones. The development of advanced deep neural networks and the availability of large amount of data have made the forged images and videos almost indistinguishable to humans and even to sophisticated computer algorithms. The process of creating those ma- nipulated images and videos is also much simpler today as it needs as little as an identity photo or a short video of a target individual. Less and less effort is required to produce a stunningly convincing tempered footage. Recent advances can even create a deepfake with just a still image [15]. Deepfakes therefore can be a threat Fig. 1. Number of papers related to deepfakes in years from 2016 to 2020, obtained from https://app.dimensions.ai at the end of 2020 affecting not only public figures but also ordinary people. with the search keyword “deepfake” applied to full text of scholarly For example, a voice deepfake was used to scam a CEO papers. The number of such papers in 2018, 2019 and 2020 are 64, out of $243,000 [16]. A recent release of a software 368 and 1268, respectively. called DeepNude shows more disturbing threats as it can transform a person to a non-consensual porn [17]. This paper presents a survey of methods for creating Likewise, the Chinese app Zao has gone viral lately as well as detecting deepfakes. There have been existing as less-skilled users can swap their faces onto bodies survey papers about this topic in [26]–[28], we however of movie stars and insert themselves into well-known carry out the survey with different perspective and taxon- movies and TV clips [18]. These forms of falsification omy. In Section II, we present the principles of deepfake create a huge threat to violation of privacy and identity, algorithms and how deep learning has been used to and affect many aspects of human lives. enable such disruptive technologies. Section III reviews Finding the truth in digital domain therefore has different methods for detecting deepfakes as well as their become increasingly critical. It is even more challeng- advantages and disadvantages. We discuss challenges, ing when dealing with deepfakes as they are majorly research trends and directions on deepfake detection and used to serve malicious purposes and almost anyone multimedia forensics problems in Section IV. can create deepfakes these days using existing deepfake tools. 
Thus far, there have been numerous methods II.DEEPFAKE CREATION proposed to detect deepfakes [19]–[23]. Most of them Deepfakes have become popular due to the quality are based on deep learning, and thus a battle between of tampered videos and also the easy-to-use ability of malicious and positive uses of deep learning methods their applications to a wide range of users with various has been arising. To address the threat of face-swapping computer skills from professional to novice. These ap- technology or deepfakes, the United States Defense plications are mostly developed based on deep learning Advanced Research Projects Agency (DARPA) initiated techniques. Deep learning is well known for its capability a research scheme in media forensics (named Media of representing complex and high-dimensional data. One Forensics or MediFor) to accelerate the development variant of the deep networks with that capability is of fake digital visual media detection methods [24]. deep autoencoders, which have been widely applied Recently, Inc. teaming up with Corp for dimensionality reduction and and the Partnership on AI coalition have launched the [29]–[31]. The first attempt of deepfake creation was 3

TABLE I SUMMARY OF NOTABLE DEEPFAKE TOOLS

Tools Links Key Features Faceswap https://github.com/deepfakes/faceswap - Using two encoder-decoder pairs. - Parameters of the encoder are shared. Faceswap-GAN https://github.com/shaoanlu/faceswap-GAN Adversarial loss and perceptual loss (VGGface) are added to an auto-encoder archi- tecture. Few-Shot Face https://github.com/shaoanlu/fewshot-face- - Use a pre-trained face recognition model to extract latent embeddings for GAN Translation translation-GAN processing. - Incorporate semantic priors obtained by modules from FUNIT [42] and SPADE [43]. DeepFaceLab https://github.com/iperov/DeepFaceLab - Expand from the Faceswap method with new models, e.g. H64, H128, LIAEF128, SAE [44]. - multiple face extraction modes, e.g. S3FD, MTCNN, , or manual [44]. DFaker https://github.com/dfaker/df - DSSIM loss [45] is used to reconstruct face. - Implemented based on library. DeepFake tf https://github.com/StromWine/DeepFake tf Similar to DFaker but implemented based on tensorflow. AvatarMe https://github.com/lattas/AvatarMe - Reconstruct 3D faces from arbitrary “in-the-wild” images. - Can reconstruct authentic 4K by 6K-resolution 3D faces from a single low-resolution image [46]. MarioNETte https://hyperconnect.github.io/MarioNETte - A few-shot face reenactment framework that preserves the target identity. - No additional fine-tuning phase is needed for identity adaptation [47]. DiscoFaceGAN https://github.com/microsoft/DiscoFaceGAN - Generate face images of virtual people with independent latent variables of identity, expression, pose, and illumination. - Embed 3D priors into adversarial learning [48]. StyleRig https://gvv.mpi-inf.mpg.de/projects/StyleRig - Create portrait images of faces with a rig-like control over a pretrained and fixed StyleGAN via 3D morphable face models. - Self-supervised without manual annotations [49]. FaceShifter https://lingzhili.com/FaceShifterPage - Face swapping in high-fidelity by exploiting and integrating the target attributes. - Can be applied to any new face pairs without requiring subject specific training [50]. FSGAN https://github.com/YuvalNirkin/fsgan - A face swapping and reenactment model that can be applied to pairs of faces without requiring training on those faces. - Adjust to both pose and expression variations [51]. Transformable https://github.com/kyleolsz/TB-Networks - A method for fine-grained 3D manipulation of image content. Bottleneck - Apply spatial transformations in CNN models using a transformable bottleneck Networks framework [52]. “Do as ” github.com/carolineec/EverybodyDanceNow - Automatically transfer the motion from a source to a target person by learning a Motion Trans- video-to-video translation. fer - Can create a motion-synchronized dancing video with multiple subjects [53]. Neural Voice https://justusthies.github.io/posts/neural- - A method for audio-driven facial video synthesis. Puppetry voice-puppetry - Synthesize videos of a talking head from an audio sequence of another person using 3D face representation. [54].

FakeApp, developed by a user using - decoder pairing structure [32], [33]. In that method, the autoencoder extracts latent features of face images and the decoder is used to reconstruct the face images. To swap faces between source images and target images, there is a need of two encoder-decoder pairs where each pair is used to train on an image set, and the encoder’s parameters are shared between two network pairs. In other words, two pairs have the same encoder network. This strategy enables the common encoder to find and learn the similarity between two sets of face images, which are relatively unchallenging because faces normally have similar features such as eyes, nose, mouth positions. Fig. 2 shows a deepfake creation process where the feature set of face A is connected with the Fig. 2. A deepfake creation model using two encoder-decoder pairs. decoder B to reconstruct face B from the original face A. Two networks use the same encoder but different decoders for training This approach is applied in several works such as Deep- process (top). An image of face A is encoded with the common FaceLab [34], DFaker [35], DeepFake tf (tensorflow- encoder and decoded with decoder B to create a deepfake (bottom). based deepfakes) [36]. 4

By adding adversarial loss and perceptual loss im- plemented in VGGFace [37] to the encoder-decoder architecture, an improved version of deepfakes based on the generative adversarial network (GAN) [38], i.e. faceswap-GAN, was proposed in [39]. The VGGFace perceptual loss is added to make eye movements to be more realistic and consistent with input faces and help to smooth out artifacts in segmentation mask, leading to higher quality output videos. This model facilitates the creation of outputs with 64x64, 128x128, and 256x256 resolutions. In , the multi-task convolutional Fig. 3. Categories of reviewed papers relevant to deepfake detection methods where we divide papers into two major groups, i.e. fake neural network (CNN) from the FaceNet implementation image detection and face video detection. [40] is introduced to make face detection more stable and face alignment more reliable. The CycleGAN [41] is utilized for generative network implementation. Popular about the critical need of future development of more deepfake tools and their typical features are summarized robust methods that can detect deepfakes from genuine. in Table I. This section presents a survey of deepfake detection methods where we them into two major cate- gories: fake image detection methods and fake video III.DEEPFAKE DETECTION detection ones (see Fig. 3). The latter is distinguished Deepfakes are increasingly detrimental to privacy, so- into two smaller groups: visual artifacts within single ciety security and democracy [55]. Methods for detecting video frame-based methods and temporal features across deepfakes have been proposed as soon as this threat was frames-based ones. Whilst most of the methods based on introduced. Early attempts were based on handcrafted temporal features use deep learning recurrent classifica- features obtained from artifacts and inconsistencies of tion models, the methods use visual artifacts within video the fake video synthesis process. Recent methods, on the frame can be implemented by either deep or shallow other hand, applied deep learning to automatically extract classifiers. salient and discriminative features to detect deepfakes [56], [57]. A. Fake Image Detection Deepfake detection is normally deemed a binary clas- Face swapping has a number of compelling applica- sification problem where classifiers are used to classify tions in video compositing, transfiguration in portraits, between authentic videos and tampered ones. This kind and especially in identity protection as it can replace of methods requires a large database of real and fake faces in photographs by ones from a collection of stock videos to train classification models. The number of fake images. However, it is also one of the techniques that videos is increasingly available, but it is still limited cyber attackers employ to penetrate identification or in terms of setting a benchmark for validating various authentication systems to gain illegitimate access. The detection methods. To address this issue, Korshunov use of deep learning such as CNN and GAN has made and Marcel [58] produced a notable deepfake dataset swapped face images more challenging for forensics consisting of 620 videos based on the GAN model using models as it can preserve pose, facial expression and the open source code Faceswap-GAN [39]. Videos from lighting of the photographs [66]. Zhang et al. 
[67] used the publicly available VidTIMIT database [59] were used the bag of words method to extract a set of compact to generate low and high quality deepfake videos, which features and fed it into various classifiers such as SVM can effectively mimic the facial expressions, mouth [68], random forest (RF) [69] and multi- percep- movements, and eye blinking. These videos were then trons (MLP) [70] for discriminating swapped face im- used to test various deepfake detection methods. Test ages from the genuine. Among deep learning-generated results show that the popular face recognition systems images, those synthesised by GAN models are probably based on VGG [60] and Facenet [40], [61] are unable to most difficult to detect as they are realistic and high- detect deepfakes effectively. Other methods such as lip- quality based on GAN’s capability to learn distribution syncing approaches [62]–[64] and image quality metrics of the complex input data and generate new outputs with with support vector machine (SVM) [65] produce very similar input distribution. high error rate when applied to detect deepfake videos Most works on detection of GAN generated images from this newly produced dataset. This raises concerns however do not consider the generalization capability 5 of the detection models although the development of general dataset is extracted from the ILSVRC12 [86]. GAN is ongoing, and many new extensions of GAN are The large scale GAN training model for high fidelity frequently introduced. Xuan et al. [71] used an image natural image synthesis (BIGGAN) [87], self-attention preprocessing step, e.g. and Gaussian GAN [88] and spectral normalization GAN [89] are noise, to remove low level high frequency clues of GAN used to generate fake images with size of 128x128. The images. This increases the pixel level statistical similarity training set consists of 600,000 fake and real images between real images and fake images and requires the whilst the test set includes 10,000 images of both types. forensic classifier to learn more intrinsic and meaningful Experimental results show the superior performance of features, which has better generalization capability than the proposed method against its competing methods such previous image forensics methods [72], [73] or image as those introduced in [90]–[93]. steganalysis networks [74]. On the other hand, Agarwal and Varshney [75] cast the B. Fake Video Detection GAN-based deepfake detection as a hypothesis testing problem where a statistical framework was introduced Most image detection methods cannot be used for using the information-theoretic study of authentication videos because of the strong degradation of the frame [76]. The minimum distance between distributions of data after video compression [94]. Furthermore, videos legitimate images and images generated by a particular have temporal characteristics that are varied among sets GAN is defined, namely the oracle error. The analytic of frames and thus challenging for methods designed to results show that this distance increases when the GAN detect only still fake images. This subsection focuses on is less accurate, and in this case, it is easier to detect deepfake video detection methods and categorizes them deepfakes. In case of high-resolution image inputs, an into two smaller groups: methods that employ temporal extremely accurate GAN is required to generate fake features and those that explore visual artifacts within images that are hard to detect. frames. 
1) Temporal Features across Video Frames: Recently, Hsu et al. [77] introduced a two-phase deep Based on learning method for detection of deepfake images. The the observation that temporal coherence is not enforced first phase is a feature extractor based on the common effectively in the synthesis process of deepfakes, Sabir et fake feature network (CFFN) where the Siamese net- al. [95] leveraged the use of spatio-temporal features of work architecture presented in [78] is used. The CFFN video streams to detect deepfakes. is encompasses several dense units with each unit including carried out on a frame-by-frame basis so that low level different numbers of dense blocks [79] to improve the artifacts produced by face manipulations are believed to representative capability for the fake images. The number further manifest themselves as temporal artifacts with of dense units is three or five depending on the validation inconsistencies across frames. A recurrent convolutional data being face or general images, and the number of model (RCN) was proposed based on the integration of channels in each unit is varied up to a few hundreds. Dis- the convolutional network DenseNet [79] and the gated criminative features between the fake and real images, recurrent unit cells [96] to exploit temporal discrepancies i.e. pairwise information, are extracted through CFFN across frames (see Fig. 4). The proposed method is learning process. These features are then fed into the on the FaceForensics++ dataset, which includes 1,000 second phase, which is a small CNN concatenated to the videos [97], and shows promising results. last convolutional layer of CFFN to distinguish deceptive images from genuine. The proposed method is validated for both fake face and fake general image detection. On the one hand, the face dataset is obtained from CelebA [80], containing 10,177 identities and 202,599 aligned face images of various poses and background clutter. Fig. 4. A two-step process for face manipulation detection where the Five GAN variants are used to generate fake images with preprocessing step aims to detect, crop and align faces on a sequence size of 64x64, including deep convolutional GAN (DC- of frames and the second step distinguishes manipulated and authentic face images by combining convolutional neural network (CNN) and GAN) [81], Wasserstein GAN (WGAN) [82], WGAN () [95]. with gradient penalty (WGAN-GP) [83], least squares GAN [84], and progressive growth of GAN (PGGAN) Likewise, Guera and Delp [98] highlighted that deep- [85]. A total of 385,198 training images and 10,000 fake videos contain intra-frame inconsistencies and tem- test images of both real and fake ones are obtained for poral inconsistencies between frames. They then pro- validating the proposed method. On the other hand, the posed the temporal-aware pipeline method that uses 6

CNN and long short term memory (LSTM) to detect above the threshold of 0.5 with duration less than 7 deepfake videos. CNN is employed to extract frame-level frames. This method is evaluated on a dataset collected features, which are then fed into the LSTM to create a from the web consisting of 49 interview and presentation temporal sequence descriptor. A fully-connected network videos and their corresponding fake videos generated is finally used for classifying doctored videos from real by the deepfake algorithms. The experimental results ones based on the sequence descriptor as illustrated in indicate promising performance of the proposed method Fig. 5. in detecting fake videos, which can be further improved by considering dynamic pattern of blinking, e.g. highly frequent blinking may also be a sign of tampering. 2) Visual Artifacts within Video Frame: As can be noticed in the previous subsection, the methods using temporal patterns across video frames are mostly based on deep recurrent network models to detect deepfake Fig. 5. A deepfake detection method using convolutional neural videos. This subsection investigates the other approach network (CNN) and long short term memory (LSTM) to extract that normally decomposes videos into frames and ex- temporal features of a given video sequence, which are represented plores visual artifacts within single frames to obtain via the sequence descriptor. The detection network consisting of fully- connected layers is employed to take the sequence descriptor as input discriminant features. These features are then distributed and calculate of the frame sequence belonging to either into either a deep or shallow classifier to differentiate be- authentic or deepfake class [98]. tween fake and authentic videos. We thus group methods in this subsection based on the types of classifiers, i.e. On the other hand, the use of a physiological , either deep or shallow. eye blinking, to detect deepfakes was proposed in [99] a) Deep classifiers: Deepfake videos are normally based on the observation that a person in deepfakes created with limited resolutions, which require an affine has a lot less frequent blinking than that in untampered face warping approach (i.e., scaling, rotation and shear- videos. A healthy adult human would normally blink ing) to match the configuration of the original ones. somewhere between 2 to 10 seconds, and each blink Because of the resolution inconsistency between the would take 0.1 and 0.4 seconds. Deepfake algorithms, warped face area and the surrounding context, this however, often use face images available online for process leaves artifacts that can be detected by CNN training, which normally show people with open eyes, models such as VGG16 [101], ResNet50, ResNet101 i.e. very few images published on the internet show and ResNet152 [102]. A deep learning method to detect people with closed eyes. Thus, without having access to deepfakes based on the artifacts observed during the face images of people blinking, deepfake algorithms do not warping step of the deepfake generation algorithms was have the capability to generate fake faces that can blink proposed in [103]. The proposed method is evaluated normally. In other words, blinking rates in deepfakes are on two deepfake datasets, namely the UADFV and much lower than those in normal videos. To discriminate DeepfakeTIMIT. The UADFV dataset [104] contains 49 real and fake videos, Li et al. 
[99] first decompose the real videos and 49 fake videos with 32,752 frames in videos into frames where face regions and then eye areas total. The DeepfakeTIMIT dataset [64] includes a set of are extracted based on six eye landmarks. After few low quality videos of 64 x 64 size and another set of high steps of pre-processing such as aligning faces, extracting quality videos of 128 x 128 with totally 10,537 pristine and scaling the bounding boxes of eye landmark points images and 34,023 fabricated images extracted from 320 to create new sequences of frames, these cropped eye videos for each quality set. Performance of the proposed area sequences are distributed into long-term recurrent method is compared with other prevalent methods such convolutional networks (LRCN) [100] for dynamic state as two deepfake detection MesoNet methods, i.e. Meso- prediction. The LRCN consists of a feature extractor 4 and MesoInception-4 [94], HeadPose [104], and the based on CNN, a sequence learning based on long short face tampering detection method two-stream NN [105]. term memory (LSTM), and a state prediction based on Advantage of the proposed method is that it needs not a fully connected layer to predict of eye to generate deepfake videos as negative examples before open and close state. The eye blinking shows strong training the detection models. Instead, the negative ex- temporal dependencies and thus the implementation of amples are generated dynamically by extracting the face LSTM helps to capture these temporal patterns effec- region of the original image and aligning it into multiple tively. The blinking rate is calculated based on the scales before applying Gaussian blur to a scaled image prediction results where a blink is defined as a peak of random pick and warping back to the original image. 7

This reduces a large amount of time and computational et al. [104] proposed a detection method by observing resources compared to other methods, which require the differences between 3D head poses comprising head deepfakes are generated in advance. orientation and position, which are estimated based on Nguyen et al. [106] proposed the use of capsule 68 facial landmarks of the central face region. The 3D networks for detecting manipulated images and videos. head poses are examined because there is a shortcoming The capsule network was initially introduced to address in the deepfake face generation pipeline. The extracted limitations of CNNs when applied to inverse graphics features are fed into an SVM classifier to obtain the tasks, which aim to find physical processes used to pro- detection results. Experiments on two datasets show the duce images of the world [107]. The recent development great performance of the proposed approach against its of capsule network based on dynamic routing competing methods. The first dataset, namely UADFV, [108] demonstrates its ability to describe the hierarchical consists of 49 deep fake videos and their respective real pose relationships between object parts. This develop- videos [104]. The second dataset comprises 241 real ment is employed as a component in a pipeline for images and 252 deep fake images, which is a subset detecting fabricated images and videos as demonstrated of data used in the DARPA MediFor GAN Image/Video in Fig. 6. A dynamic routing algorithm is deployed to Challenge [113]. Likewise, a method to exploit artifacts route the outputs of the three capsules to the output of deepfakes and face manipulations based on visual capsules through a number of iterations to separate features of eyes, teeth and facial contours was studied in between fake and real images. The method is evaluated [114]. The visual artifacts arise from lacking global con- through four datasets covering a wide range of forged sistency, wrong or imprecise estimation of the incident image and video attacks. They include the well-known illumination, or imprecise estimation of the underlying Idiap Research Institute replay-attack dataset [109], the geometry. For deepfakes detection, missing reflections deepfake face swapping dataset created by Afchar et and missing details in the eye and teeth areas are al. [94], the facial reenactment FaceForensics dataset exploited as well as texture features extracted from the [110], produced by the Face2Face method [111], and facial region based on facial landmarks. Accordingly, the the fully computer-generated image dataset generated by eye feature vector, teeth feature vector and features ex- Rahmouni et al. [112]. The proposed method yields the tracted from the full-face crop are used. After extracting best performance compared to its competing methods the features, two classifiers including logistic regression in all of these datasets. This shows the potential of the and small neural network are employed to classify the capsule network in building a general detection system deepfakes from real videos. Experiments carried out on that can work effectively for various forged image and a video dataset downloaded from YouTube show the best video attacks. result of 0.851 in terms of the area under the receiver operating characteristics curve. The proposed method however has a disadvantage that requires images meeting certain prerequisite such as open eyes or visual teeth. 
The use of photo response non uniformity (PRNU) analysis was proposed in [115] to detect deepfakes from authentic ones. PRNU is a component of sensor pattern noise, which is attributed to the manufacturing imperfec- tion of silicon wafers and the inconsistent sensitivity of pixels to light because of the variation of the physical characteristics of the silicon wafers. When a photo is Fig. 6. Capsule network takes features obtained from the VGG-19 taken, the sensor imperfection is introduced into the network [101] to distinguish fake images or videos from the real ones high-frequency bands of the content in the form of invisi- (top). The pre-processing step detects face region and scales it to the ble noise. Because the imperfection is not uniform across size of 128x128 before VGG-19 is used to extract latent features for the silicon wafer, even sensors made from the silicon the capsule network, which comprises three primary capsules and two output capsules, one for real and one for fake images (bottom). The wafer produce PRNU. Therefore, PRNU is often statistical pooling constitutes an important part of capsule network considered as the fingerprint of digital cameras left in that deals with forgery detection [106]. the images by the cameras [116]. The analysis is widely used in image forensics [117]–[120] and advocated to b) Shallow classifiers: Deepfake detection methods use in [115] because the swapped face is supposed mostly rely on the artifacts or inconsistency of intrinsic to alter the local PRNU pattern in the facial area of features between fake and real images or videos. Yang video frames. The videos are converted into frames, 8 which are cropped to the questioned facial region. The their sabotage strategy without using social media. For cropped frames are then separated sequentially into eight example, this approach can be utilized by intelligence groups where an average PRNU pattern is computed services trying to influence decisions made by important for each group. Normalised cross correlation scores are people such as politicians, leading to national and in- calculated for comparisons of PRNU patterns among ternational security threats [124]. Catching the deepfake these groups. The authors in [115] created a test dataset alarming problem, research community has focused on consisting of 10 authentic videos and 16 manipulated developing deepfake detection algorithms and numerous videos, where the fake videos were produced from the results have been reported. This paper has reviewed genuine ones by the DeepFaceLab tool [34]. The analysis the state-of-the-art methods and a summary of typical shows a significant statistical difference in terms of mean approaches is provided in Table II. It is noticeable that a normalised cross correlation scores between deepfakes battle between those who use advanced and the genuine. This analysis therefore suggests that to create deepfakes with those who make effort to detect PRNU has a potential in deepfake detection although a deepfakes is growing. larger dataset would need to be tested. Deepfakes’ quality has been increasing and the per- When seeing a video or image with suspicion, users formance of detection methods needs to be improved normally want to search for its origin. However, there accordingly. The inspiration is that what AI has broken is currently no feasibility for such a tool. Hasan and can be fixed by AI as well [125]. 
Detection methods are Salah [121] proposed the use of and smart still in their early stage and various methods have been contracts to help users detect deepfake videos based proposed and evaluated but using fragmented datasets. on the assumption that videos are only real when their An approach to improve performance of detection meth- sources are traceable. Each video is associated with a ods is to create a growing updated benchmark dataset smart contract that links to its parent video and each of deepfakes to validate the ongoing development of parent video has a link to its child in a hierarchical struc- detection methods. This will facilitate the training pro- ture. Through this chain, users can credibly trace back cess of detection models, especially those based on deep to the original smart contract associated with pristine learning, which requires a large training set [126]. video even if the video has been copied multiple times. On the other hand, current detection methods mostly An important attribute of the smart contract is the unique focus on drawbacks of the deepfake generation pipelines, hashes of the interplanetary file system, which is used i.e. finding weakness of the competitors to attack them. to store video and its metadata in a decentralized and This kind of information and knowledge is not always content-addressable manner [122]. The smart contract’s available in adversarial environments where attackers key features and functionalities are tested against several commonly attempt not to reveal such deepfake creation common security challenges such as distributed denial technologies. Recent works on adversarial perturbation of services, replay and man in the middle attacks to attacks to fool DNN-based detectors make the deepfake ensure the solution meeting security requirements. This detection task more difficult [127]–[131]. These are approach is generic, and it can be extended to other types real challenges for detection method development and of digital content, e.g., images, audios and manuscripts. a future research needs to focus on introducing more robust, scalable and generalizable methods. IV. DISCUSSIONSAND FUTURE RESEARCH Another research direction is to integrate detection DIRECTIONS methods into distribution platforms such as social me- Deepfakes have begun to erode trust of people in dia to increase its effectiveness in dealing with the media contents as seeing them is no longer commen- widespread impact of deepfakes. The screening or filter- surate with believing in them. They could cause distress ing mechanism using effective detection methods can be and negative effects to those targeted, heighten disin- implemented on these platforms to ease the deepfakes formation and hate speech, and even could stimulate detection [124]. Legal requirements can be made for political tension, inflame the public, violence or war. tech companies who own these platforms to remove This is especially critical nowadays as the technologies deepfakes quickly to reduce its impacts. In addition, for creating deepfakes are increasingly approachable and watermarking tools can also be integrated into devices social media platforms can spread those fake contents that people use to make digital contents to create im- quickly [123]. Sometimes deepfakes do not need to be mutable metadata for storing originality details such as spread to massive audience to cause detrimental effects. 
time and location of multimedia contents as well as their People who create deepfakes with malicious purpose untampered attestment [124]. This integration is difficult only need to deliver them to target audiences as part of to implement but a solution for this could be the use 9 of the disruptive blockchain technology. The blockchain Using detection methods to spot deepfakes is crucial, has been used effectively in many areas and there are but understanding the real intent of people publishing very few studies so far addressing the deepfake detection deepfakes is even more important. This requires the problems based on this technology. As it can create judgement of users based on social context in which a chain of unique unchangeable blocks of metadata, deepfake is discovered, e.g. who distributed it and what it is a great tool for digital provenance solution. The they said about it [132]. This is critical as deepfakes integration of blockchain technologies to this problem are getting more and more photorealistic and it is highly has demonstrated certain results [121] but this research anticipated that detection software will be lagging behind direction is far from mature. deepfake creation technology. A study on social context

TABLE II SUMMARY OF PROMINENT DEEPFAKE DETECTION METHODS

Methods Classifiers/ Key Features Dealing Datasets Used Techniques with Eye blinking [99] LRCN - Use LRCN to learn the temporal patterns of eye Videos Consist of 49 interview and presentation videos, blinking. and their corresponding generated deepfakes. - Based on the observation that blinking frequency of deepfakes is much smaller than normal. Intra-frame CNN and CNN is employed to extract frame-level features, Videos A collection of 600 videos obtained from multiple and temporal LSTM which are distributed to LSTM to construct sequence websites. inconsistencies descriptor useful for classification. [98] Using face VGG16 [101] Artifacts are discovered using CNN models based Videos - UADFV [104], containing 49 real videos and 49 warping artifacts ResNet50, on resolution inconsistency between the warped face fake videos with 32752 frames in total. [103] 101 or 152 area and the surrounding context. - DeepfakeTIMIT [64] [102] MesoNet [94] CNN - Two deep networks, i.e. Meso-4 and Videos Two datasets: deepfake one constituted from online MesoInception-4 are introduced to examine videos and the FaceForensics one created by the deepfake videos at the mesoscopic analysis level. Face2Face approach [111]. - Accuracy obtained on deepfake and FaceForensics datasets are 98% and 95% respectively. Capsule- Capsule - Latent features extracted by VGG-19 network [101] Videos/ Four datasets: the Idiap Research Institute replay- forensics [106] networks are fed into the capsule network for classification. Images attack [109], deepfake face swapping by [94], - A dynamic routing algorithm [108] is used to route facial reenactment FaceForensics [110], and fully the outputs of three convolutional capsules to two computer-generated image set using [112]. output capsules, one for fake and another for real images, through a number of iterations. Head poses [104] SVM - Features are extracted using 68 landmarks of the Videos/ - UADFV consists of 49 deep fake videos and their face region. Images respective real videos. - Use SVM to classify using the extracted features. - 241 real images and 252 deep fake images from DARPA MediFor GAN Image/Video Challenge. Eye, teach and Logistic - Exploit facial texture differences, and missing Videos A video dataset downloaded from YouTube. facial texture regression reflections and details in eye and teeth areas of [114] and neural deepfakes. network - Logistic regression and neural network are used for classifying. Spatio-temporal RCN Temporal discrepancies across frames are explored Videos FaceForensics++ dataset, including 1,000 videos features with using RCN that integrates convolutional network [97]. RCN [95] DenseNet [79] and the gated recurrent unit cells [96] Spatio-temporal Convolutional - An XceptionNet CNN is used for facial feature Videos FaceForensics++ [97] and Celeb-DF (5,639 deep- features with bidirectional extraction while audio embeddings are obtained by fake videos) [141] datasets and the ASVSpoof LSTM [140] recurrent stacking multiple modules. 2019 Logical Access audio dataset [142]. LSTM - Two loss functions, i.e. cross-entropy and network Kullback-Leibler divergence, are used. Analysis of PRNU - Analysis of noise patterns of light sensitive sensors Videos Created by the authors, including 10 authentic and PRNU [115] of digital cameras due to their factory defects. 16 deepfake videos using DeepFaceLab [34]. - Explore the differences of PRNU patterns be- tween the authentic and deepfake videos because face swapping is believed to alter the local PRNU patterns. 
Phoneme-viseme CNN - Exploit the mismatches between the dynamics Videos Four in-the-wild lip-sync deepfakes from mismatches of the mouth shape, i.e. visemes, with a spoken and YouTube (www.instagram.com/bill posters uk [133] phoneme. and youtu.be/VWMEDacz3L4) and others are cre- - Focus on sounds associated with the M, B and P ated using synthesis techniques, i.e. Audio-to- phonemes as they require complete mouth Video (A2V) [63] and Text-to-Video (T2V) [134]. while deepfakes often incorrectly synthesize it. 10

Methods Classifiers/ Key Features Dealing Datasets Used Techniques with Using ResNet50 - The ABC metric [137] is used to detect deepfake Videos VidTIMIT and two other original attribution- model [102], videos without accessing to training data. datasets obtained from the COHFACE based pre-trained on - ABC values obtained for original videos are greater (https://www.idiap.ch/dataset/cohface) and confidence VGGFace2 than 0.94 while those of deepfakes have low ABC from YouTube. datasets from COHFACE (ABC) metric [136] values. [138] and YouTube are used to generate two [135] deepfake datasets by commercial website https://deepfakesweb.com and another deepfake dataset is DeepfakeTIMIT [139]. Emotion audio- Siamese Modality and emotion embedding vectors for the Videos DeepfakeTIMIT [139] and DFDC [144]. visual affective network [78] face and speech are extracted for deepfake detection. cues [143] Using Rules defined Temporal, behavioral biometric based on facial ex- Videos The world leaders dataset [1], FaceForensics++ appearance based on facial pressions and head movements are learned using [97], /Jigsaw deepfake detection dataset and behaviour and behavioural ResNet-101 [102] while static facial biometric is [146], DFDC [144] and Celeb-DF [141]. [145] features. obtained using VGG [60]. FakeCatcher CNN Extract biological in portrait videos and use Videos UADFV [104], FaceForensics [110], FaceForen- [147] them as an implicit descriptor of authenticity because sics++ [97], Celeb-DF [141], and a new dataset of they are not spatially and temporally well-preserved 142 videos, independent of the generative model, in deepfakes. resolution, compression, content, and context. Preprocessing DCGAN, - Enhance generalization ability of deep learning Images - Real dataset: CelebA-HQ [85], including high combined with WGAN-GP models to detect GAN generated images. quality face images of 1024x1024 resolution. deep network and PGGAN. - Remove low level features of fake images. - Fake datasets: generated by DCGAN [81], [71] - Force deep networks to focus more on pixel level WGAN-GP [83] and PGGAN [85]. similarity between fake and real images to improve generalization ability. Bag of words SVM, RF, MLP Extract discriminant features using bag of words Images The well-known LFW face database [148], con- and shallow method and feed these features into SVM, RF and taining 13,223 images with resolution of 250x250. classifiers [67] MLP for binary classification: innocent vs fabricated. Pairwise learn- CNN concate- Two-phase procedure: feature extraction using CFFN Images - Face images: real ones from CelebA [80], and ing [77] nated to CFFN based on the Siamese network architecture [78] and fake ones generated by DCGAN [81], WGAN classification using CNN. [82], WGAN-GP [83], least squares GAN [84], and PGGAN [85]. - General images: real ones from ILSVRC12 [86], and fake ones generated by BIGGAN [87], self- attention GAN [88] and spectral normalization GAN [89]. Defenses VGG [60] and - Introduce adversarial perturbations to enhance Images 5,000 real images from CelebA [80] and 5,000 fake against ResNet [102] deepfakes and fool deepfake detectors. images created by the “Few-Shot Face Translation adversarial - Improve accuracy of deepfake detectors using Lips- GAN” method [149]. perturbations in chitz regularization and deep image prior techniques. 
deepfakes [127] Analyzing KNN, SVM, Using expectation maximization algorithm to extract Images Authentic images from CelebA and correspond- convolutional and linear local features pertaining to convolutional generative ing deepfakes are created by five different GANs traces [150] discriminant process of GAN-based image deepfake generators. (group-wise deep whitening-and-coloring transfor- analysis mation GDWCT [151], StarGAN [152], AttGAN [153], StyleGAN [154], StyleGAN2 [155]). Face X-ray CNN - Try to locate the blending boundary between the Images FaceForensics++ [97], DeepfakeDetection (DFD) [156] target and original faces instead of capturing the [146], DFDC [144] and Celeb-DF [141]. synthesized artifacts of specific manipulations. - Can be trained without fake images. Using common ResNet-50 Train the classifier using a large number of fake Images A new dataset of CNN-generated images, namely artifacts of [102] pre- images generated by a high-performing uncondi- ForenSynths, consisting of synthesized images CNN-generated trained with tional GAN model, i.e., ProGAN [85] and evaluate from 11 models such as StyleGAN [154], super- images [157] ImageNet [86] how well the classifier generalizes to other CNN- resolution methods [158] and FaceForensics++ synthesized images. [97]. of deepfakes to assist users in such judgement is thus and thus the experts’ opinions may not be enough to worth performing. authenticate these evidences because even experts are Videos and photographics have been widely used as unable to discern manipulated contents. This aspect evidences in police investigation and justice cases. They needs to take into account in courtrooms nowadays when may be introduced as evidences in a court of law by images and videos are used as evidences to convict digital media forensics experts who have background in perpetrators because of the existence of a wide range of computer or law enforcement and experience in collect- digital manipulation methods [159]. The digital media ing, examining and analysing digital information. The forensics results therefore must be proved to be valid development of machine learning and AI technologies and reliable before they can be used in courts. This might have been used to modify these digital contents 11 requires careful documentation for each step of the foren- [11] Guo, B., Ding, Y., Yao, L., Liang, Y., and Yu, Z. (2020). The sics process and how the results are reached. Machine future of false information detection on social media: new perspectives and trends. ACM Computing Surveys (CSUR), learning and AI algorithms can be used to support the 53(4), 1-36. determination of the authenticity of digital media and [12] Tucker, P. (2019, March 31). The newest AI-enabled have obtained accurate and reliable results, e.g. [160], weapon: ‘Deep-Faking’ photos of the earth. Available at [161], but most of these algorithms are unexplainable. https://www.defenseone.com/technology/2019/03/next-phase- ai-deep-faking-whole-world-and-china-ahead/155944/ This creates a huge hurdle for the applications of AI [13] Fish, T. (2019, April 4). Deep fakes: AI-manipulated in forensics problems because not only the forensics media will be ‘weaponised’ to trick military. Avail- experts oftentimes do not have expertise in computer able at https://www.express.co.uk/news/science/1109783/deep- algorithms, but the computer professionals also cannot fakes-ai-artificial-intelligence-photos-video-weaponised-china [14] Marr, B. (2019, July 22). 
The best (and scariest) explain the results properly as most of these algorithms examples of AI-enabled deepfakes. Available at are black box models [162]. This is more critical as https://www.forbes.com/sites/bernardmarr/2019/07/22/the- the most recent models with the most accurate results best-and-scariest-examples-of-ai-enabled-deepfakes/ [15] Zakharov, E., Shysheya, A., Burkov, E., and Lempitsky, V. are based on deep learning methods consisting of many (2019). Few-shot adversarial learning of realistic neural talking neural network parameters. Explainable AI in computer head models. arXiv preprint arXiv:1905.08233. vision therefore is a research direction that is needed to [16] Damiani, J. (2019, September 3). A voice deepfake was promote and utilize the advances and advantages of AI used to scam a CEO out of $243,000. Available at https://www.forbes.com/sites/jessedamiani/2019/09/03/a- and machine learning in digital media forensics. voice-deepfake-was-used-to-scam-a-ceo-out-of-243000/ [17] Samuel, S. (2019, June 27). A guy made a deepfake app to turn photos of women into nudes. It didn’t go well. Available at https://www.vox.com/2019/6/27/18761639/ai-deepfake- REFERENCES deepnude-app-nude-women-porn [18] (2019, September 2). Chinese deepfake [1] Agarwal, S., Farid, H., Gu, Y., He, M., Nagano, K., and Li, app Zao sparks privacy row after going viral. Available at H. (2019, June). Protecting world leaders against deep fakes. https://www.theguardian.com/technology/2019/sep/02/chinese- In Computer Vision and Workshops (pp. face-swap-app-zao-triggers-privacy-fears-viral 38-45). [19] Lyu, S. (2020, July). Deepfake detection: current challenges [2] Tewari, A., Zollhoefer, M., Bernard, F., Garrido, P., Kim, H., and next steps. In IEEE International Conference on Multime- Perez, P., and Theobalt, C. (2020). High-fidelity monocular dia and Expo Workshops (ICMEW) (pp. 1-6). IEEE. face reconstruction based on an unsupervised model-based [20] Guarnera, L., Giudice, O., Nastasi, C., and Battiato, S. (2020). face autoencoder. IEEE Transactions on Pattern Analysis and Preliminary forensics analysis of deepfake images. arXiv Machine Intelligence, 42(2), 357-370. preprint arXiv:2004.12626. [21] Jafar, M. T., Ababneh, M., Al-Zoube, M., and Elhassan, A. [3] Lin, J., Li, Y., & Yang, G. (2021). FPGAN: Face de- (2020, April). Forensics and analysis of deepfake videos. identification method with generative adversarial networks for In The 11th International Conference on Information and social robots. Neural Networks, 133, 132-147. Communication Systems (ICICS) (pp. 053-058). IEEE. [4] Liu, M. Y., Huang, X., Yu, J., Wang, T. C., & Mallya, A. [22] Trinh, L., Tsang, M., Rambhatla, S., and Liu, Y. (2020). (2021). Generative adversarial networks for image and video Interpretable deepfake detection via dynamic prototypes. arXiv synthesis: Algorithms and applications. Proceedings of the preprint arXiv:2006.15473. IEEE, DOI: 10.1109/JPROC.2021.3049196. [23] Younus, M. A., and Hasan, T. M. (2020, April). Effective [5] Lyu, S. (2018, August 29). Detecting ‘deepfake’ and fast deepfake detection method based on Haar videos in the blink of an eye. Available at transform. In 2020 International Conference on Computer http://theconversation.com/detecting-deepfake-videos-in- Science and Software (CSASE) (pp. 186-190). the-blink-of-an-eye-101072 IEEE. [6] Bloomberg (2018, September 11). How faking videos [24] Turek, M. (2019). Media Forensics (MediFor). Available at became easy and why that’s so scary. 
Available at https://www.darpa.mil/program/media-forensics https://fortune.com/2018/09/11/deep-fakes-obama-video/ [25] Schroepfer, M. (2019, September 5). Creating a [7] Chesney, R., and Citron, D. (2019). Deepfakes and the new data set and a challenge for deepfakes. Available at war: The coming age of post-truth geopolitics. https://ai.facebook.com/blog/deepfake-detection-challenge Foreign Affairs, 98, 147. [26] Tolosana, R., Vera-Rodriguez, R., Fierrez, J., Morales, A., & [8] Hwang, T. (2020). Deepfakes: A Grounded Threat Assessment. Ortega-Garcia, J. (2020). Deepfakes and beyond: A survey of Centre for Security and Emerging Technologies, Georgetown face manipulation and fake detection. Information Fusion, 64, University. 131-148. [9] Zhou, X., and Zafarani, R. (2020). A survey of fake [27] Verdoliva, L. (2020). Media forensics and deepfakes: an news: fundamental theories, detection methods, and overview. IEEE Journal of Selected Topics in Signal Process- opportunities. ACM Computing Surveys (CSUR), DOI: ing, 14(5), 910-932. https://doi.org/10.1145/3395046. [28] Mirsky, Y., & Lee, W. (2021). The creation and detection of [10] Kaliyar, R. K., Goswami, A., and Narang, P. (2020). Deepfake: deepfakes: A survey. ACM Computing Surveys (CSUR), 54(1), improving fake news detection using tensor decomposition 1-41. based deep neural network. Journal of Supercomputing, DOI: [29] Punnappurath, A., and Brown, M. S. (2019). Learning raw https://doi.org/10.1007/s11227-020-03294-y. image reconstruction-aware deep image compressors. IEEE 12

[30] Cheng, Z., Sun, H., Takeuchi, M., and Katto, J. (2019). Energy compaction-based image compression using convolutional autoencoder. IEEE Transactions on Multimedia. DOI: 10.1109/TMM.2019.2938345.
[31] Chorowski, J., Weiss, R. J., Bengio, S., and Oord, A. V. D. (2019). Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12), 2041-2053.
[32] Faceswap: Deepfakes software for all. Available at https://github.com/deepfakes/faceswap
[33] FakeApp 2.2.0. Available at https://www.malavida.com/en/soft/fakeapp/
[34] DeepFaceLab. Available at https://github.com/iperov/DeepFaceLab
[35] DFaker. Available at https://github.com/dfaker/df
[36] DeepFake_tf: Deepfake based on tensorflow. Available at https://github.com/StromWine/DeepFake_tf
[37] Keras-VGGFace: VGGFace implementation with Keras framework. Available at https://github.com/rcmalli/keras-vggface
[38] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems (pp. 2672-2680).
[39] Faceswap-GAN. Available at https://github.com/shaoanlu/faceswap-GAN.
[40] FaceNet. Available at https://github.com/davidsandberg/facenet.
[41] CycleGAN. Available at https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix.
[42] Liu, M. Y., Huang, X., Mallya, A., Karras, T., Aila, T., Lehtinen, J., and Kautz, J. (2019). Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision (pp. 10551-10560).
[43] Park, T., Liu, M. Y., Wang, T. C., and Zhu, J. Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2337-2346).
[44] DeepFaceLab: Explained and usage tutorial. Available at https://mrdeepfakes.com/forums/thread-deepfacelab-explained-and-usage-tutorial.
[45] DSSIM. Available at https://github.com/keras-team/keras-contrib/blob/master/keras_contrib/losses/dssim.py.
[46] Lattas, A., Moschoglou, S., Gecer, B., Ploumpis, S., Triantafyllou, V., Ghosh, A., and Zafeiriou, S. (2020). AvatarMe: realistically renderable 3D facial reconstruction "in-the-wild". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 760-769).
[47] Ha, S., Kersner, M., Kim, B., Seo, S., and Kim, D. (2020, April). MarioNETte: few-shot face reenactment preserving identity of unseen targets. In Proceedings of the AAAI Conference on Artificial Intelligence (vol. 34, no. 07, pp. 10893-10900).
[48] Deng, Y., Yang, J., Chen, D., Wen, F., and Tong, X. (2020). Disentangled and controllable face image generation via 3D imitative-contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5154-5163).
[49] Tewari, A., Elgharib, M., Bharaj, G., Bernard, F., Seidel, H. P., Pérez, P., ... and Theobalt, C. (2020). StyleRig: Rigging StyleGAN for 3D control over portrait images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6142-6151).
[50] Li, L., Bao, J., Yang, H., Chen, D., and Wen, F. (2019). FaceShifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457.
[51] Nirkin, Y., Keller, Y., and Hassner, T. (2019). FSGAN: subject agnostic face swapping and reenactment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7184-7193).
[52] Olszewski, K., Tulyakov, S., Woodford, O., Li, H., and Luo, L. (2019). Transformable bottleneck networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7648-7657).
[53] Chan, C., Ginosar, S., Zhou, T., and Efros, A. A. (2019). Everybody dance now. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5933-5942).
[54] Thies, J., Elgharib, M., Tewari, A., Theobalt, C., and Nießner, M. (2020, August). Neural voice puppetry: Audio-driven facial reenactment. In European Conference on Computer Vision (pp. 716-731). Springer, Cham.
[55] Chesney, R., and Citron, D. K. (2018). Deep fakes: a looming challenge for privacy, democracy, and national security. https://dx.doi.org/10.2139/ssrn.3213954.
[56] de Lima, O., Franklin, S., Basu, S., Karwoski, B., and George, A. (2020). Deepfake detection using spatiotemporal convolutional networks. arXiv preprint arXiv:2006.14749.
[57] Amerini, I., and Caldelli, R. (2020, June). Exploiting prediction error inconsistencies through LSTM-based classifiers to detect deepfake videos. In Proceedings of the 2020 ACM Workshop on Information Hiding and Multimedia Security (pp. 97-102).
[58] Korshunov, P., and Marcel, S. (2019). Vulnerability assessment and detection of deepfake videos. In The 12th IAPR International Conference on Biometrics (ICB), pp. 1-6.
[59] VidTIMIT database. Available at http://conradsanderson.id.au/vidtimit/
[60] Parkhi, O. M., Vedaldi, A., and Zisserman, A. (2015, September). Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC) (pp. 41.1-41.12).
[61] Schroff, F., Kalenichenko, D., and Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815-823).
[62] Chung, J. S., Senior, A., Vinyals, O., and Zisserman, A. (2017, July). Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3444-3453).
[63] Suwajanakorn, S., Seitz, S. M., and Kemelmacher-Shlizerman, I. (2017). Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4), 1-13.
[64] Korshunov, P., and Marcel, S. (2018, September). Speaker inconsistency detection in tampered video. In 2018 26th European Signal Processing Conference (EUSIPCO) (pp. 2375-2379). IEEE.
[65] Galbally, J., and Marcel, S. (2014, August). Face anti-spoofing based on general image quality assessment. In 2014 22nd International Conference on Pattern Recognition (pp. 1173-1178). IEEE.
[66] Korshunova, I., Shi, W., Dambre, J., and Theis, L. (2017). Fast face-swap using convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3677-3685).
[67] Zhang, Y., Zheng, L., and Thing, V. L. (2017, August). Automated face swapping and its detection. In 2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP) (pp. 15-19). IEEE.
[68] Wang, X., Thome, N., and Cord, M. (2017). Gaze latent support vector machine for image classification improved by weakly supervised region selection. Pattern Recognition, 72, 59-71.

[69] Bai, S. (2017). Growing random forest on deep convolutional neural networks for scene categorization. Expert Systems with Applications, 71, 279-287.
[70] Zheng, L., Duffner, S., Idrissi, K., Garcia, C., and Baskurt, A. (2016). Siamese multi-layer perceptrons for dimensionality reduction and face identification. Multimedia Tools and Applications, 75(9), 5055-5073.
[71] Xuan, X., Peng, B., Dong, J., and Wang, W. (2019). On the generalization of GAN image forensics. arXiv preprint arXiv:1902.11153.
[72] Yang, P., Ni, R., and Zhao, Y. (2016, September). Recapture image forensics based on Laplacian convolutional neural networks. In International Workshop on Digital Watermarking (pp. 119-128).
[73] Bayar, B., and Stamm, M. C. (2016, June). A deep learning approach to universal image manipulation detection using a new convolutional layer. In Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security (pp. 5-10). ACM.
[74] Qian, Y., Dong, J., Wang, W., and Tan, T. (2015, March). Deep learning for steganalysis via convolutional neural networks. In Media Watermarking, Security, and Forensics 2015 (Vol. 9409, p. 94090J).
[75] Agarwal, S., and Varshney, L. R. (2019). Limits of deepfake detection: A robust estimation viewpoint. arXiv preprint arXiv:1905.03493.
[76] Maurer, U. M. (2000). Authentication theory and hypothesis testing. IEEE Transactions on Information Theory, 46(4), 1350-1356.
[77] Hsu, C. C., Zhuang, Y. X., and Lee, C. Y. (2020). Deep fake image detection based on pairwise learning. Applied Sciences, 10(1), 370.
[78] Chopra, S. (2005). Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 539-546).
[79] Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4700-4708).
[80] Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3730-3738).
[81] Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
[82] Arjovsky, M., Chintala, S., and Bottou, L. (2017, July). Wasserstein generative adversarial networks. In International Conference on Machine Learning (pp. 214-223).
[83] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (pp. 5767-5777).
[84] Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. (2017). Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2794-2802).
[85] Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2017). Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
[86] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... and Berg, A. C. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211-252.
[87] Brock, A., Donahue, J., and Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
[88] Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. (2018). Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318.
[89] Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
[90] Farid, H. (2009). Image forgery detection. IEEE Signal Processing Magazine, 26(2), 16-25.
[91] Mo, H., Chen, B., and Luo, W. (2018, June). Fake faces identification via convolutional neural network. In Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security (pp. 43-47).
[92] Marra, F., Gragnaniello, D., Cozzolino, D., and Verdoliva, L. (2018, April). Detection of GAN-generated fake images over social networks. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR) (pp. 384-389). IEEE.
[93] Hsu, C. C., Lee, C. Y., and Zhuang, Y. X. (2018, December). Learning to detect fake face images in the wild. In 2018 International Symposium on Computer, Consumer and Control (IS3C) (pp. 388-391). IEEE.
[94] Afchar, D., Nozick, V., Yamagishi, J., and Echizen, I. (2018, December). MesoNet: a compact facial video forgery detection network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS) (pp. 1-7). IEEE.
[95] Sabir, E., Cheng, J., Jaiswal, A., AbdAlmageed, W., Masi, I., and Natarajan, P. (2019). Recurrent convolutional strategies for face manipulation detection in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 80-87).
[96] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014, October). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1724-1734).
[97] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., and Nießner, M. (2019). FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1-11).
[98] Guera, D., and Delp, E. J. (2018, November). Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1-6). IEEE.
[99] Li, Y., Chang, M. C., and Lyu, S. (2018, December). In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS) (pp. 1-7). IEEE.
[100] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2625-2634).
[101] Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[102] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
[103] Li, Y., and Lyu, S. (2019). Exposing deepfake videos by detecting face warping artifacts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 46-52).

[104] Yang, X., Li, Y., and Lyu, S. (2019, May). Exposing deep fakes using inconsistent head poses. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8261-8265). IEEE.
[105] Zhou, P., Han, X., Morariu, V. I., and Davis, L. S. (2017, July). Two-stream neural networks for tampered face detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 1831-1839). IEEE.
[106] Nguyen, H. H., Yamagishi, J., and Echizen, I. (2019, May). Capsule-forensics: Using capsule networks to detect forged images and videos. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2307-2311). IEEE.
[107] Hinton, G. E., Krizhevsky, A., and Wang, S. D. (2011, June). Transforming auto-encoders. In International Conference on Artificial Neural Networks (pp. 44-51). Springer, Berlin, Heidelberg.
[108] Sabour, S., Frosst, N., and Hinton, G. E. (2017). Dynamic routing between capsules. In Advances in Neural Information Processing Systems (pp. 3856-3866).
[109] Chingovska, I., Anjos, A., and Marcel, S. (2012, September). On the effectiveness of local binary patterns in face anti-spoofing. In Proceedings of the International Conference of Biometrics Special Interest Group (BIOSIG) (pp. 1-7). IEEE.
[110] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., and Nießner, M. (2018). FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179.
[111] Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., and Nießner, M. (2016). Face2Face: Real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2387-2395).
[112] Rahmouni, N., Nozick, V., Yamagishi, J., and Echizen, I. (2017, December). Distinguishing computer graphics from natural images using convolution neural networks. In 2017 IEEE Workshop on Information Forensics and Security (WIFS) (pp. 1-6). IEEE.
[113] Guan, H., Kozak, M., Robertson, E., Lee, Y., Yates, A. N., Delgado, A., ... and Fiscus, J. (2019, January). MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW) (pp. 63-72).
[114] Matern, F., Riess, C., and Stamminger, M. (2019, January). Exploiting visual artifacts to expose deepfakes and face manipulations. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW) (pp. 83-92). IEEE.
[115] Koopman, M., Rodriguez, A. M., and Geradts, Z. (2018). Detection of deepfake video manipulation. In The 20th Irish Machine Vision and Image Processing Conference (IMVIP) (pp. 133-136).
[116] Rosenfeld, K., and Sencar, H. T. (2009, February). A study of the robustness of PRNU-based camera identification. In Media Forensics and Security (Vol. 7254, p. 72540M). International Society for Optics and Photonics.
[117] Li, C. T., and Li, Y. (2012). Color-decoupled photo response non-uniformity for digital image forensics. IEEE Transactions on Circuits and Systems for Video Technology, 22(2), 260-271.
[118] Lin, X., and Li, C. T. (2017). Large-scale image clustering based on camera fingerprints. IEEE Transactions on Information Forensics and Security, 12(4), 793-808.
[119] Scherhag, U., Debiasi, L., Rathgeb, C., Busch, C., and Uhl, A. (2019). Detection of face morphing attacks based on PRNU analysis. IEEE Transactions on Biometrics, Behavior, and Identity Science, 1(4), 302-317.
[120] Phan, Q. T., Boato, G., and De Natale, F. G. (2019). Accurate and scalable image clustering based on sparse representation of camera fingerprint. IEEE Transactions on Information Forensics and Security, 14(7), 1902-1916.
[121] Hasan, H. R., and Salah, K. (2019). Combating deepfake videos using blockchain and smart contracts. IEEE Access, 7, 41596-41606.
[122] IPFS powers the Distributed Web. Available at https://ipfs.io/
[123] Zubiaga, A., Aker, A., Bontcheva, K., Liakata, M., and Procter, R. (2018). Detection and resolution of rumours in social media: A survey. ACM Computing Surveys (CSUR), 51(2), 1-36.
[124] Chesney, R., and Citron, D. K. (2018, October 16). Disinformation on steroids: The threat of deep fakes. Available at https://www.cfr.org/report/deep-fake-disinformation-steroids.
[125] Floridi, L. (2018). Artificial intelligence, deepfakes and a future of ectypes. Philosophy and Technology, 31(3), 317-321.
[126] Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., and Ferrer, C. C. (2020). The deepfake detection challenge dataset. arXiv preprint arXiv:2006.07397.
[127] Gandhi, A., and Jain, S. (2020). Adversarial perturbations fool deepfake detectors. arXiv preprint arXiv:2003.10596.
[128] Neekhara, P., Hussain, S., Jere, M., Koushanfar, F., and McAuley, J. (2020). Adversarial deepfakes: evaluating vulnerability of deepfake detectors to adversarial examples. arXiv preprint arXiv:2002.12749.
[129] Carlini, N., and Farid, H. (2020). Evading deepfake-image detectors with white- and black-box attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 658-659).
[130] Yang, C., Ding, L., Chen, Y., and Li, H. (2020). Defending against GAN-based deepfake attacks via transformation-aware adversarial faces. arXiv preprint arXiv:2006.07421.
[131] Yeh, C. Y., Chen, H. W., Tsai, S. L., and Wang, S. D. (2020). Disrupting image-translation-based deepfake algorithms with adversarial attacks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshops (pp. 53-62).
[132] Read, M. (2019, June 27). Can you spot a deepfake? Does it matter? Available at http://nymag.com/intelligencer/2019/06/how-do-you-spot-a-deepfake-it-might-not-matter.html.
[133] Agarwal, S., Farid, H., Fried, O., and Agrawala, M. (2020). Detecting deep-fake videos from phoneme-viseme mismatches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 660-661).
[134] Fried, O., Tewari, A., Zollhöfer, M., Finkelstein, A., Shechtman, E., Goldman, D. B., ... and Agrawala, M. (2019). Text-based editing of talking-head video. ACM Transactions on Graphics (TOG), 38(4), 1-14.
[135] Fernandes, S., Raj, S., Ewetz, R., Singh Pannu, J., Kumar Jha, S., Ortiz, E., ... and Salter, M. (2020). Detecting deepfake videos using attribution-based confidence metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 308-309).
[136] Cao, Q., Shen, L., Xie, W., Parkhi, O. M., and Zisserman, A. (2018, May). VGGFace2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition (pp. 67-74). IEEE.
[137] Jha, S., Raj, S., Fernandes, S., Jha, S. K., Jha, S., Jalaian, B., ... and Swami, A. (2019). Attribution-based confidence metric for deep neural networks. In Advances in Neural Information Processing Systems (pp. 11826-11837).

[138] Fernandes, S., Raj, S., Ortiz, E., Vintila, I., Salter, M., Urosevic, G., and Jha, S. (2019, October). Predicting heart rate variations of deepfake videos using neural ODE. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW) (pp. 1721-1729). IEEE.
[139] Korshunov, P., and Marcel, S. (2018). Deepfakes: a new threat to face recognition? assessment and detection. arXiv preprint arXiv:1812.08685.
[140] Chintha, A., Thai, B., Sohrawardi, S. J., Bhatt, K. M., Hickerson, A., Wright, M., and Ptucha, R. (2020). Recurrent convolutional structures for audio spoof and video deepfake detection. IEEE Journal of Selected Topics in Signal Processing, DOI: 10.1109/JSTSP.2020.2999185.
[141] Li, Y., Yang, X., Sun, P., Qi, H., and Lyu, S. (2020). Celeb-DF: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3207-3216).
[142] Todisco, M., Wang, X., Vestman, V., Sahidullah, M., Delgado, H., Nautsch, A., ... and Lee, K. A. (2019). ASVspoof 2019: Future horizons in spoofed and fake audio detection. arXiv preprint arXiv:1904.05441.
[143] Mittal, T., Bhattacharya, U., Chandra, R., Bera, A., and Manocha, D. (2020). Emotions don't lie: A deepfake detection method using audio-visual affective cues. arXiv preprint arXiv:2003.06711.
[144] Dolhansky, B., Howes, R., Pflaum, B., Baram, N., and Ferrer, C. C. (2019). The deepfake detection challenge (DFDC) preview dataset. arXiv preprint arXiv:1910.08854.
[145] Agarwal, S., El-Gaaly, T., Farid, H., and Lim, S. N. (2020). Detecting deep-fake videos from appearance and behavior. arXiv preprint arXiv:2004.14491.
[146] Dufour, N., and Gully, A. (2019). Contributing data to deepfake detection research. Available at https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html.
[147] Ciftci, U. A., Demir, I., and Yin, L. (2020). FakeCatcher: Detection of synthetic portrait videos using biological signals. IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2020.3009287.
[148] Huang, G. B., Mattar, M., Berg, T., and Learned-Miller, E. (2007, October). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, http://vis-www.cs.umass.edu/lfw/.
[149] Shaoanlu's GitHub. (2019). Few-Shot Face Translation GAN. Available at https://github.com/shaoanlu/fewshot-face-translation-GAN.
[150] Guarnera, L., Giudice, O., and Battiato, S. (2020). Deepfake detection by analyzing convolutional traces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 666-667).
[151] Cho, W., Choi, S., Park, D. K., Shin, I., and Choo, J. (2019). Image-to-image translation via group-wise deep whitening-and-coloring transformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 10639-10647).
[152] Choi, Y., Choi, M., Kim, M., Ha, J. W., Kim, S., and Choo, J. (2018). StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8789-8797).
[153] He, Z., Zuo, W., Kan, M., Shan, S., and Chen, X. (2019). AttGAN: Facial attribute editing by only changing what you want. IEEE Transactions on Image Processing, 28(11), 5464-5478.
[154] Karras, T., Laine, S., and Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4401-4410).
[155] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020). Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8110-8119).
[156] Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., and Guo, B. (2020). Face X-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5001-5010).
[157] Wang, S. Y., Wang, O., Zhang, R., Owens, A., and Efros, A. A. (2020). CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8695-8704).
[158] Dai, T., Cai, J., Zhang, Y., Xia, S. T., and Zhang, L. (2019). Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11065-11074).
[159] Maras, M. H., and Alexandrou, A. (2019). Determining authenticity of video evidence in the age of artificial intelligence and in the wake of deepfake videos. The International Journal of Evidence and Proof, 23(3), 255-262.
[160] Su, L., Li, C., Lai, Y., and Yang, J. (2017). A fast forgery detection algorithm based on exponential-Fourier moments for video region duplication. IEEE Transactions on Multimedia, 20(4), 825-840.
[161] Iuliani, M., Shullani, D., Fontani, M., Meucci, S., and Piva, A. (2018). A video forensic framework for the unsupervised analysis of MP4-like file container. IEEE Transactions on Information Forensics and Security, 14(3), 635-645.
[162] Malolan, B., Parekh, A., and Kazi, F. (2020, March). Explainable deep-fake detection using visual interpretability methods. In The 3rd International Conference on Information and Computer Technologies (ICICT) (pp. 289-293). IEEE.

Thanh Thi Nguyen was a Visiting Scholar with the Computer Science Department at Stanford University, California, USA in 2015 and the Edge Computing Lab, Harvard University, Massachusetts, USA in 2019. He received a European-Pacific Partnership for ICT Expert Exchange Program Award from the European Commission in 2018, and an Australia–India Strategic Research Fund Early- and Mid-Career Fellowship awarded by the Australian Academy of Science in 2020. Dr. Nguyen obtained a PhD in Mathematics and Statistics from Monash University, Australia, and has expertise in various areas, including AI, deep learning, computer vision, cyber security, IoT, and data science. He is currently a Senior Lecturer in the School of Information Technology, Deakin University, Victoria, Australia.

Quoc Viet Hung Nguyen received a PhD degree from EPFL, Switzerland. He is currently a senior lecturer with Griffith University, Australia. He has published several articles in top-tier venues such as SIGMOD, VLDB, SIGIR, KDD, AAAI, ICDE, IJCAI, JVLDB, TKDE, TOIS, and TIST. His research interests include data mining, data integration, data quality, information retrieval, trust management, recommender systems, machine learning, and big data visualization.

Duc Thanh Nguyen was awarded a PhD in Computer Science from the University of Wollongong, Australia in 2012. Currently, he is a lecturer in the School of Information Technology, Deakin University, Australia. His research interests include computer vision and pattern recognition. He has published his work in highly ranked venues in computer vision and pattern recognition, such as the Pattern Recognition journal, CVPR, ICCV, and ECCV. He has also served as a technical program committee member for many premium conferences such as CVPR, ICCV, ECCV, AAAI, ICIP, and PAKDD, and as a reviewer for the IEEE Trans. Intell. Transp. Syst., the IEEE Trans. Image Process., the IEEE Signal Processing Letters, Image and Vision Computing, Pattern Recognition, and Scientific Reports.

Cuong M. Nguyen received the B.Sc. and M.Sc. degrees in Mathematics from Vietnam National University, Hanoi, Vietnam. In 2017, he received the Ph.D. degree from the School of Engineering, Deakin University, Australia, where he worked as a postdoctoral researcher and sessional lecturer for several years. He is currently a postdoctoral researcher in Autonomous Vehicles at LAMIH UMR CNRS 8201, Université Polytechnique Hauts-de-France, France. His research interests lie in the areas of Optimization, Machine Learning, and Control Systems.

Saeid Nahavandi received a Ph.D. from Durham University, U.K. in 1991. He is an Alfred Deakin Professor, Pro Vice-Chancellor (Defence Technologies), Chair of Engineering, and the Director of the Institute for Intelligent Systems Research and Innovation at Deakin University, Victoria, Australia. His research interests include modelling of complex systems, machine learning, robotics, and haptics. He is a Fellow of Engineers Australia (FIEAust), the Institution of Engineering and Technology (FIET), and IEEE (FIEEE). He is the Co-Editor-in-Chief of the IEEE Systems Journal, an Associate Editor of the IEEE/ASME Transactions on Mechatronics, an Associate Editor of the IEEE Transactions on Systems, Man and Cybernetics: Systems, and an IEEE Access Editorial Board member.

Dung Tien Nguyen received the B.Eng. and M.Eng. degrees from the People's Security Academy and the University of Engineering and Technology, Vietnam National University, in 2006 and 2013, respectively. He received his PhD degree in the area of multimodal emotion recognition using deep learning techniques from Queensland University of Technology, Brisbane, Australia in 2019. He is currently working as a research fellow at Deakin University in computer vision, machine learning, deep learning, image processing, and affective computing.