DEEPFAKE DETECTION: CURRENT CHALLENGES AND NEXT STEPS

Siwei Lyu

Computer Science Department, University at Albany, State University of New York

ABSTRACT

High-quality fake videos and audio generated by AI algorithms (the deep fakes) have started to challenge the status of videos and audio as definitive evidence of events. In this paper, we highlight a few of these challenges and discuss the research opportunities in this direction.

Index Terms: DeepFake videos, detection techniques, digital media forensics

1. INTRODUCTION

Falsified videos created by AI algorithms, in particular deep neural networks (DNNs), are a recent twist to the disconcerting problem of online disinformation. Although fabrication and manipulation of digital images and videos are not new [1], the rapid development of DNNs in recent years has made the process of creating convincing fake videos increasingly easier and faster. DNN-generated fake videos first caught the public's attention in late 2017, when a Reddit account began posting synthetic pornographic videos generated using a DNN-based face-swapping algorithm. Subsequently, the term DeepFake has been used more broadly to refer to any AI-generated impersonating videos.

Currently, there are three major types of DeepFake videos.

• Head puppetry entails synthesizing a video of a target person's whole head and upper shoulders using a video of a source person's head, so that the synthesized target appears to behave the same way as the source.

• Face swapping involves generating a video of the target with the faces replaced by synthesized faces of the source while keeping the same facial expressions.

• Lip syncing creates a falsified video by manipulating only the lip region, so that the target appears to say something that s/he did not say in reality.

Fig. 1. Examples of DeepFake videos: (top) head puppetry, (middle) face swapping, and (bottom) lip syncing.

Figure 1 shows example frames of each of these types of DeepFake videos. As the first examples of DeepFakes, face swapping has been commercialized and mainstreamed through software freely available on GitHub, e.g., FakeApp [2], DFaker [3], faceswap-GAN [4], faceswap [5], and DeepFaceLab [6]. There are also emerging online services that can generate DeepFake videos on demand (https://deepfakesweb.com), and there are many online discussion fora on DeepFakes. Furthermore, several start-up companies have commercialized tools that can potentially be used to make DeepFakes, such as Synthesia (https://www.synthesia.io/) and Canny AI (https://www.cannyai.com/).

While there are interesting and creative applications of DeepFake videos, the strong association of faces with the identity of an individual means that they can also be weaponized. Well-crafted DeepFake videos can create illusions of a person's presence and activities that did not occur in reality, which can lead to serious political, social, financial, and legal consequences [7]. The potential threats range from revenge pornographic videos of a victim whose face is synthesized and spliced in, to realistic-looking videos of state leaders seeming to make inflammatory comments they never actually made, a high-level executive commenting about her company's performance to influence the global stock market, or an online sex predator masquerading visually as a family member or a friend in a video chat. The high stakes have spawned wide media coverage of this topic in the past two years, and the US Congress has held two public hearings on this problem.

With the escalated concerns over DeepFakes, there is a surge of interest in developing DeepFake detection methods, with significant progress witnessed in the past two years. This includes (1) a slew of effective detection methods developed in less than two years, mostly based on deep learning [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]; (2) the availability of several large-scale DeepFake video datasets [11, 19, 15, 20, 21]; and (3) two public challenges dedicated to DeepFake detection, namely, the DARPA MFC18 Synthetic Data Detection Challenge and the DeepFake Detection Challenge (https://deepfakedetectionchallenge.ai).

Notwithstanding this progress, there are a number of critical problems that are yet to be resolved for existing DeepFake detection methods. Furthermore, the generation of DeepFake videos is expected to continue evolving in the foreseeable future; it is thus important to anticipate such new developments and improve the detection methods accordingly. The main objective of this paper is to highlight a few of these challenges and discuss the research opportunities in this direction.

arXiv:2003.09234v1 [cs.CV] 11 Mar 2020

2. CURRENT DEEPFAKE DETECTION METHODS

Current DeepFake detection methods mostly target face-swapping videos, which account for the majority of DeepFake videos circulated online. Many of the existing methods are formulated as frame-level binary classification problems. Based on the features that are used, these methods fall into three major categories. Methods in the first category are based on inconsistencies exhibited in the physical/physiological aspects of the DeepFake videos. The method of [10] exploits the observation that many DeepFake videos lack reasonable eye blinking due to the use of online portraits as training data, which usually do not have closed eyes for aesthetic reasons. Incoherent head poses are utilized in [11] to expose DeepFake videos. In [22], the idiosyncratic behavioral patterns of a particular individual, captured by the time series of facial landmarks extracted from real videos, are used to spot DeepFake videos. The second category of DeepFake detection algorithms (e.g., [12, 13]) uses signal-level artifacts introduced during the synthesis process. Also, as synthesized faces are spliced into the original video frames, state-of-the-art DNN splicing detection methods, e.g., [23, 24, 25, 26], can be applied. The third category of DeepFake detection methods (e.g., [8, 9, 16, 18]) is data-driven: these methods directly employ various types of DNNs trained on real and DeepFake videos rather than targeting any specific artifact.

Fig. 2. Visual artifacts of DeepFake videos in existing datasets (panels labeled UADFV, DF-TIMIT-HQ, FF-DF, DFD, and DFDC), including low-quality synthesized faces, visible splicing boundaries, color mismatch, visible parts of the original face, and inconsistent face orientations.

2.1. Limitations

Although impressive progress has been made in the detection of DeepFake videos, there are several concerns over the current detection methods that suggest caution.

Quality of DeepFake Datasets. The availability of large-scale datasets of DeepFake videos is an enabling factor in the development of DeepFake detection methods. However, a closer look at the DeepFake videos in existing datasets reveals some stark contrasts in visual quality to the actual DeepFake videos circulated on the Internet. Several common visual artifacts that can be found in these datasets are highlighted in Fig. 2, including low-quality synthesized faces, visible splicing boundaries, color mismatch, visible parts of the original face, and inconsistent synthesized face orientations. These artifacts are likely the result of imperfect steps of the synthesis method and the lack of curation of the synthesized videos before they are included in the datasets. Moreover, DeepFake videos with such low visual quality can hardly be convincing, and are unlikely to have real impact. Correspondingly, high detection performance on these datasets may not bear strong relevance when the detection methods are deployed in the wild. A related issue is that DeepFake detection methods trained on one DeepFake dataset have trouble extending their performance to different datasets [27].

In a recent work [27], we present a new large-scale challenging DeepFake video dataset, Celeb-DF, which contains 5,639 high-quality DeepFake videos of celebrities generated using an improved synthesis process. We conduct a comprehensive evaluation of DeepFake detection methods and datasets to demonstrate the escalated level of challenges.

Fig. 3. Example frames from the Celeb-DF dataset. The left column shows frames of real videos and the right five columns show the corresponding DeepFake frames generated using different donor subjects.

Performance Evaluation. Currently, the problem of detecting DeepFake videos is commonly formulated, solved, and evaluated as a binary classification problem, where each video is categorized as real or a DeepFake. Such a dichotomy is easy to set up in controlled experiments, where we develop and test DeepFake detection algorithms using videos that are either pristine or made with DeepFake generation algorithms. However, the picture is murkier when the detection method is deployed in the real world. For instance, videos can be fabricated or manipulated in ways other than DeepFakes, so not being detected as a DeepFake video does not necessarily suggest that the video is a real one. Also, a DeepFake video may be subject to other types of manipulations, and a single label may not comprehensively reflect this. Furthermore, in a video with multiple subjects' faces, only one or a few may be generated with DeepFake, and only for a fraction of the frames. So the binary classification scheme needs to be extended to multi-class, multi-label, and local classification/detection to fully handle the complexities of real-world media forgeries.

Explainability of Detection Results. Current DeepFake detection methods are usually designed to perform batch analysis over a large collection of videos. However, when the detection methods are used in the field by journalists or law enforcement, we usually need to analyze only a small number of videos. A numerical score corresponding to the likelihood of a video being generated using a synthesis algorithm is not as useful to practitioners if it is not corroborated with proper reasoning for the score. In such scenarios, it is very typical to request a justification for the numerical score for the analysis to be acceptable for publishing or for use in court. However, many data-driven DeepFake detection methods, especially those based on deep neural networks, usually lack explainability due to the black-box nature of the DNN models.

Temporal Aggregation. Most existing DeepFake detection methods are based on binary classification at the frame level, i.e., determining the likelihood of an individual frame being real or DeepFake. Although simple and straightforward, this methodology has two issues. First, the temporal consistency among frames is not explicitly considered, even though (i) many DeepFake videos exhibit temporal artifacts and (ii) real or DeepFake frames tend to appear in continuous intervals. Second, it necessitates an extra step when a video-level integrity score is needed: we have to aggregate the scores over individual frames to compute such a score.

Social Media Laundering. A large fraction of online videos are now spread through social networks, e.g., Facebook, Instagram, and Twitter. To save network bandwidth and also to protect the users' privacy, these videos are usually stripped of metadata, down-sized, and then heavily compressed before they are uploaded to the social platforms. These operations, commonly known as social media laundering, are detrimental to recovering traces of underlying manipulation, and at the same time increase false positive detections, i.e., classifying a real video as a DeepFake. So far, most data-driven DeepFake detection methods that use signal-level features are much affected by social media laundering. A practical measure to improve the robustness of DeepFake detection methods to social media laundering is to actively incorporate simulations of such effects in training data, and also to enhance evaluation datasets to include performance on social-media-laundered videos, both real and synthesized.

3. FUTURE DIRECTIONS

Besides continuing to address the aforementioned limitations, we also envision a few important directions for DeepFake detection methods that will receive more attention in the coming years.

Other Forms of DeepFakes. Although face swapping is currently the most widely known form of DeepFake videos, it is by no means the most effective. In particular, for the purpose of impersonating someone, face-swapping DeepFake videos have several limitations. First, psychological studies [citation] show that human face recognition largely relies on information gleaned from face shape and hairstyle. As such, to create a convincing impersonation effect, the person whose face is to be replaced (the target) has to have a similar face shape and hairstyle to the person whose face is used for swapping (the donor). Second, as the synthesized faces need to be spliced into the original video frame, the inconsistencies between the synthesized region and the rest of the original frame can be severe and difficult to conceal.

In these respects, the other two forms of DeepFake videos, namely, head puppetry and lip syncing, are more effective and thus should become the focus of subsequent research in DeepFake detection. Methods studying whole face synthesis or reenactment have experienced fast development in recent years. Although there have not been as many easy-to-use and free open-source software tools generating these types of DeepFake videos as for the face-swapping videos, the continuing sophistication of the generation algorithms will change the situation in the near future. Because the synthesized region is different from that in face-swapping DeepFake videos (the whole face in head puppetry and the lip area in lip syncing), detection methods designed based on artifacts specific to face swapping are unlikely to be effective for these videos. Correspondingly, we should develop detection methods that are effective for these types of DeepFake videos.

Audio DeepFakes. AI-based impersonation is not limited to imagery; recent AI-synthesized content generation is leading to the creation of highly realistic audio [28, 29]. Using synthesized audio of the impersonated target can make DeepFake videos significantly more convincing and compound their negative impact. As audio signals are 1D signals and have a very different nature from images and videos, different methods need to be developed to specifically target such forgeries. This problem has drawn attention in the speech processing community recently, with part of the most recent Global ASVspoofing Challenge (https://www.asvspoof.org/) dedicated to AI-driven voice conversion detection, and a few dedicated methods for audio DeepFake detection, e.g., [30], have also appeared recently. In the coming years, we expect more developments in these areas, in particular those that can leverage both visual and audio features of the fake videos.

Intent Inference. Even though the potential negative impacts of DeepFake videos are tremendous, in reality, the majority of DeepFake videos are not created with malicious intent. Many DeepFake videos currently circulated online are of a pranksome, humorous, or satirical nature. As such, it is important to expose the underlying intent of a DeepFake in the context of legal or journalistic investigation. Inferring intent may require more semantic and contextual understanding of the content. Few forensic methods are designed to answer this question, but this is certainly a direction that future forensic methods will focus on.

Anti-forensics. With the increasing effectiveness of DeepFake detection methods, we also anticipate developments of corresponding anti-forensic measures, which take advantage of the vulnerabilities of current DeepFake detection methods to conceal revealing traces of DeepFake videos. Data-driven DeepFake detection methods based on deep neural networks are particularly susceptible to anti-forensic attacks due to the known vulnerability of general deep neural network classification models. Anti-forensic measures can also be developed in the other direction, to disguise a real video as a DeepFake video by adding simulated signal-level features used by current detection algorithms, a situation we term a fake DeepFake. DeepFake detection methods must further improve to handle such intentional and adversarial attacks.

Human Performance. Although the potential negative impacts of online DeepFake videos are widely recognized, there is currently a lack of formal and quantitative study of the perceptual and psychological factors underlying their deceptiveness. Interesting questions, such as whether there exists an uncanny valley for DeepFake videos, what the just noticeable difference is between high-quality DeepFake videos and real videos to human eyes, or what types/aspects of DeepFake videos are more effective in deceiving viewers, have yet to be answered. (The uncanny valley in this context refers to the phenomenon whereby a DeepFake-generated face bearing a near-identical resemblance to a human being arouses a sense of unease or revulsion in the viewers.) Pursuing these questions calls for close collaboration among researchers in digital media forensics and in perceptual and social psychology. There is no doubt that such studies are invaluable to research in detection techniques as well as to a better understanding of the social impact that DeepFakes can cause.

Protection Measures. Given the speed and reach of the propagation of online media, even the currently best forensic techniques will largely operate in a postmortem fashion, applicable only after AI-synthesized fake face images or videos emerge. We aim to develop proactive approaches to protect individuals from becoming the victims of such attacks, which complement the forensic tools. One such method we have studied recently [31] is to add specially designed patterns, known as adversarial perturbations, that are imperceptible to human eyes but can result in detection failures. The rationale is as follows. High-quality AI face synthesis models need a large number of training face images, typically in the range of thousands and sometimes even millions, collected using automatic face detection methods, i.e., the face sets. Adversarial perturbations “pollute” a face set so that it has few actual faces and many non-faces with low or no utility as training data for AI face synthesis models (Fig. 4). The proposed adversarial perturbation generation method can be implemented as a service of photo/video sharing platforms, applied before a user's personal images/videos are uploaded, or as a standalone tool that the user can use to process the images and videos before they are uploaded online.

Fig. 4. Overview of the proposed method of disrupting AI face synthesis. Our aim is to use adversarial perturbations (amplified by 30 for better visualization) to distract DNN-based face detectors, such that the quality of the obtained face set as training data for the AI face synthesis is reduced.

4. CONCLUSION

We predict that several future technological developments will further improve the visual quality and generation efficiency of the fake videos. Firstly, one critical disadvantage of current DeepFake generation methods is that they cannot produce good details such as skin and facial hair. This is due to the loss of information in the encoding step of generation. However, this can be improved by incorporating GAN models [32], which have demonstrated strong performance in recovering facial details in recent works [33, 34]. Secondly, the synthesized videos can be more realistic if they are accompanied by realistic voices, which combines video and audio synthesis together in one tool.

In the face of this, the overall running efficiency, detection accuracy, and, more importantly, false positive rate of detection methods have to be improved for wide practical adoption. The detection methods also need to be more robust to real-life post-processing steps, social media laundering, and counter-forensic technologies. There is a perpetual competition of technology, know-how, and skills between the forgery makers and digital media forensic researchers. The future will be the judge of the predictions we make in this work.

5. REFERENCES

[1] Digital Image Forensics, MIT Press, 2012.
[2] “FakeApp,” https://www.malavida.com/en/soft/fakeapp/, Accessed Nov 4, 2019.
[3] “DFaker github,” https://github.com/dfaker/df, Accessed Nov 4, 2019.
[4] “faceswap-GAN github,” https://github.com/shaoanlu/faceswap-GAN, Accessed Nov 4, 2019.
[5] “faceswap github,” https://github.com/deepfakes/faceswap, Accessed Nov 4, 2019.
[6] “DeepFaceLab github,” https://github.com/iperov/DeepFaceLab, Accessed Nov 4, 2019.
[7] Robert Chesney and Danielle Keats Citron, “Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security,” 107 California Law Review (2019, Forthcoming); U of Texas Law, Public Law Research Paper No. 692; U of Maryland Legal Studies Research Paper No. 2018-21.
[8] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen, “Mesonet: a compact facial video forgery detection network,” in IEEE International Workshop on Information Forensics and Security (WIFS), 2018.
[9] David Güera and Edward J Delp, “Deepfake video detection using recurrent neural networks,” in AVSS, 2018.
[10] Yuezun Li, Ming-Ching Chang, and Siwei Lyu, “In ictu oculi: Exposing AI generated fake face videos by detecting eye blinking,” in IEEE International Workshop on Information Forensics and Security (WIFS), 2018.
[11] Xin Yang, Yuezun Li, and Siwei Lyu, “Exposing deep fakes using inconsistent head poses,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[12] Falko Matern, Christian Riess, and Marc Stamminger, “Exploiting visual artifacts to expose deepfakes and face manipulations,” in IEEE Winter Applications of Computer Vision Workshops (WACVW), 2019.
[13] Yuezun Li and Siwei Lyu, “Exposing deepfake videos by detecting face warping artifacts,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
[14] Ekraam Sabir, Jiaxin Cheng, Ayush Jaiswal, Wael AbdAlmageed, Iacopo Masi, and Prem Natarajan, “Recurrent-convolution approach to deepfake detection-state-of-art results on faceforensics++,” arXiv preprint arXiv:1905.00582, 2019.
[15] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner, “FaceForensics++: Learning to detect manipulated facial images,” in ICCV, 2019.
[16] Huy H Nguyen, Junichi Yamagishi, and Isao Echizen, “Capsule-forensics: Using capsule networks to detect forged images and videos,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[17] Huy H Nguyen, Fuming Fang, Junichi Yamagishi, and Isao Echizen, “Multi-task learning for detecting and segmenting manipulated facial images and videos,” in IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS), 2019.
[18] Huy H Nguyen, Junichi Yamagishi, and Isao Echizen, “Use of a capsule network to detect fake images and videos,” arXiv preprint arXiv:1910.12467, 2019.
[19] Pavel Korshunov and Sébastien Marcel, “Deepfakes: a new threat to face recognition? assessment and detection,” arXiv preprint arXiv:1812.08685, 2018.
[20] Nicholas Dufour, Andrew Gully, Per Karlsson, Alexey Victor Vorbyov, Thomas Leung, Jeremiah Childs, and Christoph Bregler, “Deepfakes detection dataset by google & jigsaw.”
[21] Brian Dolhansky, Russ Howes, Ben Pflaum, Nicole Baram, and Cristian Canton Ferrer, “The deepfake detection challenge (DFDC) preview dataset,” arXiv preprint arXiv:1910.08854, 2019.
[22] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li, “Protecting world leaders against deep fakes,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
[23] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis, “Two-stream neural networks for tampered face detection,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
[24] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis, “Learning rich features for image manipulation detection,” in CVPR, 2018.
[25] Yaqi Liu, Qingxiao Guan, Xianfeng Zhao, and Yun Cao, “Image forgery localization based on multi-scale convolutional neural networks,” in ACM Workshop on Information Hiding and Multimedia Security (IHMMSec), 2018.
[26] Jawadul H Bappy, Cody Simons, Lakshmanan Nataraj, BS Manjunath, and Amit K Roy-Chowdhury, “Hybrid lstm and encoder-decoder architecture for detection of image forgeries,” IEEE Transactions on Image Processing (TIP), 2019.
[27] Yuezun Li, Pu Sun, Honggang Qi, and Siwei Lyu, “Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, United States, 2020.
[28] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller, “Deep voice 3: 2000-speaker neural text-to-speech,” arXiv preprint arXiv:1710.07654, 2017.
[29] Yu Gu and Yongguo Kang, “Multi-task WaveNet: A multi-task generative model for statistical parametric speech synthesis without fundamental frequency conditions,” in Interspeech, Hyderabad, India, 2018.
[30] Ehab AlBadawy, Siwei Lyu, and Hany Farid, “Detecting ai-synthesized speech using bispectral analysis,” in Workshop on Media Forensics (in conjunction with CVPR), Long Beach, CA, United States, 2019.
[31] Yuezun Li, Xin Yang, Baoyuan Wu, and Siwei Lyu, “Hiding faces in plain sight: Disrupting ai face synthesis with adversarial perturbations,” 2019.
[32] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[33] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” The International Conference on Learning Representations (ICLR), 2017.
[34] Tero Karras, Samuli Laine, and Timo Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
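
Illustrative note on temporal aggregation. The Temporal Aggregation concern in Section 2.1 names two gaps in frame-level detection: temporal consistency is ignored, and an extra step is needed to obtain a video-level score. The sketch below is one simple way to address both, offered only as an illustration and not as a method from this paper or from any cited work. It assumes a frame-level detector has already produced one fake probability per frame. The function names (`smooth_scores`, `fake_intervals`, `video_score`), the window size, the threshold, the top-fraction rule, and the sample scores are all hypothetical choices made for this example.

```python
from statistics import mean


def smooth_scores(frame_scores, window=5):
    """Centered moving average over per-frame fake probabilities.

    Smoothing encodes the assumption that real or fake frames
    tend to appear in continuous intervals.
    """
    half = window // 2
    smoothed = []
    for i in range(len(frame_scores)):
        lo = max(0, i - half)
        hi = min(len(frame_scores), i + half + 1)
        smoothed.append(mean(frame_scores[lo:hi]))
    return smoothed


def fake_intervals(scores, threshold=0.5, min_length=3):
    """Return (start, end) index pairs, end exclusive, of runs of
    at least min_length consecutive frames scoring >= threshold."""
    intervals = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_length:
                intervals.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_length:
        intervals.append((start, len(scores)))
    return intervals


def video_score(scores, top_fraction=0.2):
    """Video-level score: mean of the highest top_fraction of frame scores.

    Averaging only the top frames keeps a short manipulated segment
    from being diluted by many pristine frames.
    """
    if not scores:
        raise ValueError("need at least one frame score")
    k = max(1, int(len(scores) * top_fraction))
    return mean(sorted(scores, reverse=True)[:k])


# Hypothetical per-frame outputs of a frame-level detector:
# one isolated spike (frame 2) and one sustained run (frames 5 to 9).
frame_scores = [0.1, 0.2, 0.9, 0.1, 0.1, 0.8, 0.9, 0.85, 0.9, 0.8, 0.2, 0.1]
smoothed = smooth_scores(frame_scores, window=3)
print(fake_intervals(smoothed))         # [(5, 10)]
print(round(video_score(smoothed), 3))  # 0.867
```

In this toy input, smoothing suppresses the isolated spike at frame 2 while the sustained run over frames 5 to 9 survives as a single flagged interval. The top-fraction average is only one of several plausible aggregation rules; a plain mean, a maximum, or a learned temporal model would be equally reasonable starting points, and the choice would need to be validated on real data.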