We Need No Pixels: Video Manipulation Detection Using Stream Descriptors

David Güera 1 Sriram Baireddy 1 Paolo Bestagini 2 Stefano Tubaro 2 Edward J. Delp 1

1 Video and Image Processing Laboratory (VIPER), Purdue University, West Lafayette, Indiana, USA. 2 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy. Correspondence to: David Güera, Edward J. Delp.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Manipulating video content is easier than ever. Due to the misuse potential of manipulated content, multiple detection techniques that analyze the pixel data from the videos have been proposed. However, clever manipulators should also carefully forge the metadata and auxiliary header information, which is harder to do for videos than images. In this paper, we propose to identify forged videos by analyzing their multimedia stream descriptors with simple binary classifiers, completely avoiding the pixel space. Using well-known datasets, our results show that this scalable approach can achieve a high manipulation detection score if the manipulators have not done a careful data sanitization of the multimedia stream descriptors.

Descriptor              Video 1                  Video 2                  Video 3
video@codec_long_name   H.264 / AVC / MPEG-4     H.264 / AVC / MPEG-4     H.264 / AVC / MPEG-4
                        AVC / MPEG-4 part 10     AVC / MPEG-4 part 10     AVC / MPEG-4 part 10
video@profile           Main                     High 4:4:4 Predictive    High
video@codec_time_base   0.02                     0.016667                 0.0208542
video@pix_fmt           yuv420p                  gbrp                     yuv420p
video@level             31                       50                       40
video@bit_rate          9.07642e+06              2.0245e+08               4.46692e+06

Figure 1. Examples of some of the information extracted from the video stream descriptors. These descriptors are necessary to decode and playback a video.

1. Introduction

Video manipulation is now within reach of any individual. Recent improvements in the machine learning field have enabled the creation of powerful video manipulation tools. Face2Face (Thies et al., 2016), Recycle-GAN (Bansal et al., 2018), Deepfakes (Korshunov & Marcel, 2018), and other face swapping techniques (Korshunova et al., 2017) embody the latest generation of these open source video forging methods. It is assumed as a certainty both by the research community (Brundage et al., 2018) and governments across the globe (Vincent, 2018; Chesney & Citron, 2018) that more complex tools will appear in the near future. Classical and current video manipulation methods have already demonstrated dangerous potential, having been used to generate political disinformation (Bird, 2015), revenge-porn (Curtis, 2018), and child-exploitation material (Cole, 2018).

Due to the ever increasing sophistication of these techniques, uncovering manipulations in videos remains an open problem. Existing video manipulation detection solutions focus entirely on the observance of anomalies in the pixel domain of the video. Unfortunately, it can be easily seen from a game theoretic perspective that, if both manipulators and detectors are equally powerful, a Nash equilibrium will be reached (Stamm et al., 2012a). Under that scenario, both real and manipulated videos will be indistinguishable from each other, and the best detector will only be capable of random guessing. Hence, methods that look beyond the pixel domain are critically needed. So far, little attention has been paid to the necessary metadata and auxiliary header information that is embedded in every video. As we shall present, this information can be exploited to uncover unskilled video content manipulators.

In this paper, we introduce a new approach to address the video manipulation detection problem. To avoid the zero-sum, leader-follower game that characterizes current detection solutions, our approach completely avoids the pixel domain. Instead, we use the multimedia stream descriptors (Jack, 2007) that ensure the playback of any video (as shown in Figure 1). First, we construct a feature vector with all the descriptor information for a given video. Using a database of known manipulated videos, we train an ensemble of classifiers that acts as our detector. Finally, during testing, we generate the feature vector from the stream descriptors of the video under analysis, feed it to the ensemble, and report a manipulation probability.
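As a concrete illustration of what these descriptors look like, the sketch below dumps a video's stream descriptors into the flat video@field form used in Figure 1. This is a minimal sketch rather than the authors' released code: the ffprobe options are standard, but the function name and the exact flattening scheme are our assumptions.

```python
import json
import subprocess

def extract_stream_descriptors(video_path):
    """Dump the stream descriptors of a video as one flat dictionary.

    Keys are prefixed with the stream's codec type (e.g. "video@profile"),
    mirroring the notation of Figure 1. The flattening scheme is an
    assumption; the paper does not specify its exact feature naming.
    """
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_streams", video_path],
        capture_output=True, text=True, check=True)
    descriptors = {}
    for stream in json.loads(result.stdout).get("streams", []):
        codec_type = stream.get("codec_type", "unknown")  # "video", "audio", ...
        for field, value in stream.items():
            if isinstance(value, (str, int, float)):      # skip nested structures
                descriptors[f"{codec_type}@{field}"] = value
    return descriptors

# Example: extract_stream_descriptors("suspect.mp4")["video@pix_fmt"] -> "yuv420p"
```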

The contributions of this paper are summarized as follows. First, we introduce a new technique that does not require access to the pixel content of the video, making it fast and scalable, even on consumer grade computing equipment. Instead, we rely on the multimedia descriptors present in any video, which are considerably harder to manipulate due to their role in the decoding phase. Second, we thoroughly test our approach using the NIST MFC datasets (Guan et al., 2019) and show that even with a limited amount of labeled videos, simple machine learning ensembles can be highly effective detectors. Finally, all of our code and trained classifiers will be made available1 so the research community can reproduce our work with their own datasets.

1 https://github.com/dguera/fake-video-detection-without-pixels

2. Related Work

The multimedia forensics research community has a long history of trying to address the problem of detecting manipulations in video sequences. (Milani et al., 2012) provide an extensive and thorough overview of the main research directions and solutions that have been explored in the last decade. More recent work has focused on specific video manipulations, such as local tampering detection in video sequences (Stamm et al., 2012b; Bestagini et al., 2013), video re-encoding detection (Bian et al., 2014; Bestagini et al., 2016), splicing detection in videos (Hsu et al., 2008; Mullan et al., 2017; Mandelli et al., 2018), and near-duplicate video detection (Bayram et al., 2008; Lameri et al., 2017). (D'Amiano et al., 2015; 2019) also present solutions that use 3D PatchMatch (Barnes et al., 2009) for video forgery detection and localization, whereas (D'Avino et al., 2017) suggest using data-driven machine learning based approaches. Solutions tailored to detecting the latest video manipulation techniques have also been recently presented. These include the works of (Li et al., 2018; Güera & Delp, 2018) on detecting Deepfakes and (Rössler et al., 2018; Matern et al., 2019) on Face2Face (Thies et al., 2016) manipulation detection.

As covered by (Milani et al., 2012), image-based forensics techniques that leverage camera noise residuals (Khanna et al., 2008), image compression artifacts (Bianchi & Piva, 2012), or geometric and physics inconsistencies in the scene (Bulan et al., 2009) can also be used on videos when applied frame by frame. In (Fan et al., 2011) and (Huh et al., 2018), Exif image metadata is used to detect image brightness and contrast adjustments, and splicing manipulations in images, respectively. Finally, (Iuliani et al., 2019) use video file container metadata for video integrity verification and source device identification. To the best of our knowledge, video manipulation detection techniques that exploit the multimedia stream descriptors have not been previously proposed.

3. Proposed Method

Current video manipulation detection approaches rely on uncovering manipulations by studying pixel domain anomalies. Instead, we propose to use the multimedia stream descriptors of videos as our main source of information to spot manipulated content. To do so, our method works in two stages, as presented in Figure 2. First, during the training phase, we extract the multimedia stream descriptors from a labeled database of manipulated and pristine videos. In practice, such a database can be easily constructed using a limited amount of manually labeled data coupled with a semi-supervised learning approach, as done by (Zannettou et al., 2018). Then, we encode these descriptors as a feature vector for each given video. We apply median normalization to all numerical features. As for categorical features, each is encoded as its own unique numerical value. Once we have processed all the videos in the database, we use all the feature vectors to train different binary classifiers as our detectors. More specifically, we use a random forest, a support vector machine (SVM), and an ensemble of both detectors. The best hyperparameters for each detector are selected by performing a random search cross-validation over a 10-split stratified shuffling of the data, with 1,000 trials per split. Figure 2a summarizes this first stage, and a sketch of it is given below. In our implementation, we use ffprobe (Bellard et al., 2019) for the multimedia stream descriptor extraction. For the encoding of the descriptors as feature vectors, we use pandas (McKinney, 2010) and scikit-learn (Pedregosa et al., 2011). As for the training and testing of the SVM, the random forest, and the ensemble, we use the implementations available in the scikit-learn library.

Figure 2b shows how our method would work in practice. Given a suspect video, we extract its stream descriptors and generate its corresponding feature vector, which is normalized based on the values learnt during the training phase. Since some of the descriptor fields are optional, we perform additional post-processing to ensure that the feature vector can be processed by our trained detector. Concretely, if any field is missing in the video stream descriptors, we perform data imputation by mapping missing fields to a fixed numerical value. If previously unseen descriptor fields are present in the suspect video stream, they are ignored and not included in the corresponding suspect feature vector. Finally, the trained detector analyzes the suspect feature vector and computes a manipulation probability.

It is important to note that although our approach may be vulnerable to video re-encoding attacks, this is traded off for scalability, a limited need for labeled data, and a high video manipulation detection score, as we present in Section 4. Moreover, our solution is orthogonal to pixel-based methods and requires limited amounts of data, which means that, ideally, both approaches could be used simultaneously.
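The training stage of Figure 2a maps naturally onto pandas and scikit-learn, which the paper names as its tooling. The following is our reading of Section 3, not the released implementation: the DescriptorEncoder helper, the MISSING sentinel, and the hyperparameter distributions are all assumptions, and "median normalization" is interpreted here as division by the training-set median.

```python
import pandas as pd
from scipy.stats import loguniform, randint
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

MISSING = -1.0  # fixed numerical value used to impute absent fields (Section 3)

class DescriptorEncoder:
    """Encode descriptor dicts as feature vectors (hypothetical helper).

    Numerical fields are median-normalized; categorical fields are mapped
    to unique numerical codes. All statistics are learnt at training time.
    """

    def fit(self, descriptor_dicts):
        df = pd.DataFrame(descriptor_dicts)
        self.columns, self.medians, self.codes = df.columns, {}, {}
        for col in df.columns:
            present = df[col].dropna()
            numeric = pd.to_numeric(present, errors="coerce")
            if numeric.notna().all():          # numerical descriptor field
                self.medians[col] = numeric.median()
            else:                              # categorical descriptor field
                self.codes[col] = {v: i for i, v in enumerate(present.unique())}
        return self

    def transform(self, descriptor_dicts):
        # Reindexing drops fields never seen during training (Section 3).
        df = pd.DataFrame(descriptor_dicts).reindex(columns=self.columns)
        for col, median in self.medians.items():
            df[col] = pd.to_numeric(df[col], errors="coerce") / median
        for col, codes in self.codes.items():
            df[col] = df[col].map(codes)       # unseen categories become NaN
        return df.fillna(MISSING)              # impute missing fields

def train_detector(X, y):
    """Random-search CV over a 10-split stratified shuffling of the data."""
    cv = StratifiedShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
    forest = RandomizedSearchCV(
        RandomForestClassifier(),
        {"n_estimators": randint(10, 500), "max_depth": randint(2, 30)},
        n_iter=1000, cv=cv, random_state=0).fit(X, y).best_estimator_
    svm = RandomizedSearchCV(
        SVC(probability=True),                 # probabilities for soft voting
        {"C": loguniform(1e-2, 1e3), "gamma": loguniform(1e-4, 1e1)},
        n_iter=1000, cv=cv, random_state=0).fit(X, y).best_estimator_
    # Soft-voting ensemble weighted 4 to 1 toward the forest (Section 4.2).
    ensemble = VotingClassifier([("rf", forest), ("svm", svm)],
                                voting="soft", weights=[4, 1])
    return ensemble.fit(X, y)
```

In this reading, the "1,000 trials per split" budget is folded into RandomizedSearchCV's n_iter; the paper does not spell out the exact search spaces.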
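The testing stage of Figure 2b then reduces to a few lines. This is again a sketch under the same assumptions; score_suspect_video is a hypothetical name, and label 1 is assumed to mark manipulated videos.

```python
def score_suspect_video(video_path, encoder, detector):
    """Report a manipulation probability for a single suspect video.

    Unseen descriptor fields are dropped and missing ones are imputed with
    the fixed MISSING value by the encoder, as described in Section 3.
    """
    raw = extract_stream_descriptors(video_path)   # ffprobe sketch above
    features = encoder.transform([raw])
    return detector.predict_proba(features)[0, 1]  # P(manipulated), class 1

# Hypothetical end-to-end usage:
# encoder = DescriptorEncoder().fit(train_dicts)
# detector = train_detector(encoder.transform(train_dicts), train_labels)
# print(score_suspect_video("suspect.mp4", encoder, detector))
```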

[Figure 2a: block diagram of the training stage. Labeled Videos → Stream Descriptor Extraction → Feature Vector Construction → Labeled Feature Vectors → Detector Training → Trained Manipulation Detector.]

[Figure 2b: block diagram of the testing stage. Suspect Video → Stream Descriptor Extraction → Feature Vector Construction → Suspect Feature Vector → Trained Manipulation Detector → Manipulation Probability.]

Figure 2. (a) Block diagram of the training stage of our proposed method. We process a labeled database of manipulated and pristine videos to generate a feature vector for each video from its multimedia stream descriptors. These feature vectors are then used to train and select the best detector. (b) Block diagram of the testing stage of our proposed method. Given a suspect video, a feature vector is generated and processed by the previously selected detector. Finally, a manipulation probability for the suspect video is reported.

Our approach could be used to quickly identify manipulated videos, minimizing the need to rely on human annotation. Later, these newly labeled videos could be used to improve the performance of pixel-based video manipulation detectors. Finally, following the recommendations of (Brundage et al., 2018), we want to reflect on a potential misuse of the proposed approach. We believe that our approach could be misused by someone with access to large amounts of labeled video data. Using that information, a malevolent adversary could identify specific individuals, such as journalists or confidential informants, who may submit anonymous videos using the same devices they use to upload videos to social media websites. To avoid this, different physical devices or proper video data sanitization should be used.

4. Experimental Results

4.1. Datasets

In order to evaluate the performance of our proposed approach, we use the Media Forensics Challenge (MFC) datasets (Guan et al., 2019). Collected by the National Institute of Standards and Technology (NIST), this data comprises over 11,000 high provenance videos and 4,000 manipulated videos. In our experiments, we use the videos from the following datasets for training, hyper-parameter selection, and validation: the Nimble Challenge 2017 development dataset, the MFC18 development version 1 and version 2 datasets, and the MFC18 GAN dataset. This represents a total of 677 videos, of which 167 are manipulated. For testing our model, we use the MFC18 evaluation dataset and the MFC19 validation dataset, which have a total of 1,097 videos. Of those videos, 336 have been manipulated.

4.2. Experimental Setup

To show the merits of our method in terms of scalability and limited compute requirements, we design the following experiment. First, we select machine learning binary classifiers that are well known for their modeling capabilities, even with limited access to training samples. As previously mentioned, we use a random forest, a support vector machine, and a soft voting classification ensemble with both. This final ensemble is weighted 4 to 1 in favor of the decision of the random forest. Then, to show the performance of each detector under different data availability scenarios, we train them using 10%, 25%, 50% and 75% of the available training data. We use a stratified shuffle splitting policy to select these training subsets, meaning that the global ratio of manipulated to non-manipulated videos of the entire training set is preserved in the subsets. In all scenarios, a sequestered 25% subset of the training data is used for hyper-parameter selection and validation. Finally, the best validated model is selected for testing. Due to the imbalance of manipulated to non-manipulated videos, we use the Precision-Recall (PR) curve as our evaluation metric, as recommended by (Saito & Rehmsmeier, 2015). We also report the F1 score, the area under the curve (AUC) score, and the average precision (AP) score for each classifier. A sketch of this protocol is shown below.

4.3. Results and Discussion

As we can see in Figure 3, Figure 4, Figure 5, and Figure 6, under all scenarios the voting ensemble of the random forest and the support vector machine generally achieves the best overall results, followed by the random forest and the SVM.
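To make the protocol of Section 4.2 concrete, the sketch below shows how the stratified training subsets and the reported metrics could be computed with scikit-learn. It is our reconstruction, not the paper's code; in particular, the 0.5 decision threshold used for the F1 score is an assumption.

```python
from sklearn.metrics import (auc, average_precision_score, f1_score,
                             precision_recall_curve)
from sklearn.model_selection import train_test_split

def stratified_subset(X, y, fraction, seed=0):
    """Training subset preserving the manipulated/pristine ratio (Sec. 4.2)."""
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=fraction, stratify=y, random_state=seed)
    return X_sub, y_sub

def pr_metrics(detector, X_test, y_test):
    """F1, area under the PR curve, and average precision, as in Figures 3-6."""
    scores = detector.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, scores)
    return {"F1": f1_score(y_test, scores >= 0.5),   # 0.5 threshold assumed
            "AUC": auc(recall, precision),           # area under the PR curve
            "AP": average_precision_score(y_test, scores)}

# e.g. for fraction in (0.10, 0.25, 0.50, 0.75):
#          X_sub, y_sub = stratified_subset(X_train, y_train, fraction)
```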

[Figure 3: PR curves on the test set. Plot legend: Baseline (F1=0.306, AUC=0.306, AP=0.306); SVM (F1=0.852, AUC=0.943, AP=0.943); Random Forest (F1=0.917, AUC=0.981, AP=0.981); Ensemble (F1=0.917, AUC=0.984, AP=0.984).]

Figure 3. PR curves, F1 score, AUC score, and AP score on the test set for all the trained models using 10% of the available training data (68 videos).

[Figure 5: PR curves on the test set. Plot legend: Baseline (F1=0.306, AUC=0.306, AP=0.306); SVM (F1=0.884, AUC=0.936, AP=0.936); Random Forest (F1=0.880, AUC=0.951, AP=0.951); Ensemble (F1=0.880, AUC=0.955, AP=0.955).]

Figure 5. PR curves, F1 score, AUC score, and AP score on the test set for all the trained models using 50% of the available training data (339 videos).

[Figure 4: PR curves on the test set. Plot legend: Baseline (F1=0.306, AUC=0.306, AP=0.306); SVM (F1=0.853, AUC=0.932, AP=0.932); Random Forest (F1=0.917, AUC=0.981, AP=0.981); Ensemble (F1=0.917, AUC=0.984, AP=0.984).]

Figure 4. PR curves, F1 score, AUC score, and AP score on the test set for all the trained models using 25% of the available training data (169 videos).

[Figure 6: PR curves on the test set. Plot legend: Baseline (F1=0.306, AUC=0.306, AP=0.306); SVM (F1=0.885, AUC=0.937, AP=0.937); Random Forest (F1=0.884, AUC=0.961, AP=0.961); Ensemble (F1=0.884, AUC=0.965, AP=0.965).]

Figure 6. PR curves, F1 score, AUC score, and AP score on the test set for all the trained models using 75% of the available training data (508 videos).

More specifically, our best ensemble model achieves an F1 score of 0.917, an AUC score of 0.984, and an AP score of 0.984. To contextualize these results, we have included the performance of a binary classifier baseline which predicts a video manipulation with probability p = 0.306. This corresponds to the true fraction of manipulated videos in the test set. Note that it is higher than the fraction of manipulated videos in the training subsets, which is 0.247. This baseline model would achieve an F1, AUC, and AP score of 0.306: intuitively, a chance detector has a precision equal to the positive fraction (0.306) at every recall level, so its PR curve is flat and all three metrics collapse to 0.306. We can see that our best model is three times better than the baseline in all reported metrics. Notice that, as seen in Figure 3, the ensemble trained with 68 videos has achieved equal or better results than the ensembles trained with more videos. This shows that, even with a very limited number of stream descriptors, a properly tuned machine learning model can be trained easily to spot video manipulations.

5. Conclusion

Up until now, most video manipulation detection techniques have focused on analyzing the pixel data to spot forged content. In this paper, we have shown how simple machine learning classifiers can be highly effective at detecting video manipulations when the appropriate data is used. More specifically, we use an ensemble of a random forest and an SVM trained on multimedia stream descriptors from both forged and pristine videos. With this approach, we have achieved an extremely high video manipulation detection score while requiring very limited amounts of data. Based on our findings, our future work will focus on techniques that automatically perform data sanitization. This will allow us to remove metadata and auxiliary header information that may give away sensitive information such as the source of the video.

Acknowledgements

This material is based on research sponsored by DARPA and the Air Force Research Laboratory (AFRL) under agreement number FA8750-16-2-0173. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA and the Air Force Research Laboratory (AFRL) or the U.S. Government.

References

Bansal, A., Ma, S., Ramanan, D., and Sheikh, Y. Recycle-GAN: Unsupervised video retargeting. Proceedings of the European Conference on Computer Vision, pp. 119–135, September 2018. URL https://doi.org/10.1007/978-3-030-01228-1_8. Munich, Germany.

Barnes, C., Shechtman, E., Finkelstein, A., and Goldman, D. B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics, 28(3):24:1–24:11, July 2009. URL https://doi.org/10.1145/1531326.1531330.

Bayram, S., Sencar, H. T., and Memon, N. Video copy detection based on source device characteristics: A complementary approach to content-based methods. Proceedings of the ACM International Conference on Multimedia Information Retrieval, pp. 435–442, October 2008. URL https://doi.org/10.1145/1460096.1460167. Vancouver, British Columbia, Canada.

Bellard, F. et al. ffprobe documentation. April 2019. URL https://www.ffmpeg.org/ffprobe.html. (Accessed on 04/17/2019).

Bestagini, P., Milani, S., Tagliasacchi, M., and Tubaro, S. Local tampering detection in video sequences. Proceedings of the IEEE International Workshop on Multimedia Signal Processing, pp. 488–493, September 2013. URL https://doi.org/10.1109/MMSP.2013.6659337. Pula, Italy.

Bestagini, P., Milani, S., Tagliasacchi, M., and Tubaro, S. Codec and GOP identification in double compressed videos. IEEE Transactions on Image Processing, 25(5):2298–2310, May 2016. URL https://doi.org/10.1109/TIP.2016.2541960.

Bian, S., Luo, W., and Huang, J. Exposing fake bit rate videos and estimating original bit rates. IEEE Transactions on Circuits and Systems for Video Technology, 24(12):2144–2154, December 2014. URL https://doi.org/10.1109/TCSVT.2014.2334031.

Bianchi, T. and Piva, A. Image forgery localization via block-grained analysis of JPEG artifacts. IEEE Transactions on Information Forensics and Security, 7(3):1003–1017, June 2012. URL https://doi.org/10.1109/TIFS.2012.2187516.

Bird, M. The video in which Greece's finance minister gives Germany the finger has several bizarre new twists. March 2015. URL https://www.businessinsider.com/yanis-varoufakis-middle-finger-controversy-real-fake-bohmermann-jauch-2015-3. (Accessed on 04/17/2019).

Brundage, M., Avin, S., Clark, J., Toner, H., Eckersley, P., Garfinkel, B., Dafoe, A., Scharre, P., Zeitzoff, T., Filar, B., Anderson, H., Roff, H., Allen, G. C., Steinhardt, J., Flynn, C., hÉigeartaigh, S. Ó., Beard, S., Belfield, H., Farquhar, S., Lyle, C., Crootof, R., Evans, O., Page, M., Bryson, J., Yampolskiy, R., and Amodei, D. The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. arXiv:1802.07228v1, February 2018. URL http://arxiv.org/abs/1802.07228v1.

Bulan, O., Mao, J., and Sharma, G. Geometric distortion signatures for printer identification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1401–1404, April 2009. URL https://doi.org/10.1109/ICASSP.2009.4959855.

Chesney, R. and Citron, D. K. Disinformation on steroids: The threat of deep fakes. October 2018. URL https://www.cfr.org/report/deep-fake-disinformation-steroids. (Accessed on 04/17/2019).

Cole, S. Fake porn makers are worried about accidentally making child porn. February 2018. URL https://motherboard.vice.com/en_us/article/evmkxa/ai-fake-porn-deepfakes-child-pornography-emma-watson-elle-fanning. (Accessed on 04/17/2019).

Curtis, C. Deepfakes are being weaponized to silence women but this woman is fighting back. October 2018. URL https://thenextweb.com/code-word/2018/10/05/deepfakes-are-being-weaponized-to-silence-women-but-this-woman-is-fighting-back/. (Accessed on 04/17/2019).

D'Amiano, L., Cozzolino, D., Poggi, G., and Verdoliva, L. Video forgery detection and localization based on 3D PatchMatch. Proceedings of the IEEE International Conference on Multimedia & Expo Workshops, pp. 1–6, June 2015. URL https://doi.org/10.1109/ICMEW.2015.7169805. Turin, Italy.

D'Amiano, L., Cozzolino, D., Poggi, G., and Verdoliva, L. A PatchMatch-based dense-field algorithm for video copy-move detection and localization. IEEE Transactions on Circuits and Systems for Video Technology, 29(3):669–682, March 2019. URL https://doi.org/10.1109/TCSVT.2018.2804768.

D'Avino, D., Cozzolino, D., Poggi, G., and Verdoliva, L. Autoencoder with recurrent neural networks for video forgery detection. Proceedings of the IS&T Electronic Imaging, 2017(7):92–99, January 2017. URL https://doi.org/10.2352/ISSN.2470-1173.2017.7.MWSF-330. Burlingame, CA.

Fan, J., Kot, A. C., Cao, H., and Sattar, F. Modeling the Exif-image correlation for image manipulation detection. Proceedings of the IEEE International Conference on Image Processing, pp. 1945–1948, September 2011. URL https://doi.org/10.1109/ICIP.2011.6115853. Brussels, Belgium.

Guan, H., Kozak, M., Robertson, E., Lee, Y., Yates, A. N., Delgado, A., Zhou, D., Kheyrkhah, T., Smith, J., and Fiscus, J. MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation. Proceedings of the IEEE Winter Applications of Computer Vision Workshops, pp. 63–72, January 2019. URL https://doi.org/10.1109/WACVW.2019.00018. Waikoloa Village, HI.

Güera, D. and Delp, E. J. Deepfake video detection using recurrent neural networks. Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 1–6, November 2018. URL https://doi.org/10.1109/AVSS.2018.8639163. Auckland, New Zealand.

Hsu, C.-C., Hung, T.-Y., Lin, C.-W., and Hsu, C.-T. Video forgery detection using correlation of noise residue. Proceedings of the IEEE Workshop on Multimedia Signal Processing, pp. 170–174, October 2008. URL https://doi.org/10.1109/MMSP.2008.4665069. Cairns, Qld, Australia.

Huh, M., Liu, A., Owens, A., and Efros, A. A. Fighting fake news: Image splice detection via learned self-consistency. Proceedings of the European Conference on Computer Vision, pp. 106–124, September 2018. URL https://doi.org/10.1007/978-3-030-01252-6_7. Munich, Germany.

Iuliani, M., Shullani, D., Fontani, M., Meucci, S., and Piva, A. A video forensic framework for the unsupervised analysis of MP4-like file container. IEEE Transactions on Information Forensics and Security, 14(3):635–645, March 2019. URL https://doi.org/10.1109/TIFS.2018.2859760.

Jack, K. Chapter 13 - MPEG-2. In Jack, K. (ed.), Video Demystified: A Handbook for the Digital Engineer, pp. 577–737. Newnes, Burlington, MA, 2007. URL https://doi.org/10.1016/B978-075068395-1/50013-4.

Khanna, N., Chiu, G. T.-C., Allebach, J. P., and Delp, E. J. Forensic techniques for classifying scanner, computer generated and digital camera images. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1653–1656, March 2008. URL https://doi.org/10.1109/ICASSP.2008.4517944. Las Vegas, NV.

Korshunov, P. and Marcel, S. Deepfakes: a new threat to face recognition? assessment and detection. arXiv:1812.08685v1, December 2018. URL https://arxiv.org/abs/1812.08685v1.

Korshunova, I., Shi, W., Dambre, J., and Theis, L. Fast face-swap using convolutional neural networks. Proceedings of the IEEE International Conference on Computer Vision, pp. 3697–3705, October 2017. URL https://doi.org/10.1109/ICCV.2017.397. Venice, Italy.

Lameri, S., Bondi, L., Bestagini, P., and Tubaro, S. Near-duplicate video detection exploiting noise residual traces. Proceedings of the IEEE International Conference on Image Processing, pp. 1497–1501, September 2017. URL https://doi.org/10.1109/ICIP.2017.8296531. Beijing, China.

Li, Y., Chang, M., and Lyu, S. In ictu oculi: Exposing AI created fake videos by detecting eye blinking. Proceedings of the IEEE International Workshop on Information Forensics and Security, pp. 1–7, December 2018. URL https://doi.org/10.1109/WIFS.2018.8630787. Hong Kong, China.

Mandelli, S., Bestagini, P., Tubaro, S., Cozzolino, D., and Verdoliva, L. Blind detection and localization of video temporal splicing exploiting sensor-based footprints. Proceedings of the European Signal Processing Conference, pp. 1362–1366, September 2018. URL https://doi.org/10.23919/EUSIPCO.2018.8553511. Rome, Italy.

Matern, F., Riess, C., and Stamminger, M. Exploiting visual artifacts to expose deepfakes and face manipulations. Proceedings of the IEEE Winter Applications of Computer Vision Workshops, pp. 83–92, January 2019. URL https://doi.org/10.1109/WACVW.2019.00020. Waikoloa Village, HI.

McKinney, W. Data structures for statistical computing in Python. Proceedings of the Python in Science Conference, pp. 51–56, June 2010. URL http://conference.scipy.org/proceedings/scipy2010/mckinney.html. Austin, TX.

Milani, S., Fontani, M., Bestagini, P., Barni, M., Piva, A., Tagliasacchi, M., and Tubaro, S. An overview on video forensics. APSIPA Transactions on Signal and Information Processing, 1:e2, August 2012. URL https://doi.org/10.1017/ATSIP.2012.2.

Mullan, P., Cozzolino, D., Verdoliva, L., and Riess, C. Residual-based forensic comparison of video sequences. Proceedings of the IEEE International Conference on Image Processing, pp. 1507–1511, September 2017. URL https://doi.org/10.1109/ICIP.2017.8296533. Beijing, China.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, November 2011. URL http://dl.acm.org/citation.cfm?id=1953048.2078195.

Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., and Nießner, M. FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv:1803.09179, March 2018. URL https://arxiv.org/abs/1803.09179.

Saito, T. and Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3):e0118432, March 2015. URL https://doi.org/10.1371/journal.pone.0118432.

Stamm, M. C., Lin, W. S., and Liu, K. J. R. Forensics vs. anti-forensics: A decision and game theoretic framework. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1749–1752, March 2012a. URL https://doi.org/10.1109/ICASSP.2012.6288237. Kyoto, Japan.

Stamm, M. C., Lin, W. S., and Liu, K. J. R. Temporal forensics and anti-forensics for motion compensated video. IEEE Transactions on Information Forensics and Security, 7(4):1315–1329, August 2012b. URL https://doi.org/10.1109/TIFS.2012.2205568.

Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., and Nießner, M. Face2Face: Real-time face capture and reenactment of RGB videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395, June 2016. URL https://doi.org/10.1109/CVPR.2016.262. Las Vegas, NV.

Vincent, J. US lawmakers say AI deepfakes have the potential to disrupt every facet of our society. September 2018. URL https://www.theverge.com/2018/9/14/17859188/ai-deepfakes-national-security-threat-lawmakers-letter-intelligence-community. (Accessed on 04/17/2019).

Zannettou, S., Chatzis, S., Papadamou, K., and Sirivianos, M. The good, the bad and the bait: Detecting and characterizing clickbait on YouTube. Proceedings of the IEEE Security and Privacy Workshops, pp. 63–69, May 2018. URL https://doi.org/10.1109/SPW.2018.00018. San Francisco, CA.