Video Manipulation Detection Using Stream Descriptors
We Need No Pixels: Video Manipulation Detection Using Stream Descriptors

David Güera¹, Sriram Baireddy¹, Paolo Bestagini², Stefano Tubaro², Edward J. Delp¹

Abstract

Manipulating video content is easier than ever. Due to the misuse potential of manipulated content, multiple detection techniques that analyze the pixel data from the videos have been proposed. However, clever manipulators should also carefully forge the metadata and auxiliary header information, which is harder to do for videos than for images. In this paper, we propose to identify forged videos by analyzing their multimedia stream descriptors with simple binary classifiers, completely avoiding the pixel space. Using well-known datasets, our results show that this scalable approach can achieve a high manipulation detection score if the manipulators have not done a careful data sanitization of the multimedia stream descriptors.

    Descriptor               Video 1        Video 2                  Video 3
    video@codec_long_name    H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10  (same for all three)
    video@profile            Main           High 4:4:4 Predictive    High
    video@codec_time_base    0.02           0.016667                 0.0208542
    video@pix_fmt            yuv420p        gbrp                     yuv420p
    video@level              31             50                       40
    video@bit_rate           9.07642e+06    2.0245e+08               4.46692e+06

Figure 1. Examples of some of the information extracted from the video stream descriptors of three videos. These descriptors are necessary to decode and play back a video.
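Descriptors like the ones shown in Figure 1 can be dumped programmatically; the authors' implementation uses ffprobe (see Section 3). A minimal sketch of such an extraction step, assuming ffprobe is installed and on the PATH; the helper names and the `video@` key prefix mirror the figure but are our own:

```python
import json
import subprocess

def flatten_stream(ffprobe_json: str) -> dict:
    """Flatten ffprobe's JSON output for the first video stream into
    'video@<field>' keys, matching the rows shown in Figure 1."""
    stream = json.loads(ffprobe_json)["streams"][0]
    # Keep only scalar descriptor fields; nested structures are skipped.
    return {f"video@{k}": v for k, v in stream.items()
            if isinstance(v, (str, int, float))}

def probe(path: str) -> dict:
    """Run ffprobe on a video file and flatten its stream descriptors."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_streams", "-select_streams", "v:0", path],
        capture_output=True, text=True, check=True).stdout
    return flatten_stream(out)
```

Because the descriptors are read straight from the container headers, this step never touches pixel data and runs in milliseconds per video.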
1. Introduction

Video manipulation is now within reach of any individual. Recent improvements in the machine learning field have enabled the creation of powerful video manipulation tools. Face2Face (Thies et al., 2016), Recycle-GAN (Bansal et al., 2018), Deepfakes (Korshunov & Marcel, 2018), and other face swapping techniques (Korshunova et al., 2017) embody the latest generation of these open source video forging methods. It is assumed as a certainty both by the research community (Brundage et al., 2018) and by governments across the globe (Vincent, 2018; Chesney & Citron, 2018) that more complex tools will appear in the near future. Classical and current video editing methods have already demonstrated dangerous potential, having been used to generate political propaganda (Bird, 2015), revenge porn (Curtis, 2018), and child-exploitation material (Cole, 2018).

Due to the ever-increasing sophistication of these techniques, uncovering manipulations in videos remains an open problem. Existing video manipulation detection solutions focus entirely on the observance of anomalies in the pixel domain of the video. Unfortunately, it can be easily seen from a game-theoretic perspective that, if both manipulators and detectors are equally powerful, a Nash equilibrium will be reached (Stamm et al., 2012a). Under that scenario, both real and manipulated videos will be indistinguishable from each other, and the best detector will only be capable of random guessing. Hence, methods that look beyond the pixel domain are critically needed. So far, little attention has been paid to the necessary metadata and auxiliary header information that is embedded in every video. As we shall present, this information can be exploited to uncover unskilled video content manipulators.

In this paper, we introduce a new approach to address the video manipulation detection problem. To avoid the zero-sum, leader-follower game that characterizes current detection solutions, our approach completely avoids the pixel domain. Instead, we use the multimedia stream descriptors (Jack, 2007) that ensure the playback of any video (as shown in Figure 1). First, we construct a feature vector with all the descriptor information for a given video. Using a database of known manipulated videos, we train an ensemble of a support vector machine and a random forest that acts as our detector. Finally, during testing, we generate the feature vector from the stream descriptors of the video under analysis, feed it to the ensemble, and report a manipulation probability.

¹Video and Image Processing Laboratory (VIPER), Purdue University, West Lafayette, Indiana, USA. ²Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy. Correspondence to: David Güera <[email protected]>, Edward J. Delp <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s). arXiv:1906.08743v1 [cs.LG] 20 Jun 2019.
Using a Informazione, Elettronica e Bioingegneria, Politecnico di Mi- database of known manipulated videos, we train an ensem- lano, Milano, Italy. Correspondence to: David Guera¨ <dguer- ble of a support vector machine and a random forest that [email protected]>, Edward J. Delp <[email protected]>. acts as our detector. Finally, during testing, we generate the Proceedings of the 36 th International Conference on Machine feature vector from the stream descriptors of the video under Learning, Long Beach, California, PMLR 97, 2019. Copyright analysis, feed it to the ensemble, and report a manipulation 2019 by the author(s). probability. We Need No Pixels: Video Manipulation Detection Using Stream Descriptors The contributions of this paper are summarized as follows. 3. Proposed Method First, we introduce a new technique that does not require access to the pixel content of the video, making it fast and Current video manipulation detection approaches rely on un- scalable, even on consumer grade computing equipment. covering manipulations by studying pixel domain anomalies. Instead, we rely on the multimedia descriptors present on Instead, we propose to use the multimedia stream descrip- any video, which are considerably harder to manipulate due tors of videos as our main source of information to spot to their role in the decoding phase. Second, we thoroughly manipulated content. To do so, our method works in two test our approach using the NIST MFC datasets (Guan et al., stages, as presented in Figure2. First, during the training 2019) and show that even with a limited amount of labeled phase, we extract the multimedia stream descriptors from videos, simple machine learning ensembles can be highly a labeled database of manipulated and pristine videos. In effective detectors. 
2. Related Work

The multimedia forensics research community has a long history of trying to address the problem of detecting manipulations in video sequences. (Milani et al., 2012) provide an extensive and thorough overview of the main research directions and solutions that have been explored in the last decade. More recent work has focused on specific video manipulations, such as local tampering detection in video sequences (Stamm et al., 2012b; Bestagini et al., 2013), video re-encoding detection (Bian et al., 2014; Bestagini et al., 2016), splicing detection in videos (Hsu et al., 2008; Mullan et al., 2017; Mandelli et al., 2018), and near-duplicate video detection (Bayram et al., 2008; Lameri et al., 2017). (D'Amiano et al., 2015; 2019) also present solutions that use 3D PatchMatch (Barnes et al., 2009) for video forgery detection and localization, whereas (D'Avino et al., 2017) suggest using data-driven machine learning based approaches. Solutions tailored to detecting the latest video manipulation techniques have also been recently presented. These include the works of (Li et al., 2018; Güera & Delp, 2018) on detecting Deepfakes and (Rössler et al., 2018; Matern et al., 2019) on Face2Face (Thies et al., 2016) manipulation detection.

As covered by (Milani et al., 2012), image-based forensics techniques that leverage camera noise residuals (Khanna et al., 2008), image compression artifacts (Bianchi & Piva, 2012), or geometric and physics inconsistencies in the scene (Bulan et al., 2009) can also be used on videos when applied frame by frame. In (Fan et al., 2011) and (Huh et al., 2018), Exif image metadata is used to detect image brightness and contrast adjustments, and splicing manipulations in images, respectively. Finally, (Iuliani et al., 2019) use video file container metadata for video integrity verification and source device identification. To the best of our knowledge, video manipulation detection techniques that exploit the multimedia stream descriptors have not been explored.
3. Proposed Method

Current video manipulation detection approaches rely on uncovering manipulations by studying pixel domain anomalies. Instead, we propose to use the multimedia stream descriptors of videos as our main source of information to spot manipulated content. To do so, our method works in two stages, as presented in Figure 2.

First, during the training phase, we extract the multimedia stream descriptors from a labeled database of manipulated and pristine videos. In practice, such a database can be easily constructed using a limited amount of manually labeled data coupled with a semi-supervised learning approach, as done by (Zannettou et al., 2018). Then, we encode these descriptors as a feature vector for each given video. We apply median normalization to all numerical features. As for categorical features, each is encoded as its own unique numerical value. Once we have processed all the videos in the database, we use all the feature vectors to train different binary classifiers as our detectors. More specifically, we use a random forest, a support vector machine (SVM), and an ensemble of both detectors. The best hyperparameters for each detector are selected by performing a random search cross-validation over a 10-split stratified shuffling of the data and 1,000 trials per split. Figure 2a summarizes this first stage. In our implementation, we use ffprobe (Bellard et al., 2019) for the multimedia stream descriptor extraction. For the encoding of the descriptors as feature vectors, we use pandas (McKinney, 2010) and scikit-learn (Pedregosa et al., 2011). As for the training and testing of the SVM, the random forest, and the ensemble, we use the implementations available in the scikit-learn library.
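The training stage just described can be sketched with pandas and scikit-learn. The toy descriptor table, the hyperparameter ranges, and the small search budget below are illustrative assumptions (the paper runs 1,000 trials per split), not the authors' exact settings:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

def encode(df: pd.DataFrame) -> pd.DataFrame:
    """Median-normalize numerical descriptors; map each categorical
    descriptor to its own integer code, as described in Section 3."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            med = df[col].median()
            out[col] = df[col] / med if med else df[col]
        else:
            out[col] = df[col].astype("category").cat.codes
    return out

# Illustrative toy descriptor table with random labels (1 = manipulated).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "video@bit_rate": rng.uniform(1e6, 2e8, 200),
    "video@level": rng.choice([31, 40, 50], 200),
    "video@pix_fmt": rng.choice(["yuv420p", "gbrp"], 200),
})
y = rng.integers(0, 2, 200)
X = encode(df).to_numpy()

# Random-search CV over a 10-split stratified shuffling of the data.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
rf = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                        {"n_estimators": [50, 100, 200],
                         "max_depth": [None, 5, 10]},
                        n_iter=5, cv=cv, random_state=0).fit(X, y)
svm = RandomizedSearchCV(SVC(probability=True, random_state=0),
                         {"C": [0.1, 1, 10]},
                         n_iter=3, cv=cv, random_state=0).fit(X, y)

# Soft-voting ensemble of the two tuned detectors.
ensemble = VotingClassifier(
    [("rf", rf.best_estimator_), ("svm", svm.best_estimator_)],
    voting="soft").fit(X, y)
```

Soft voting averages the two detectors' class probabilities, which keeps the final output a manipulation probability rather than a hard label.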
Finally, image brightness and contrast adjustments, and splicing ma- the trained detector analyzes the suspect feature vector and nipulations in images, respectively. Finally, (Iuliani et al., computes a manipulation probability. 2019) use video file container metadata for video integrity It is important to note that although our approach may be verification and source device identification. To the best vulnerable to video re-encoding attacks, this is traded off for of our knowledge, video manipulation detection techniques scalability, a limited need of labeled data, and a high video that exploit the multimedia stream descriptors have not been manipulation detection score, as we present in Section4.