We Need No Pixels: Video Manipulation Detection Using Stream Descriptors

David Güera 1, Sriram Baireddy 1, Paolo Bestagini 2, Stefano Tubaro 2, Edward J. Delp 1

1 Video and Image Processing Laboratory (VIPER), Purdue University, West Lafayette, Indiana, USA
2 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy

Correspondence to: David Güera <[email protected]>, Edward J. Delp <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

arXiv:1906.08743v1 [cs.LG] 20 Jun 2019

Abstract

Manipulating video content is easier than ever. Due to the misuse potential of manipulated content, multiple detection techniques that analyze the pixel data from the videos have been proposed. However, clever manipulators should also carefully forge the metadata and auxiliary header information, which is harder to do for videos than images. In this paper, we propose to identify forged videos by analyzing their multimedia stream descriptors with simple binary classifiers, completely avoiding the pixel space. Using well-known datasets, our results show that this scalable approach can achieve a high manipulation detection score if the manipulators have not done a careful data sanitization of the multimedia stream descriptors.

Figure 1. Examples of some of the information extracted from the video stream descriptors. These descriptors are necessary to decode and playback a video.

    Descriptor               Video 1                  Video 2                  Video 3
    video@codec_long_name    H.264 / AVC / MPEG-4     H.264 / AVC / MPEG-4     H.264 / AVC / MPEG-4
                             AVC / MPEG-4 part 10     AVC / MPEG-4 part 10     AVC / MPEG-4 part 10
    video@profile            Main                     High 4:4:4 Predictive    High
    video@codec_time_base    0.02                     0.016667                 0.0208542
    video@pix_fmt            yuv420p                  gbrp                     yuv420p
    video@level              31                       50                       40
    video@bit_rate           9.07642e+06              2.0245e+08               4.46692e+06

1. Introduction

Video manipulation is now within reach of any individual. Recent improvements in the machine learning field have enabled the creation of powerful video manipulation tools. Face2Face (Thies et al., 2016), Recycle-GAN (Bansal et al., 2018), Deepfakes (Korshunov & Marcel, 2018), and other face swapping techniques (Korshunova et al., 2017) embody the latest generation of these open source video forging methods. It is assumed as a certainty both by the research community (Brundage et al., 2018) and governments across the globe (Vincent, 2018; Chesney & Citron, 2018) that more complex tools will appear in the near future. Classical and current video editing methods have already demonstrated dangerous potential, having been used to generate political propaganda (Bird, 2015), revenge-porn (Curtis, 2018), and child-exploitation material (Cole, 2018).

Due to the ever-increasing sophistication of these techniques, uncovering manipulations in videos remains an open problem. Existing video manipulation detection solutions focus entirely on the observance of anomalies in the pixel domain of the video. Unfortunately, it can be easily seen from a game theoretic perspective that, if both manipulators and detectors are equally powerful, a Nash equilibrium will be reached (Stamm et al., 2012a). Under that scenario, both real and manipulated videos will be indistinguishable from each other, and the best detector will only be capable of random guessing. Hence, methods that look beyond the pixel domain are critically needed. So far, little attention has been paid to the necessary metadata and auxiliary header information that is embedded in every video. As we shall present, this information can be exploited to uncover unskilled video content manipulators.

In this paper, we introduce a new approach to address the video manipulation detection problem. To avoid the zero-sum, leader-follower game that characterizes current detection solutions, our approach completely avoids the pixel domain. Instead, we use the multimedia stream descriptors (Jack, 2007) that ensure the playback of any video (as shown in Figure 1). First, we construct a feature vector with all the descriptor information for a given video. Using a database of known manipulated videos, we train an ensemble of a support vector machine and a random forest that acts as our detector. Finally, during testing, we generate the feature vector from the stream descriptors of the video under analysis, feed it to the ensemble, and report a manipulation probability.
The contributions of this paper are summarized as follows. First, we introduce a new technique that does not require access to the pixel content of the video, making it fast and scalable, even on consumer-grade computing equipment. Instead, we rely on the multimedia descriptors present in any video, which are considerably harder to manipulate due to their role in the decoding phase. Second, we thoroughly test our approach using the NIST MFC datasets (Guan et al., 2019) and show that even with a limited amount of labeled videos, simple machine learning ensembles can be highly effective detectors. Finally, all of our code and trained classifiers will be made available so the research community can reproduce our work with their own datasets.

2. Related Work

The multimedia forensics research community has a long history of trying to address the problem of detecting manipulations in video sequences. (Milani et al., 2012) provide an extensive and thorough overview of the main research directions and solutions that have been explored in the last decade. More recent work has focused on specific video manipulations, such as local tampering detection in video sequences (Stamm et al., 2012b; Bestagini et al., 2013), video re-encoding detection (Bian et al., 2014; Bestagini et al., 2016), splicing detection in videos (Hsu et al., 2008; Mullan et al., 2017; Mandelli et al., 2018), and near-duplicate video detection (Bayram et al., 2008; Lameri et al., 2017). (D'Amiano et al., 2015; 2019) also present solutions that use 3D PatchMatch (Barnes et al., 2009) for video forgery detection and localization, whereas (D'Avino et al., 2017) suggest using data-driven machine learning based approaches. Solutions tailored to detecting the latest video manipulation techniques have also been recently presented. These include the works of (Li et al., 2018; Güera & Delp, 2018) on detecting Deepfakes and (Rössler et al., 2018; Matern et al., 2019) on Face2Face (Thies et al., 2016) manipulation detection.

As covered by (Milani et al., 2012), image-based forensics techniques that leverage camera noise residuals (Khanna et al., 2008), image compression artifacts (Bianchi & Piva, 2012), or geometric and physics inconsistencies in the scene (Bulan et al., 2009) can also be used in videos when applied frame by frame. In (Fan et al., 2011) and (Huh et al., 2018), Exif image metadata is used to detect image brightness and contrast adjustments, and splicing manipulations in images, respectively. Finally, (Iuliani et al., 2019) use video file container metadata for video integrity verification and source device identification. To the best of our knowledge, video manipulation detection techniques that exploit the multimedia stream descriptors have not been studied before.

3. Proposed Method

Current video manipulation detection approaches rely on uncovering manipulations by studying pixel domain anomalies. Instead, we propose to use the multimedia stream descriptors of videos as our main source of information to spot manipulated content. To do so, our method works in two stages, as presented in Figure 2. First, during the training phase, we extract the multimedia stream descriptors from a labeled database of manipulated and pristine videos. In practice, such a database can be easily constructed using a limited amount of manually labeled data coupled with a semi-supervised learning approach, as done by (Zannettou et al., 2018). Then, we encode these descriptors as a feature vector for each given video. We apply median normalization to all numerical features. As for categorical features, each is encoded as its own unique numerical value. Once we have processed all the videos in the database, we use all the feature vectors to train different binary classifiers as our detectors. More specifically, we use a random forest, a support vector machine (SVM), and an ensemble of both detectors.
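A minimal sketch of this feature encoding and detector construction, using the pandas and scikit-learn tooling named in the paper, could look as follows. The exact form of the median normalization, the category-to-integer mapping, and all hyperparameters are our assumptions for illustration; the paper does not fix them at this point.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

def encode_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Turn raw descriptor fields (one row per video) into numeric features."""
    encoded = raw.copy()
    for col in encoded.columns:
        if encoded[col].dtype == object:
            # Each categorical value is mapped to its own numeric code.
            encoded[col] = encoded[col].astype("category").cat.codes
        else:
            # Median normalization (assumed here to mean dividing each
            # numerical feature by its training-set median).
            median = encoded[col].median()
            if median:  # guard against division by zero
                encoded[col] = encoded[col] / median
    return encoded

# Soft-voting ensemble of the two detectors described above.
detector = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("svm", SVC(probability=True)),  # needed for probability averaging
    ],
    voting="soft",  # average the class probabilities of both detectors
)
# X = encode_features(descriptor_table); y = labels (1 = manipulated)
# detector.fit(X, y)
```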
The best hyperparameters for each detector are selected by performing a random search cross-validation over a 10-split stratified shuffling of the data and 1,000 trials per split. Figure 2a summarizes this first stage. In our implementation, we use ffprobe (Bellard et al., 2019) for the multimedia stream descriptor extraction. For the encoding of the descriptors as feature vectors, we use pandas (McKinney, 2010) and scikit-learn (Pedregosa et al., 2011). As for the training and testing of the SVM, the random forest, and the ensemble, we use the implementations available in the scikit-learn library.

Figure 2b shows how our method would work in practice. Given a suspect video, we extract its stream descriptors and generate its corresponding feature vector, which is normalized based on the values learnt during the training phase. Since some of the descriptor fields are optional, we perform additional post-processing to ensure that the feature vector can be processed by our trained detector. Concretely, if any field is missing in the video stream descriptors, we perform data imputation by mapping missing fields to a fixed numerical value. If previously unseen descriptor fields are present in the suspect video stream, they are ignored and not included in the corresponding suspect feature vector. Finally, the trained detector analyzes the suspect feature vector and computes a manipulation probability.

It is important to note that although our approach may be vulnerable to video re-encoding attacks, this is traded off for scalability, a limited need of labeled data, and a high video manipulation detection score, as we present in Section 4.
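The test-time alignment just described can be sketched as below. The fixed imputation value and the helper name are illustrative assumptions; the categorical encoding and median normalization learnt during training (omitted here for brevity) would be applied with the training-time values, as the paper specifies.

```python
import pandas as pd

MISSING_VALUE = -1  # assumed choice for the fixed imputation value

def build_suspect_vector(descriptors, train_columns):
    """Align a suspect video's descriptors with the training feature space."""
    # Fields absent from the suspect video are imputed with a fixed
    # numerical value; fields unseen during training are never looked up,
    # so they are implicitly dropped from the suspect feature vector.
    row = {col: descriptors.get(col, MISSING_VALUE) for col in train_columns}
    return pd.DataFrame([row], columns=train_columns)

# features = build_suspect_vector(extract_stream_descriptors("suspect.mp4"),
#                                 train_columns)
# probability = detector.predict_proba(features)[0, 1]  # P(manipulated)
```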