Cooperative edge deepfake detection

Main subject area: Computer Engineering
Authors: Enis Hasanaj, William Söder, Albert Aveler
Supervisor: Garrit Schaap
JÖNKÖPING June 2021

This final thesis has been carried out at the School of Engineering at Jönköping University within Computer Engineering. The authors are responsible for the presented opinions, conclusions, and results.

Examiner: Rachid Oucheikh
Supervisor: Garrit Schaap
Scope: 15 hp (first-cycle education)
Date: 2021-06-12

Abstract
1 Introduction
1.1 PROBLEM STATEMENT
1.1.1 Deepfakes
1.1.2 Computational power in machine learning
1.2 LITERATURE REVIEW
1.2.1 Edge training and federated learning
1.2.2 Deepfake detection
1.3 PURPOSE AND RESEARCH QUESTIONS
1.4 SCOPE AND LIMITATIONS
1.5 DISPOSITION
2 Method and implementation
2.1 DATA COLLECTION
2.2 DATA ANALYSIS
2.3 VALIDITY AND RELIABILITY
2.4 CONSIDERATIONS
2.5 PROPOSED SOLUTION OF COOPERATIVE EDGE DEEPFAKE DETECTION
3 Theoretical framework
3.1 DARKNET YOLOV2
3.1.1 Basics
3.1.2 Details
3.2 COMBINING THE MODELS
3.3 ENSEMBLE METHODS
3.3.1 Bagging
3.3.2 Boosting
3.3.3 Stacking
3.4 PROGRAMMING LANGUAGES AND PLATFORMS
3.5 DEEPFAKES EXPLAINED
3.6 DATASET
3.7 TERMINOLOGY
3.7.1 Bounding box
3.7.2 True/False Positive/Negative
3.7.3 Recall
3.7.4 Precision
3.7.5 Accuracy
3.7.6 Confidence
3.7.7 Aggregated vs. Non-Aggregated models
3.7.8 Supervised vs. Unsupervised learning
3.7.9 Overfitting
3.8 CONVOLUTIONAL NEURAL NETWORK (CNN)
3.8.1 Convolutional layers
3.8.2 Pooling layers
3.8.3 Fully connected layers
4 Results
4.1 TRAINING WITH DIFFERENT SUBSETS
4.2 TRAINING WITH DIFFERENT NUMBER OF ITERATIONS
4.3 EDGE TRAINING RESULTS
4.4 ENSEMBLE RESULTS
5 Discussion
5.1 LIMITATIONS
5.2 RESULT DISCUSSION
5.3 METHOD DISCUSSION
6 Conclusions and further research
6.1 CONCLUSIONS
6.1.1 Practical implications
6.1.2 Scientific implications
6.2 FURTHER RESEARCH
6.2.1 Android
6.2.2 Apple support for object detection on the edge
6.2.3 Federated learning
6.2.4 Another ensemble method
7 References

Abstract

Deepfakes are an emerging problem on social media. For celebrities and political profiles, the technology can be devastating to their reputation if it ends up in the wrong hands. Creating deepfakes is becoming increasingly easy. Attempts have been made at detecting whether a face in an image is real or not, but training these machine learning models can be a very time-consuming process. This research proposes a solution for training deepfake detection models cooperatively on the edge, in order to evaluate whether the training process, among other things, can be made more efficient with this approach.

The feasibility of edge training is evaluated by training machine learning models on several different types of iPhone devices. The models are trained using the YOLOv2 object detection system.

To test whether the YOLOv2 object detection system is able to distinguish between real and fake human faces in images, several models are trained on a computer. Each model is trained with either a different number of iterations or a different subset of data, since these parameters have been identified as important to the performance of the models. The performance of the models is evaluated by measuring the accuracy in detecting deepfakes.

Additionally, the deepfake detection models trained on a computer are ensembled using the bagging ensemble method. This is done in order to evaluate the feasibility of cooperatively training a deepfake detection model by combining several models.

Results show that the proposed solution is not feasible due to the time the training process takes on each mobile device. Additionally, each trained model is about 200 MB, and the size of the ensemble model grows linearly with each model added to the ensemble. This can cause the ensemble model to grow to several hundred gigabytes in size.

Keywords: Machine learning, deepfake, artificial intelligence, ensemble, convolutional neural networks, edge, YOLOv2

1 Introduction

In this study, deepfake detection models are trained locally on multiple iPhone devices using the YOLOv2 system (Redmon, 2016), which is part of the Darknet open-source machine learning framework. YOLOv2 is used to train models on parts of an image dataset, and the models are combined into an ensemble once they have been trained. A method is proposed for how this can be done, and the feasibility of the method is tested by measuring potential benefits in time, resource efficiency and complexity compared to training on a desktop computer.

1.1 Problem statement

1.1.1 Deepfakes

Deepfakes can, for instance, be face manipulation in images and videos. For a more detailed technical explanation of deepfakes, see section 3.5. Deepfakes are an emerging threat on social media. They often target celebrities and political profiles, for whom the technology can be devastating to their reputation if it ends up in the wrong hands. Deepfakes can also be used to spread misinformation by using the voice and/or face of political figures, and they are making it increasingly difficult to distinguish real from fake. The creation of deepfakes is also becoming increasingly simple through the use of software such as Faceswap (https://faceswap.dev) and Deepfakesweb (https://deepfakesweb.com). Successful attempts have been made at detecting whether a face in an image is real or not (Afchar et al., 2018; Guarnera et al., 2020). However, training these machine learning models can be a very time-consuming process (Hecht, 2019).

1.1.2 Computational power in machine learning

As the world becomes more and more digitalized, the amount of data we produce is only expected to increase. A Boeing 787 generates approximately 5 gigabytes of data every second (Shi et al., 2016). Every week, 28 billion photos are uploaded to Google Photos (Ben-Yair, 2020). Consequently, the processing power required for many applications today is expected to increase as well. This is especially true for machine learning applications, which typically require very large datasets and have a “voracious appetite for computing power” (Thompson et al., 2020) in order to be trained with good accuracy in a reasonable amount of time. Some examples of this include training of autonomous vehicles, image classification and scientific data analysis. Currently, the majority of applications requiring a lot of computational power (such as machine learning) are running in the cloud. However, major tech companies are exploring the benefits of edge computing solutions and making use of them, and edge computing is a fast-growing field both in research and among companies (Kairouz et al., 2019). Edge computing refers to the distribution of computation over several non-centralized edge devices, see the definition in section 1.2.1.

The data centers that power cloud services, such as Amazon Web Services and Microsoft Azure, typically use some of the best hardware available on the market. Advances in chip design and manufacturing have allowed these data centers to become powerful. A concept known as Moore’s Law states that the number of transistors that can be fit on a microchip can be expected to double every two years (Schaller, 1997). Historically, this statement has been accurate, while in some cases, progress has been even faster than that. Currently, the size of the smallest transistor found in modern consumer-grade processors is 5 nanometers (Cutress, 2020). For reference, an atom is roughly 0.1 to 0.5 nanometers in diameter. Hence, chip manufacturers are slowly approaching the physical limits of how small transistors can become. The end of Moore’s Law is expected to occur during this century. Consequently, advances in computing power are expected to slow down (Wardynski, 2019). This can become a problem for the data centers mentioned previously. As the amount of data to be processed increases, the computational power necessary to process this data will increase as well, requiring either more efficient algorithms or faster chips (Schaller, 1997). Therefore, it will become beneficial to find new solutions that can outperform existing cloud solutions in their computational capabilities. Additionally, it will become more and more important to make use of existing hardware, not only as advances in computational power slow down, but also since the chip manufacturing process has a high environmental impact (Williams, 2004).
Many of the devices used by end consumers, such as smartphones, feature chips with high capabilities. Manufacturers of these devices typically market these improvements by emphasizing the speed and efficiency provided by these chips. It is quite common, however, that individual consumers who purchase these devices only perform common tasks like browsing the web and social media (Brown, 2019), which require only a small fraction of the processing power available within these chips. It is quite rare for the chips in these devices to be utilized to their full capability.

This research aims to provide a solution for deepfake detection at large scale and to see whether the proposed solution can help speed up the process of training deepfake detection models cooperatively. Existing solutions have used a decentralized approach where the models being trained use the data available on the edge device. This research, in contrast, takes a centralized approach where the data used for training is located on a desktop computer. Additionally, existing solutions train the models using unsupervised learning, while the proposed solution uses supervised learning.

1.2 Literature review

1.2.1 Edge training and federated learning

An edge device, in the scope of this research, refers to a smartphone. In general, an edge device can be almost any device that is not located in a centralized environment such as the cloud. This includes devices such as smartphones, computers, laptops and IoT devices (Shi et al., 2016). The concept of machine learning on the edge is not new. In fact, there is a whole branch of machine learning called federated learning, which is defined as the orchestration of edge devices (such as smartphones, laptops and IoT devices) to contribute computational power to a common machine learning objective (Kairouz et al., 2021). Federated learning provides a more privacy-oriented and resource-efficient solution compared to cloud-based machine learning (Bonawitz et al., 2019). The current state-of-the-art research on edge training is mainly focused on federated learning. Many applications have been explored and discussed in different studies, including autonomous vehicles (Pokhrel & Choi, 2020), healthcare (Xu et al., 2020), and blockchain (Nguyen et al., 2021). No previous studies have attempted to train deepfake detection models by utilizing federated learning or machine learning based on edge training. Federated learning is not used in this research because it is a decentralized technique that uses private data located on the edge devices, which means that the server should not have access to the dataset used. Section 6.2.3 presents more details about federated learning and its use cases as a topic for further research. Instead of federated learning, bagging (see section 3.3.1) is implemented, because it is a centralized technique that makes use of only one global dataset.

This research does not discuss or try to solve the problems related to edge computing. However, the authors think that it is important to make the reader aware of the existence of such problems. Shi et al. (2016) mention several challenges that can arise in edge computing: programmability, naming, data abstraction, service management, privacy and security, and optimization metrics. The same research gives more details about each problem, together with suggested solutions.

1.2.2 Deepfake detection

Current state-of-the-art deepfake detection methods are built using various frameworks and algorithms. Several studies claim to reach a detection rate of over 90% (Afchar et al., 2018; Guarnera et al., 2020). No studies were found on training deepfake detection models on the edge. Existing deepfake detection methods, like this research, utilize Convolutional Neural Networks (CNN) to detect deepfakes (Guarnera et al., 2020; Guera & Delp, 2018). Most of the existing research utilizes Convolutional Neural Networks in some way, since this is a very suitable type of neural network for deepfake detection. However, the majority of deepfake detection systems use specialized software that has been tailored towards detecting deepfakes and use algorithms that are more optimized for this task. In contrast, the system used in this research (YOLOv2) is an object detection system that was made to detect any type of object. No previous studies were found that test the feasibility of using generic object detection systems, such as YOLOv2, to detect deepfakes. The reason for choosing YOLOv2 is discussed in section 1.4.

1.3 Purpose and research questions

The purpose of this research is to evaluate the feasibility of training deepfake detection models cooperatively on the edge. This research aims to answer the following questions in order to achieve this purpose:

1. How can we combine the computational power of multiple mobile devices to create an efficient deepfake detection solution?

Due to the problems specified in section 1.1, machine learning applications need to look for alternative methods of training models as the datasets and the requirements in processing power increase. By answering this research question, an alternative can be established. If the proposed solution proves feasible, users of machine learning cloud services, such as Amazon SageMaker (https://aws.amazon.com/sagemaker/) and Azure Machine Learning (https://azure.microsoft.com/services/machine-learning/), could expect to see large cuts in costs compared to using the cloud. Instead of using the cloud, they could leverage the computational power of devices from the end-users.

One of the challenges with training a model locally on a mobile device is the size of the dataset needed to train a model that provides accurate deepfake detection results. This is a challenge because the hardware is much more limited than in cloud computation: a mobile device has less memory and less storage. That is why it might become beneficial to use the computational power of millions of mobile devices to contribute to the process.

The advances in deepfake creation tools are making it harder to distinguish real from fake. Researchers believe that it might become impossible to detect deepfakes in the future. To combat the threat of deepfakes, researchers suggest continued innovation and research within the field of deepfake detection (Katarya & Lal, 2020). By leveraging millions of mobile devices, their computational power can be combined to train more advanced deepfake detection algorithms on a much larger scale.

2. How does the aggregation of models affect the accuracy of the final trained model in comparison to a non-aggregated model?

The standard method of performing machine learning is to train the model sequentially. In this approach, the model is trained on one device, from start to finish. For the proposed solution, several models are trained in parallel with unique data for each device. The models are then combined into an ensemble model using the bagging method, as explained in section 3.3.1. Answering this question allows for an evaluation of what differences (if any) are caused by the combining process. Differences in the accuracy of the trained models (ensemble model vs. sequential model) are evaluated. Additionally, the amount of time the combining and converting processes take is evaluated.

1.4 Scope and limitations

The research is based on existing tools to train models to detect deepfakes. Therefore, the focus is not to improve existing algorithms for detecting deepfakes but rather to provide a new approach to training deepfake detection models on a large scale. The client-side part of the study focuses on the current version of iOS, which (as of this writing) is iOS 14, and the hardware is limited to a set of iPhone devices (iPhone 8, iPhone X, iPhone 11, and iPhone 12 Pro). The proposed solution does not aim to answer questions such as:
- What benefit does the solution offer to the edge user (client)?
- Why would the edge user want to contribute their processing power towards machine learning?
- How can we guarantee that a sufficient amount of data will actually be used in the training process?

Answers to these questions depend on what type of application is utilizing the solution and what purpose the application in question has. It is up to the owner of the solution to decide how to engage the edge users in contributing their processing power. Optimally, the proposed solution runs as a simple background task that the edge user never notices, but this approach might not be suitable for all applications.

The YOLOv2 system was chosen as it provides state-of-the-art object detection. Additionally, YOLOv2 uses Convolutional Neural Networks (explained in section 3.8), which is what other deepfake detection systems have used to successfully detect deepfakes (Afchar et al., 2018; Guarnera et al., 2020). The results show that YOLOv2 is capable of detecting deepfakes, as discussed in section 5. Another reason for using YOLOv2 is that it is written in the C programming language. The C language is supported by default on iPhone devices, which makes it possible to run YOLOv2 with pre-built binaries for the iOS platform.

Initially, the idea was to use the Core ML framework created by Apple, as this would allow the authors to utilize the neural engine found in Apple devices, speeding up training substantially (Shirakawa, 2021). Core ML is not used, due to Apple’s lack of documentation for the framework and its lack of compatibility with other frameworks (such as Keras and TensorFlow). Initial testing with Core ML displayed the challenges in using this framework over other frameworks like Keras.

1.5 Disposition

The remainder of the thesis is structured as follows:

Chapter 2 (Method and implementation): Describes the research methods and work process. The proposed solution is presented in more detail.

Chapter 3 (Theoretical framework): The frameworks and libraries used in the research, such as YOLOv2, are discussed. Terminology is explained and some further theory behind neural networks is presented. The three major ensemble methods are explained.

Chapter 4 (Results): The collected data is presented. Data from both platforms (the edge devices and the computer) is presented and the results from the ensemble are displayed.

Chapter 5 (Discussion): The results of the study are analyzed and discussed in relation to previous research. The limitations of the research are also discussed, such as whether YOLOv2 is suitable for deepfake detection.

Chapter 6 (Conclusions and further research): Conclusions and implications from the study are presented and suggestions for further research are provided.

2 Method and implementation

The research is conducted as experimental, empirical research and is divided into two phases: data collection and data analysis. Based on the results from the second phase, a proposed solution is evaluated. This proposed solution describes an approach to implementing cooperative edge deepfake detection, see section 2.5. During the data collection phase, data is collected from two different platforms:

iPhone devices: As the main purpose of the study is to evaluate training deepfake detection models on the edge, an iOS application is created to be used for the training of models. The devices that are used are iPhone 8, iPhone X, iPhone 11 and iPhone 12 Pro. All the devices are running the latest version of iOS, which (as of this writing) is iOS 14. Table 1 shows the specifications of these devices.

Table 1, Specifications of edge devices

Device         CPU        RAM   Neural Engine
iPhone 8                  2 GB  Dual core
iPhone X       Apple A11  3 GB  Dual core
iPhone 11                 4 GB  8-core
iPhone 12 Pro  Apple A14  6 GB  16-core

Desktop computer: In the scope of this research, a computer is used as a stand-alone training solution to train and test deepfake detection models. A server is also created and runs locally on the computer. The server receives the incoming models and converts them to Keras format before starting the ensemble process. Table 2 shows the specifications of the computer that is used.

Table 2, Specifications of the training computer

CPU                                                 GPU               RAM             OS
Intel(R) Core(TM) i5-9600KF CPU @ 3.70GHz (6 core)  GeForce RTX 2060  16 GB 2666 MHz  Windows 10 Pro

2.1 Data collection

Quantitative data is collected when training the models on both the iPhones and the computer. Both platforms use the same Darknet library for the training. Therefore, the platform where the training is executed has no effect on the model itself.

The following data is collected from training on the iPhones:
- The time it takes to train each model locally on the phones.
- The level of power consumption during the training process.
- CPU and RAM usage during the training process.

The following data is collected from training on the computer:
- The time it takes to train each model on the computer.
- The accuracy rate of the models that are trained with different subsets of data but the same number of iterations.
- The accuracy rate of the models that are trained with the same subset of data but with a different number of iterations.
- The time it takes to convert the models to Keras format.
- The time it takes to aggregate the trained models.
- The accuracy rate of the ensemble model.

The deepfake detection models are trained through supervised learning, which means that sets of labeled data are used to train them. A subset of images from the dataset (see section 3.6) is used to train the models on the iPhones and the computer using YOLOv2. Once a model is trained, it is sent to the server and merged into the existing ensemble that is available there.

2.2 Data analysis

The quality and efficiency of the deepfake detection model is measured by analyzing multiple factors: time of execution (training, converting and combining), accuracy of results, and the mobile devices’ resource utilization, such as the level of power consumption and RAM and CPU usage. Models are trained with different configurations (such as the number of iterations and the data subset) in order to analyze the impact of these parameters on the trained model. All trained models are then tested on a different subset of the dataset (not the subset used for training). The collected testing data is then used to calculate the accuracy of the models, which is used for evaluating them. For more details, see section 3.7.5.

The data collected from training and testing the models is used to assess whether the training time on the edge is reasonable, whether merging models from multiple sources is a good way to train a model at scale, and whether the results are affected by the conversion and/or aggregation process.

2.3 Validity and reliability

The dataset used for this research is available online (see footnote 5). As such, the data that is used is not confidential to this research. This allows future readers to verify the research by using the same dataset. The machine learning tool YOLOv2 that is used to train and run the models on iOS is freely available as open source (see footnote 6). All the work is documented to the point where any reader should be able to replicate it. Any code written for this research is published as open source so that anyone can see it and use it (available under the MIT license). The devices used in this research are common devices (iPhones) that most people are familiar with (Team Counterpoint, 2021).

To ensure validity, the dataset that is used to test the trained models is large enough for the results to provide a high level of confidence (see definition in section 3.7.6). Therefore, all models are tested with a dataset of 2000 images (1000 real and 1000 fake). In this research, an accuracy of 70% is set as the minimum acceptable accuracy. This means that at least 70% of all the images must be correctly identified as either real or fake. Each detection comes with a confidence score that indicates how certain the model is about the detection. In this research, a detection must have a confidence score of at least 70% to be accepted. For instance, if a model says that a fake image is fake with a confidence score of 60%, this detection is not accepted. Both accuracy and confidence are explained in detail in section 3.7.
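The two 70% thresholds can be illustrated with a minimal Python sketch. The Detection record, the function name and the sample values are hypothetical and only serve the illustration. The sketch also assumes that a correct detection below the confidence threshold counts as a miss when accuracy is computed, which the text does not state explicitly.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.70  # detections below this confidence are not accepted
MIN_ACCURACY = 0.70    # minimum acceptable accuracy for a model


@dataclass
class Detection:
    true_label: str       # ground truth: "real" or "fake"
    predicted_label: str  # class predicted by the model
    confidence: float     # confidence score in [0, 1]


def accuracy(detections, min_confidence=MIN_CONFIDENCE):
    """Share of test images whose prediction is correct and confident enough."""
    if not detections:
        return 0.0
    accepted = sum(
        1
        for d in detections
        if d.predicted_label == d.true_label and d.confidence >= min_confidence
    )
    return accepted / len(detections)


# Hypothetical results for four test images.
results = [
    Detection("fake", "fake", 0.91),  # correct and confident: counted
    Detection("fake", "fake", 0.60),  # correct but below 70%: not accepted
    Detection("real", "real", 0.85),  # correct and confident: counted
    Detection("real", "fake", 0.95),  # wrong class: not counted
]
score = accuracy(results)
meets_requirement = score >= MIN_ACCURACY
```

In this made-up sample, two of four detections are accepted, so the model would fall short of the minimum acceptable accuracy.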

2.4 Considerations

It is important to keep in mind that the dataset used has a great impact on the accuracy of the deepfake detection model (Pathak et al., 2018). Additionally, the choice of algorithm that is used to detect deepfakes impacts the quality of the detection model (Wang & Dantcheva, 2020). The part of the research regarding training the models on the edge should not be directly impacted by the dataset used. The proposed solution expects that the model received from the edge user is actually trained on the data that it is expected to be trained on. In the real world, however, the edge user simply cannot be trusted to have trained their model on the expected data. Due to time constraints, this is not a focus of this research, but it is important to keep in mind.

5 https://www.kaggle.com/xhlulu/140k-real-and-fake-faces 6 https://pjreddie.com/darknet/yolov2/

Despite the fact that the main focus of the research is detecting deepfakes, the findings from this research should be applicable to other forms of machine learning as well. The conclusions made will be generalized to work for most machine learning applications at large scale.

2.5 Proposed solution of cooperative edge deepfake detection

The goal of the research is to be able to draw conclusions about the efficiency of the proposed solution and whether it is technically feasible to implement, and to contribute to the field of machine learning and deepfake detection. The final results from the data collection and data analysis phases decide whether this implementation is applicable. Figure 1 depicts the architecture behind the proposed solution.

Figure 1, Architecture of the proposed solution.

The workflow of the proposed solution is as follows:

1. The edge device requests a subset of the training dataset by calling the /GetImages API-endpoint.
2. The server responds with a unique subset of the dataset which the ensemble model has not yet used for training.
3. The edge device uses the images to train a model locally.
4. Once the device has finished training the model, the model is sent to the /AggregateModel API-endpoint.
5. The server receives the trained model and combines it with the current ensemble model.
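The five steps above can be simulated with a small in-memory Python sketch. This is not the server implemented in this research: the class, method and function names are hypothetical, no HTTP is involved, and local YOLOv2 training is replaced by a placeholder that only records which images a device received.

```python
class EnsembleServer:
    """In-memory stand-in for the server; no HTTP involved."""

    def __init__(self, image_ids, subset_size):
        self._unused = list(image_ids)  # images no ensemble member has trained on
        self._subset_size = subset_size
        self.ensemble = []              # trained models received so far

    def get_images(self):
        """Mirrors /GetImages: hand out a subset that has not been used yet."""
        subset = self._unused[: self._subset_size]
        self._unused = self._unused[self._subset_size:]
        return subset

    def aggregate_model(self, model):
        """Mirrors /AggregateModel: add the received model to the ensemble."""
        self.ensemble.append(model)
        return len(self.ensemble)


def train_on_device(device_name, images):
    """Placeholder for local YOLOv2 training; returns a dummy model record."""
    return {"device": device_name, "trained_on": tuple(images)}


server = EnsembleServer(image_ids=range(6), subset_size=2)
for device in ["iPhone 8", "iPhone X", "iPhone 11"]:
    subset = server.get_images()             # steps 1 and 2
    model = train_on_device(device, subset)  # step 3
    server.aggregate_model(model)            # steps 4 and 5
```

Handing out slices of a shrinking list is one simple way to guarantee that every device receives images that no other ensemble member has trained on.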

3 Theoretical framework

There is research that focuses on aspects similar to this research. These aspects include federated learning, model ensembles, edge training, object detection and more. While this research takes inspiration from many of these concepts, it differentiates itself by applying them to the field of deepfakes. Prior to this research, this application had yet to be explored.

3.1 Darknet YOLOv2

3.1.1 Basics

This research utilizes a library called Darknet. More specifically, the research uses the YOLOv2 (also known as YOLO9000) object detection system, which is a part of Darknet, for detecting deepfakes. YOLOv2 is an object detection system that uses Convolutional Neural Networks (CNN) to train models that can detect specific objects and their locations inside an image. YOLOv2 is a system based on supervised learning. This means that the datasets used for training are already labelled, i.e., the system trains on images that have already been annotated with the locations and the types of the objects in the images. After training, YOLOv2 can infer the type of object (real or fake in the case of this research) and the location of the object inside the given image. Figure 2 visualizes the output of the prediction from a sample dataset.

Figure 2, Visualization of YOLOv2 prediction

After training, YOLOv2 produces weights. Weights describe how strong the connections are between the neurons in the network. The weights determine how much impact the input (image) neurons have on the result (predictions). Figure 3 visualizes the structure of a simple neural network. In Figure 3, the input layer represents the image(s) that are inserted into the network, and the output represents the predicted boxes, classes, and confidence scores in the provided image(s).

Figure 3, Visualization of a simple neural network

YOLOv2 is selected due to its current compatibility with various machine learning frameworks such as TensorFlow and Keras, which allows for modification of the produced weights. This is useful in the aggregation process, where the weights produced by each edge device can be combined into an ensemble model.

3.1.2 Details

The detection process of YOLOv2 can be divided into 3 distinct phases:
1. Resize image.
2. Run Convolutional Neural Network (CNN).
3. Run the non-maximum suppression algorithm.

During the first phase, YOLOv2 takes an image as input and resizes it appropriately. For the YOLOv2 configuration used in this research, the images are resized to 608 x 608 pixels. This divides the image into a grid of 19 x 19 cells, where each cell is 32 x 32 pixels. YOLOv2 always resizes the image so that each cell is 32 x 32 pixels (Redmon & Farhadi, 2017), but the size the image is resized to can vary depending on which configuration is used. The higher the resolution of the image, the higher the accuracy that can be expected (Hui, 2019). During the second phase, YOLOv2 assigns each cell to run its own predictions and selects the object class (for example “fake”) that has the highest probability in each particular cell. This process is visualized in Figure 4.
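The relation between input size and grid size is plain integer arithmetic, as the following Python sketch shows. The function names are hypothetical, and the 416-pixel input is only an illustrative second value, not a configuration used in this research.

```python
CELL_SIZE = 32  # pixels per grid cell side, as stated above


def grid_size(input_size, cell_size=CELL_SIZE):
    """Number of grid cells along one side of a square input image."""
    if input_size % cell_size != 0:
        raise ValueError("input size must be a multiple of the cell size")
    return input_size // cell_size


def cell_of(x, y, cell_size=CELL_SIZE):
    """(column, row) of the grid cell that contains pixel (x, y)."""
    return (x // cell_size, y // cell_size)


cells_608 = grid_size(608)  # configuration used in this research
cells_416 = grid_size(416)  # illustrative smaller input
example_cell = cell_of(300, 40)
```

For a 608 x 608 input this gives 19 cells per side, matching the 19 x 19 grid described above.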

Figure 4, Visualization of YOLOv2 grid (from Towards Data Science (Gupta, 2020))

Figure 4 displays a “before and after” of the YOLOv2 grid separation, where each cell is displayed with the color of its corresponding object class. Note that YOLOv2 can run several predictions in a single cell, i.e., a single cell can have several bounding boxes. The algorithm, nonetheless, chooses a single class with the highest probability. This can cause certain limitations in how many objects can be detected in a small area.

The original YOLO network is built using 24 convolutional layers followed by 2 fully connected layers (Redmon et al., 2016). YOLOv2 replaces this design with a fully convolutional network built on the Darknet-19 backbone, which has 19 convolutional layers and 5 max-pooling layers (Redmon & Farhadi, 2017). YOLOv2, unlike other object detection frameworks, uses a single neural network to simultaneously obtain both the bounding boxes and their probabilities. Most object detection frameworks compute these separately, which makes them slower than YOLOv2 (Redmon et al., 2016). The convolutional layers are what actually find objects and their characteristics in the image based on the trained weights. For a more detailed description of Convolutional Neural Networks (CNN), see section 3.8.

During the last phase, the YOLOv2 system runs the non-maximum suppression algorithm. This algorithm filters out superfluous predictions, as there are likely to be a number of predictions in the image with very low confidence. The algorithm can be visualized with the following figure (Figure 5):

Figure 5, Visualization of non-maximum suppression (from Towards Data Science (Sambasivarao, 2021))

The image displays how YOLOv2 creates several similar predictions and then selects the prediction with the highest probability.
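A minimal sketch of greedy non-maximum suppression is given below to make the filtering step concrete. It is not the Darknet implementation: the corner-based box layout, the function names and the overlap threshold of 0.45 are assumptions made for illustration, while the score threshold of 0.5 mirrors the minimum confidence used in this research.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(
    boxes: Sequence[Box],
    scores: Sequence[float],
    score_threshold: float = 0.5,   # minimum confidence, as in section 3.7.6
    iou_threshold: float = 0.45,    # hypothetical overlap limit
) -> List[int]:
    """Return the indices of the boxes that survive suppression."""
    # Drop low-confidence predictions, then visit the rest from best to worst.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept
```

In this sketch, two near-identical predictions of the same face collapse into the one with the higher confidence, while a prediction elsewhere in the image is kept.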

3.2 Combining the models

A method called ensemble learning is used to combine the weights produced by each edge device into a single model. Several ensemble techniques are available for merging machine learning models. Three popular techniques are bagging, boosting, and stacking (Opitz & Maclin, 1999). For this research, the bagging method is used. Bagging combines the output of several so-called weak learners (the models produced by the edge devices) in order to minimize errors and increase the accuracy of the combined model. To combine the models into an ensemble, a framework called Keras is used. Keras requires the ensemble members to be Keras models, so the weights produced by the YOLOv2 system must be converted to Keras format. To do this, a library called YAD2K (Zelener, 2017) is utilized.

3.3 Ensemble methods

This section reviews each of the previously mentioned ensemble methods and gives the reasons that led to preferring one ensemble method over the others in the scope of this research.

3.3.1 Bagging

With the bagging method, multiple models of the same learning algorithm (homogeneous models) are trained with randomly picked subsets of the training dataset (Opitz & Maclin, 1999). When the training is finished, the models are ensembled, and a voting or averaging process decides the output of the ensemble. In the case of this research, bagging is the most suitable option for creating the ensemble model, as all models make use of the same algorithm. To perform the averaging process, the models are joined by an average layer that decides the output based on the average taken from the trained models. Figure 6 visualizes the bagging process.

Figure 6, Overview of ensemble process (Bagging)

Figure 7 below was generated using an application called Netron [7]. The figure shows the average layer of two models that have been ensembled using the bagging method. For this ensemble, the average layer [8] of the Keras framework is placed after the last layer in the neural network of the YOLOv2 models. The average layer calculates the average value of the outputs of all the models in the ensemble, and this average is the final output of the ensemble.

Figure 7, The average layer of an ensembled model with bagging
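The averaging step can be illustrated without any framework. The sketch below is not the Keras average layer itself; it only mimics the element-wise averaging described above on plain Python lists, and the function name `average_outputs` is hypothetical.

```python
from typing import List, Sequence


def average_outputs(member_outputs: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise average of the output vectors of all ensemble members."""
    if not member_outputs:
        raise ValueError("the ensemble needs at least one member")
    length = len(member_outputs[0])
    if any(len(output) != length for output in member_outputs):
        raise ValueError("all members must produce outputs of the same shape")
    count = len(member_outputs)
    # zip(*...) groups the i-th value of every member together.
    return [sum(values) / count for values in zip(*member_outputs)]


# Two members that disagree on how likely the classes "real" and "fake" are.
print(average_outputs([[0.25, 0.75], [0.75, 0.25]]))
```

Because every member contributes equally, one confident but wrong member is tempered by the others, which is the error-reducing effect that bagging aims for.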

3.3.2 Boosting

With the boosting method, models are trained with subsets of the dataset to produce weak models that need to be improved. After the initial training is finished, the process enters a second phase to “boost” the performance of the trained models by training the next model with a subset that includes the data that led to false detections in the previous model (Opitz & Maclin, 1999). This type of training, which builds on previous iterations, is called sequential training. Although boosting makes use of homogeneous models, it is not used in this research, since the models are trained in parallel on the edge in order to speed up the training process. Figure 8 visualizes the boosting process.

[7] https://netron.app/
[8] https://keras.io/api/layers/merging_layers/average/

Figure 8, Overview of ensemble process (Boosting)

3.3.3 Stacking

With the stacking method, several heterogeneous models (models trained using different algorithms) are trained with the whole dataset. Using an ensemble function, the outputs of these models are then combined and used as the input of the ensemble model. In the case of this research all the models are homogeneous, so this ensemble method is not used. Figure 9 visualizes the stacking process.

Figure 9, Overview of ensemble process (Stacking)

3.4 Programming languages and platforms

Xcode and Swift are used to create the iOS application. The Darknet library is built using the C programming language. Since Xcode allows the usage of C code in iOS applications, the Darknet library can be compiled directly for the iOS platform. This makes it very easy to use all the features of the Darknet library natively inside the iOS application, including training and predictions directly on the device. Additionally, the Swift programming language allows developers to call functions written in C directly from Swift code by using a “bridging header”. This enables the application code itself to be written entirely in Swift, which is an advantage if knowledge of C is lacking. The server used to aggregate the models is written in Python and makes use of the Flask web development framework. Python is used because there are many relevant machine learning libraries that, among other things, can be used to create ensembles of models.

3.5 Deepfakes explained

Deepfakes are typically defined as face manipulation in images and videos. This usually means that an existing image or video is used as reference, while a face from a secondary image or video is used to manipulate the original image or video (Katarya & Lal, 2020). Deepfakes of this type typically include face manipulations of celebrities or other famous individuals. Various machine learning algorithms are used to learn the facial characteristics and behaviors of these individuals in order to create a representation of the actual person that is as accurate and realistic as possible. A recent example is a deepfake of the actor Tom Cruise that surfaced on various social media platforms in early 2021, as displayed in Figure 10.

Figure 10, Deepfake of Tom Cruise (from The Verge (Vincent, 2021))

Another form of deepfake is a face generated by a Generative Adversarial Network (GAN). The website This Person Does Not Exist [9] (Goodfellow, 2019) displays the current capabilities of artificially generated faces using GAN. Deepfakes of the GAN type do not explicitly target any specific face or individual but instead generate entirely novel faces. This research focuses primarily on deepfake images generated through GAN. As is evident from images found on social media and the rest of the internet, deepfakes are becoming increasingly difficult to discern from real images. With the help of edge devices and large-scale machine learning, the authors hope to contribute to research in discerning whether a face is real or fake.

[9] https://thispersondoesnotexist.com/

3.6 Dataset

For this research, a dataset from the web platform Kaggle is used to train the deepfake detection model. Kaggle is a platform where data scientists can share datasets and code, host competitions and more. A random subset of a dataset containing a total of 140,000 images is used [10]. In the full dataset, half of the images are photos of real people and the other half are GAN-generated faces (Lu, 2020). The real images in the dataset were crawled from the web platform Flickr by a team at Nvidia [11]. The fake images in the dataset were generated by a research project called StyleGAN (Karras et al., 2019), also at Nvidia [12]. Both the real and fake images are RGB-coloured and share the same resolution (1024 x 1024). A subset containing 500 real and 500 fake images is used for the training of the YOLOv2 deepfake detection models in this research.

The dataset with 140,000 images is not annotated in the format required by the YOLOv2 system. An annotation refers to the bounding box (see definition in section 3.7.1) that tells YOLOv2 where the face (real or fake) is located in the image. This is what YOLOv2 uses to train its neural network. YOLOv2 requires that each image in the dataset has a corresponding text file, located in the same folder, that contains the annotations for that image. Each annotation should be formatted as follows:

<object-class> <x> <y> <width> <height>

Example: 1 0.467 0.616 0.619 0.645

The <x>, <y>, <width> and <height> attributes are relative, meaning they are clamped between the values 0 and 1. The <x> and <y> values represent the coordinates of the center of the bounding box. If, for example, <x> and <y> both have a value of 0.5, the center of the bounding box is at the exact center of the image itself.

In the example above, the <object-class> attribute is the number 1 (instead of “real” or “fake”). This is because the YOLOv2 framework reads the object classes from a separate text file, where each line in the file represents the corresponding object class. Our classes text file simply looks like the following:

[10] https://www.kaggle.com/xhlulu/140k-real-and-fake-faces
[11] https://github.com/NVlabs/ffhq-dataset
[12] https://github.com/NVlabs/stylegan

real
fake

In the case of this research, the number 0 represents the “real” object class and the number 1 represents the “fake” object class, since these classes occur on lines 0 and 1 of the file, respectively. Manually creating the annotations for each image would take a very long time. To label each image, a custom script is created using Node.js and a library called Blazeface [13] (Bazarevsky et al., 2019). The script automatically detects faces and annotates the class and location of the bounding box inside each image of the dataset. The source code for this script is available on GitHub [14].
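To make the annotation format concrete, the sketch below converts a face box given in pixels into one annotation line with a class index followed by the relative center coordinates, width and height. It is written in Python for illustration and is not the authors' Node.js script; the function name, the corner-based pixel box and the three-decimal rounding are assumptions.

```python
CLASS_NAMES = ["real", "fake"]  # line 0 and line 1 of the classes file


def to_yolo_annotation(class_name: str, box_px: tuple, image_w: int, image_h: int) -> str:
    """Format one annotation line: class index, then relative x, y, width, height."""
    x_min, y_min, x_max, y_max = box_px
    values = (
        (x_min + x_max) / 2 / image_w,  # x of the box center, relative to image width
        (y_min + y_max) / 2 / image_h,  # y of the box center, relative to image height
        (x_max - x_min) / image_w,      # box width, relative to image width
        (y_max - y_min) / image_h,      # box height, relative to image height
    )
    clamped = [min(max(v, 0.0), 1.0) for v in values]  # keep every value within 0..1
    class_index = CLASS_NAMES.index(class_name)
    return f"{class_index} " + " ".join(f"{v:.3f}" for v in clamped)


# A face covering the middle half of a 1024 x 1024 image, labelled as fake.
print(to_yolo_annotation("fake", (256, 256, 768, 768), 1024, 1024))
```

Because all four box values are relative, the same annotation stays valid when YOLOv2 resizes the image to its input resolution.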

3.7 Terminology

3.7.1 Bounding box

A bounding box represents the size and location of the detected object inside a given image. In the case of this research the detected objects are faces. The bounding box is usually visualized as a colored line around the detected object, together with the predicted class and the confidence of the prediction, as seen in Figure 11.

Figure 11, Visualization of YOLOv2 predictions of a deepfake and a real image

[13] https://github.com/tensorflow/tfjs-models/tree/master/blazeface
[14] https://github.com/xjarvik/face-annotator-yolo

3.7.2 True/False Positive/Negative

A True Positive (TP) occurs when an object detection model correctly identifies the desired object. In the case of this research, this is when a model correctly identifies an actual deepfake.

A False Positive (FP) occurs when an object detection model incorrectly identifies the desired object. In the case of this research, this is when a model incorrectly identifies a real face as a deepfake.

A True Negative (TN) occurs when an object detection model correctly identifies the undesired object. In the case of this research, this is when a model correctly identifies a real face.

A False Negative (FN) occurs when an object detection model incorrectly identifies the undesired object. In the case of this research, this is when a model incorrectly identifies a deepfake as a real face.

3.7.3 Recall

The recall defines how well a model finds all positives in a certain dataset. The recall can be defined as follows (Hui, 2020):

$$\text{Recall} = \frac{TP}{TP + FN}$$

3.7.4 Precision

The precision defines how accurately a model identifies positives, i.e., what percentage of its positive predictions are correct. The precision can be defined as follows (Hui, 2020):

$$\text{Precision} = \frac{TP}{TP + FP}$$

3.7.5 Accuracy

In the scope of this research, accuracy defines the percentage of correct predictions, whether the prediction is for a fake or a real image. The accuracy can be defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{\text{Number of images}}$$
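The three metrics can be written as small Python functions. This is an illustrative sketch with hypothetical function names; the sample values are the counts reported in the first row of Table 3 (TP 645, FN 355, TN 876, FP 54, out of 2000 images).

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual positives (deepfakes) that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of positive predictions that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp: int, tn: int, number_of_images: int) -> float:
    """Share of all images that were classified correctly, as defined in 3.7.5."""
    return (tp + tn) / number_of_images


tp, fn, tn, fp = 645, 355, 876, 54
print(f"recall    = {recall(tp, fn):.4f}")
print(f"precision = {precision(tp, fp):.4f}")
print(f"accuracy  = {accuracy(tp, tn, 2000):.4f}")
```

Note that accuracy divides by the total number of images rather than by the number of detections, so images on which the model makes no detection above the confidence threshold lower the score.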

3.7.6 Confidence

Each detection comes with a confidence score that indicates how certain the model is about the detection. Confidence is presented as a percentage in decimal format. In this research, all detections with a confidence of 50% and above are taken into account. Figure 11 shows how confidence is attached to a bounding box.

3.7.7 Aggregated vs. Non-Aggregated models

In this research, the term aggregated model refers to the ensemble model, which is created by merging several models in order to increase accuracy. The non-aggregated models are the models that are trained individually and are not directly a part of the ensemble.

3.7.8 Supervised vs. Unsupervised learning

The term supervised learning refers to training of machine learning algorithms using data that is labelled (Soni, 2020). The term unsupervised learning refers to training of machine learning algorithms using unlabelled data, so it is up to the model itself to discover patterns and learn from incoming data (Soni, 2020).

3.7.9 Overfitting

Overfitting typically occurs when a model has been trained for “too long”. This can cause the model to start learning characteristics that are specific to the data in a particular dataset. Consequently, the model can become less general, making it perform worse on new data it has not seen before (Brownlee, 2019). Because of this, it is important to make sure the dataset is both large and general enough to prevent overfitting. It is also important to make sure the model is not trained for too many iterations.

3.8 Convolutional Neural Network (CNN)

Convolutional Neural Networks (CNNs) are similar to traditional Artificial Neural Networks (ANNs). The key difference is that CNNs are primarily used for pattern recognition on images, which allows the architecture of the neural network to be optimized for image-related tasks (O'Shea & Nash, 2015). CNNs, like traditional ANNs, consist of an input layer, hidden layers and an output layer. The difference lies in the hidden layers. The hidden layers of a CNN are mainly of three types: convolutional layers, pooling layers and fully connected layers. The convolutional layers are the basis of the CNN.

3.8.1 Convolutional layers

A convolutional layer receives an input, transforms the input by performing a convolutional operation and then outputs the result to the next layer. It is in the convolutional layer that the pattern detection is made. The detected patterns can, for example, be edges, shapes and objects in images.

A convolutional layer consists of filters, and it is more precisely these filters that are responsible for the detection of patterns. The filters at the beginning of the network are relatively simple and can, for instance, detect simple shapes or edges, while the filters deeper in the network are more advanced and can detect objects such as cats or dogs. The filters can be thought of as matrices with a specified number of rows and columns, where each cell holds a value that is initialized with a random number. When a convolutional layer with, for example, a 3x3 filter receives an image as input, the filter will convolve over each 3x3 block of pixels in the image. The dot product of the filter and each 3x3 section of the input image is computed and stored until the filter has convolved over the whole image. The results are then passed on as the input for the next layer. This is known as a convolutional operation.
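The convolutional operation described above can be sketched in plain Python. The sketch makes several assumptions that are not stated in the text: a single-channel image, a stride of 1, no padding, and a hand-picked edge filter instead of randomly initialized values. The function name `convolve2d` is hypothetical.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and store one dot product per position."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Dot product of the kernel and the image section it currently covers.
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A 3x3 filter that responds to dark-to-bright transitions from left to right.
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
image_with_edge = [[0, 0, 1, 1] for _ in range(4)]
flat_image = [[1, 1, 1, 1] for _ in range(4)]

print(convolve2d(image_with_edge, vertical_edge))  # strong response at the edge
print(convolve2d(flat_image, vertical_edge))       # no response on a flat area
```

A 4x4 input and a 3x3 filter yield a 2x2 output here, since the filter fits in only two positions along each axis.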

3.8.2 Pooling layers

The pooling layers are responsible for downsampling the given input, i.e., reducing its width and height. A pooling layer with a filter of size 2x2 and stride 2 will halve the input in both width and height. If the layer is a max-pooling layer with filters of size 2x2 and stride 2, the highest value in each 2x2 section of the input is saved, as depicted in Figure 12. In contrast, an average-pooling layer calculates the average of the values in each 2x2 section of the input and saves that average.

Figure 12, example of a maxpool operation
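Both pooling variants can be sketched with one small Python function. The function name `pool2d` and the sample matrix are hypothetical; the 2x2 filter size and the stride of 2 follow the description above.

```python
def pool2d(matrix, size=2, stride=2, mode="max"):
    """Downsample a 2-D input by taking the max or average of each window."""
    output = []
    for r in range(0, len(matrix) - size + 1, stride):
        row = []
        for c in range(0, len(matrix[0]) - size + 1, stride):
            window = [matrix[r + i][c + j] for i in range(size) for j in range(size)]
            if mode == "max":
                row.append(max(window))
            elif mode == "average":
                row.append(sum(window) / len(window))
            else:
                raise ValueError("mode must be 'max' or 'average'")
        output.append(row)
    return output


sample = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(pool2d(sample, mode="max"))      # highest value of each 2x2 section
print(pool2d(sample, mode="average"))  # average value of each 2x2 section
```

The 4x4 input becomes a 2x2 output in both cases, which is the halving of width and height mentioned above.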

3.8.3 Fully connected layers

The fully connected layers are similar to ANNs (Albawi et al., 2017) and therefore perform the same operations as ANNs. Fully connected layers are the last layers in the network, together with the final output layer. The output of the last convolutional or pooling layer is flattened and used as input for the fully connected layer. Flattening is needed because the outputs of the convolutional and pooling layers are 3-dimensional matrices, whereas the result of the flattening process is a single vector, as depicted in Figure 13.

Figure 13, visualization of a flattening process
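As a small illustration of flattening, the sketch below turns a 3-dimensional nested list into a single vector. The element order (outermost dimension first) is an assumption, and the function name `flatten` is hypothetical.

```python
def flatten(volume):
    """Turn a 3-dimensional nested list into one flat vector."""
    return [value for plane in volume for row in plane for value in row]


# A 2 x 2 x 2 output volume from a hypothetical last pooling layer.
volume = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(flatten(volume))
```

The flattened vector has as many entries as the volume has values, so a fully connected layer can treat each value as one input.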

4 Results

The following results were gathered during the training and testing of each individual model and from the testing of the ensemble model. When testing a model, each detection gives a percentage that indicates how confident the model is in its prediction. High percentages on correct detections mean that the model is less likely to make wrong detections on similar data. As the main purpose of the research is to train models that detect deepfakes, it is important to check how reliably YOLOv2, as an object detection framework, actually detects deepfake images.

4.1 Training with different subsets

The purpose of training the YOLOv2 models with different subsets of images is to provide the ensemble server with models that fit the ensemble method used, which is bagging. The size of each training subset is 1000 images (500 real and 500 fake), and the number of iterations for all the models was limited to 2000. The models are also tested on a subset of 2000 images to check their accuracy and to control for the deviation in outputs that a different set of training images could cause. Tables 3-7 show the collected data from testing five different models. Time constraints allowed the authors to train a total of five models. Had more time been available, more models would have been trained, as this would have made the results more reliable. In the tables, TP, FN, TN, and FP represent the number of detections made out of the total 2000 images in the dataset. Minimum confidence represents the confidence threshold. For a better view of how accuracy differs between the models, see Figure 14.

Table 3, 1st model (2000 iterations)

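| Minimum Confidence | TP | FN | TN | FP | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0,5 | 645 | 355 | 876 | 54 | 0,7605 |
| 0,6 | 598 | 297 | 819 | 41 | 0,7085 |
| 0,7 | 498 | 225 | 700 | 29 | 0,599 |
| 0,8 | 254 | 108 | 362 | 7 | 0,308 |
| 0,9 | 5 | 0 | 1 | 0 | 0,003 |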

Table 4, 2nd model (2000 iterations)

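| Minimum Confidence | TP | FN | TN | FP | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0,5 | 751 | 260 | 885 | 91 | 0,818 |
| 0,6 | 720 | 229 | 861 | 80 | 0,7905 |
| 0,7 | 670 | 187 | 799 | 67 | 0,7345 |
| 0,8 | 567 | 119 | 636 | 50 | 0,6015 |
| 0,9 | 143 | 16 | 77 | 6 | 0,11 |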

Table 5, 3rd model (2000 iterations)

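| Minimum Confidence | TP | FN | TN | FP | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0,5 | 644 | 335 | 926 | 36 | 0,785 |
| 0,6 | 599 | 296 | 882 | 30 | 0,7405 |
| 0,7 | 529 | 231 | 762 | 17 | 0,6455 |
| 0,8 | 289 | 122 | 401 | 7 | 0,345 |
| 0,9 | 4 | 1 | 5 | 0 | 0,0045 |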

Table 6, 4th model (2000 iterations)

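| Minimum Confidence | TP | FN | TN | FP | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0,5 | 670 | 358 | 857 | 105 | 0,7635 |
| 0,6 | 636 | 322 | 829 | 85 | 0,7325 |
| 0,7 | 581 | 275 | 764 | 68 | 0,6725 |
| 0,8 | 456 | 183 | 537 | 40 | 0,4965 |
| 0,9 | 38 | 17 | 29 | 2 | 0,0335 |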

Table 7, 5th model (2000 iterations)

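| Minimum Confidence | TP | FN | TN | FP | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0,5 | 649 | 362 | 895 | 82 | 0,772 |
| 0,6 | 606 | 324 | 857 | 64 | 0,7315 |
| 0,7 | 550 | 263 | 780 | 52 | 0,665 |
| 0,8 | 401 | 159 | 537 | 25 | 0,469 |
| 0,9 | 7 | 3 | 22 | 1 | 0,0145 |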

Table 8, Ensemble of models 1-5

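| Minimum Confidence | TP | FN | TN | FP | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0,5 | 669 | 163 | 830 | 28 | 0,7495 |
| 0,6 | 632 | 158 | 791 | 27 | 0,7115 |
| 0,7 | 565 | 145 | 719 | 22 | 0,642 |
| 0,8 | 463 | 130 | 589 | 17 | 0,526 |
| 0,9 | 294 | 85 | 352 | 11 | 0,323 |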

[Chart “Accuracy Comparison”: Accuracy (y-axis) plotted against Minimum Confidence (x-axis) for the 1st to 5th models and the Ensemble Model]

Figure 14, Accuracy comparison between models trained with different subsets and the ensemble model.

4.2 Training with different number of iterations

The number of iterations each model is trained with plays a very important role in the final outputs of the trained model. Five models are trained with different numbers of iterations in order to decide which number of iterations results in the best accuracy (note that these models are not the same as in section 4.1). Again, additional models with higher numbers of iterations would have been trained if the authors had not been under time constraints. Note that all the models are trained and tested with the same dataset to make sure that the number of iterations is the only factor affecting the results.

Tables 9-13 show the collected data from testing five different models. For a better view of how accuracy differs between the models, see Figure 15.

Table 9, Model trained with 1000 iterations.

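| Minimum Confidence | TP | FN | TN | FP | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0,5 | 387 | 151 | 416 | 26 | 0,4015 |
| 0,6 | 198 | 76 | 208 | 15 | 0,203 |
| 0,7 | 66 | 15 | 75 | 3 | 0,0705 |
| 0,8 | 7 | 1 | 7 | 0 | 0,007 |
| 0,9 | 0 | 0 | 0 | 0 | 0 |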

Table 10, Model trained with 2000 iterations.

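| Minimum Confidence | TP | FN | TN | FP | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0,5 | 752 | 245 | 855 | 70 | 0,8035 |
| 0,6 | 714 | 213 | 788 | 52 | 0,751 |
| 0,7 | 630 | 167 | 630 | 36 | 0,63 |
| 0,8 | 446 | 88 | 292 | 17 | 0,369 |
| 0,9 | 5 | 0 | 1 | 0 | 0,003 |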

Table 11, Model trained with 3000 iterations.

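| Minimum Confidence | TP | FN | TN | FP | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0,5 | 826 | 187 | 864 | 114 | 0,845 |
| 0,6 | 799 | 157 | 826 | 93 | 0,8125 |
| 0,7 | 760 | 134 | 761 | 75 | 0,7605 |
| 0,8 | 648 | 82 | 566 | 45 | 0,607 |
| 0,9 | 40 | 6 | 24 | 1 | 0,032 |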

Table 12, Model trained with 4000 iterations.

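| Minimum Confidence | TP | FN | TN | FP | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0,5 | 701 | 314 | 769 | 214 | 0,735 |
| 0,6 | 663 | 268 | 723 | 180 | 0,693 |
| 0,7 | 615 | 216 | 655 | 141 | 0,635 |
| 0,8 | 547 | 165 | 557 | 91 | 0,552 |
| 0,9 | 325 | 52 | 186 | 27 | 0,2555 |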

Table 13, Model trained with 5000 iterations.

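| Minimum Confidence | TP | FN | TN | FP | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0,5 | 785 | 225 | 902 | 92 | 0,8435 |
| 0,6 | 755 | 197 | 876 | 79 | 0,8155 |
| 0,7 | 720 | 165 | 828 | 60 | 0,774 |
| 0,8 | 641 | 113 | 659 | 39 | 0,65 |
| 0,9 | 57 | 7 | 41 | 2 | 0,049 |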

Table 14 displays how long it took to train the models presented in Tables 9-13:

Table 14, Time to train the different models on the personal computer.

| | 1000 iterations | 2000 iterations | 3000 iterations | 4000 iterations | 5000 iterations | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Total training time (hours) | 3,48 | 6,55 | 9,58 | 13,64 | 16,25 | |
| Average per iteration (seconds) | 12,528 | 11,79 | 11,496 | 12,276 | 11,7 | 11,96 |
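The per-iteration values in Table 14 follow directly from the total training times. The short sketch below reproduces that arithmetic; the dictionary and function names are hypothetical, and the numbers are copied from the table with decimal points instead of decimal commas.

```python
# Total training time in hours per number of iterations (from Table 14).
TRAINING_HOURS = {1000: 3.48, 2000: 6.55, 3000: 9.58, 4000: 13.64, 5000: 16.25}


def seconds_per_iteration(total_hours: float, iterations: int) -> float:
    """Convert a total training time in hours to seconds per iteration."""
    return total_hours * 3600 / iterations


per_iteration = {n: seconds_per_iteration(h, n) for n, h in TRAINING_HOURS.items()}
average = sum(per_iteration.values()) / len(per_iteration)

for iterations, seconds in per_iteration.items():
    print(f"{iterations} iterations: {seconds:.3f} s per iteration")
print(f"average: {average:.2f} s per iteration")
```

The time per iteration stays close to 12 seconds regardless of the number of iterations, so total training time on the computer grows roughly linearly with the iteration count.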

[Chart “Accuracy Comparison”: Accuracy (y-axis) plotted against Minimum Confidence (x-axis) for the 1000, 2000, 3000, 4000 and 5000 iterations models and the Ensemble model]

Figure 15, Accuracy comparison between models trained with different number of iterations and the ensemble model.

4.3 Edge training results

The following data is collected while training a model for only one iteration on an iPhone 8, iPhone X, iPhone 11 and iPhone 12 Pro. The source code for the application that was created for this purpose is available on GitHub [15]. Figure 16 shows the utilization of resources on an iPhone 11 while training a YOLOv2 model. Table 15 shows how long it takes to train a model for one iteration on each of the mentioned mobile devices. In contrast, training one iteration on a computer with the specifications displayed in Table 2 took on average 11,96 seconds, as seen in Table 14.

Table 15, Time to train a model with one iteration on a mobile device.

| | iPhone 8 | iPhone X | iPhone 11 | iPhone 12 Pro |
| --- | --- | --- | --- | --- |
| Training time | DNF | 34m 12s | 20m 33s | 20m 2s |

[15] https://github.com/xjarvik/yolo-ios

Figure 16, Resource utilization during training a model on iPhone 11

Both the iPhone X and the iPhone 12 Pro report the same resource usage as shown in Figure 16. However, when running the training on the iPhone 8, the application crashes due to a “memory issue”, as shown in the screenshot in Figure 17.

Figure 17, Crash report from iOS application

The crash most certainly occurs because the device runs out of memory. The iPhone 8 has a total of 2 GB of RAM, and the YOLOv2 application allocates roughly 1.6 GB. Consequently, the device runs out of memory and force-closes the application.

4.4 Ensemble results

The ensemble server runs on the computer with the specifications displayed in Table 2. The source code of this server is available on GitHub [16]. The ensemble is created from the models displayed in Tables 3-7. The results from testing the ensemble model are presented in Table 8, and the accuracy of the ensemble is compared to the accuracy of each individual non-aggregated model in Figures 14 and 15. Table 16 shows the time it takes to convert YOLOv2 weights to Keras format and the time it takes to aggregate each model into the ensemble.

[16] https://github.com/EH94/ensemble-server

Table 16, Time to convert and aggregate models.

| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Conversion time (seconds) | 10,3 | 6,65 | 6,6 | 6,67 | 6,76 | 7,396 |
| Aggregation time (seconds) | N/A | 4,45 | 8,9 | 12,98 | 16,87 | 10,8 |

5 Discussion

5.1 Limitations

A major limitation is the fact that the neural engine found in iPhone devices starting from iPhone 8 is not utilized. If the training utilized the neural engine, the speed of training could likely be increased substantially (Shirakawa, 2021). The chosen object detection framework (YOLOv2) does not come with built-in support for the neural engine. Additionally, the neural engine found in iPhone devices is only accessible through software provided by Apple, such as Core ML.

As training of models takes a substantial amount of time, and the time period for this research is heavily constrained, it was not possible to train a large number of models to be used when testing the feasibility of the proposed solution.

During the research, the researchers only had access to a limited set of iPhone devices. Specifically, the iPhone models were limited to iPhone 8, iPhone X, iPhone 11, and iPhone 12 Pro. This makes it harder to draw conclusions regarding how the proposed solution will perform on devices other than the ones used for this research.

Another notable limitation is the choice of object detection framework, namely YOLOv2. While this framework is able to produce usable detections, there are other frameworks that are more specialized for deepfake detection and may therefore be likely to detect a larger number of deepfakes than the results provided in this research.

5.2 Result discussion

As seen in Tables 9-13, few of the models are able to achieve the minimum acceptable accuracy as defined in chapter 2, although some of the models get very close. More specifically, the models trained with 3000 and 5000 iterations are the only two models able to achieve a minimum accuracy of 70% where the confidence score is a minimum of 70%. The model with 1000 iterations is the only model that shows nearly unusable results. Only models trained for 2000 iterations or more produce promising results. Of the models trained with between 1000 and 5000 iterations, the model with 5000 iterations is the one producing the best result with regard to the acceptance rates.

An interesting observation from the results is that the models trained with more iterations actually produce a lower number of detections in general. However, these models tend to produce predictions with higher confidence, as seen in Figure 15.

When it comes to the ensemble model, it performs about the same as the non-aggregated models (the five models trained on 2000 iterations, as seen in Tables 3-7) in low-confidence predictions. However, the ensemble model is able to make more high-confidence predictions than any of the single models on their own. This can be seen in Figure 14, where the ensemble model has a higher detection rate at confidence scores above roughly 85%, compared to the members of the ensemble.

Precision and recall are two metrics that are usually used to evaluate models, but in the case of this research, accuracy is used instead. The reason for not using precision or recall is that a model might make very few predictions, which will lead to a biased result. For example, if a model makes one prediction with over 90% confidence that an image is fake and makes no other predictions in the entire dataset, the resulting precision will be 100%, as the model did not make any incorrect detections. This is misleading, as weak models tend to make very few predictions with high confidence (see results from the model in Table 9). Meanwhile, accuracy as defined in this research takes all images into consideration, even those where no detection was made.
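The effect can be checked with the counts that Table 9 reports for the model trained with 1000 iterations at a minimum confidence of 0,8 (TP = 7, FN = 1, TN = 7, FP = 0, out of 2000 images). The worked calculation below only applies the definitions from sections 3.7.4 and 3.7.5 to those counts.

```latex
\text{Precision} = \frac{TP}{TP + FP} = \frac{7}{7 + 0} = 1
\qquad
\text{Accuracy} = \frac{TP + TN}{\text{Number of images}} = \frac{7 + 7}{2000} = 0{,}007
```

A precision of 100% next to an accuracy of 0,7% shows why precision alone would overstate the quality of a model that makes very few detections.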

The best performing models in this research are the ensemble model and the model trained on 5000 iterations. Compared to the models in (Afchar et al., 2018; Guarnera et al., 2020), which are trained using different algorithms, the models in this research have worse accuracy. This is most likely due to the fact that the models in the other research use algorithms that are more optimized for this specific task of detecting deepfakes. YOLOv2 in contrast, is just a generic object detection framework designed to detect a variety of different objects. This indicates that there are better algorithms that can be used to detect deepfakes.

The individual ensemble members are trained, because of time constraints, on a computer instead of on the edge, which was the original intention. As can be seen in Table 15, if these models had been trained on the edge devices, the training process would have taken several weeks to complete. However, this does not affect the accuracy of the models in any way as the training process is identical on both of these types of devices.
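The estimate of several weeks can be reproduced from Table 15 under one loudly stated assumption: that every iteration on a device takes as long as the single measured iteration. The sketch below is illustrative, with hypothetical names, and uses the 2000 iterations that each ensemble member was trained for.

```python
SECONDS_PER_DAY = 86_400

# Time for one training iteration in seconds (Tables 14 and 15).
SECONDS_PER_ITERATION = {
    "Computer (Table 14 average)": 11.96,
    "iPhone X": 34 * 60 + 12,
    "iPhone 11": 20 * 60 + 33,
    "iPhone 12 Pro": 20 * 60 + 2,
}


def training_days(seconds_per_iteration: float, iterations: int) -> float:
    """Estimated training time in days, assuming a constant time per iteration."""
    return seconds_per_iteration * iterations / SECONDS_PER_DAY


for device, seconds in SECONDS_PER_ITERATION.items():
    days = training_days(seconds, 2000)
    print(f"{device}: {days:.1f} days for 2000 iterations")
```

Under this assumption a single 2000-iteration model would occupy an iPhone 11 for about four weeks, compared with under seven hours on the computer.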

As seen in Table 16, the time it takes to convert the YOLOv2 models to Keras format and the time it takes to aggregate the models into the ensemble is measured. From the table, it can be seen that the more models the ensemble is built with, the longer the aggregation process takes for new models being added into the ensemble. Additionally, the used ensemble method causes each new model to be added as an individual member of the ensemble, causing the size of the ensemble to grow linearly. Each converted Keras model sits at around 200 MB. Each model adds an additional 200 MB to the ensemble. The consequences of this are discussed in the conclusions section.
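The linear growth can be made concrete with a short calculation. The sketch below assumes exactly 200 MB per member and decimal gigabytes (1 GB = 1000 MB); the function name is hypothetical, and the member count of 4000 is the figure used as an example in section 6.1.

```python
MODEL_SIZE_MB = 200  # approximate size of one converted Keras model (from the text)


def ensemble_size_gb(member_count: int, model_size_mb: float = MODEL_SIZE_MB) -> float:
    """Approximate size of the ensemble in gigabytes (1 GB = 1000 MB)."""
    return member_count * model_size_mb / 1000


for members in (5, 100, 4000):
    print(f"{members} members: about {ensemble_size_gb(members):.0f} GB")
```

With the five members used in this research the ensemble stays at about 1 GB, but at thousands of members it would reach hundreds of gigabytes.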

5.3 Method discussion

This study followed an experimental and empirical research method, which proved to be a successful approach. The chosen ensemble method is, in theory, the most suitable of the currently available ensemble methods for this research. However, it is not optimal for object detection, since each model needs to be trained with at least 2000 iterations and this is, as concluded in section 6.1, not feasible on the edge devices. Therefore, the authors propose a new ensemble method, as seen in section 6.2.3, that aims to solve this issue by minimizing the time each device has to spend training a specific model. The edge training of the models has been carried out using the Darknet framework, but there are other tools that may be more suitable for detecting deepfakes with higher accuracy (Afchar et al., 2018). Additionally, there are other frameworks that may be more specialized for training models on the edge. Notably, the Core ML framework by Apple utilizes the neural engine found in iPhone devices. Using this framework would almost certainly be faster than using YOLOv2. However, as noted in section 1.4, this was not an option due to the challenges that were encountered while testing the Core ML framework.

6 Conclusions and further research

6.1 Conclusions

The main conclusion that can be made is that the proposed solution is not feasible. The reasons for this are two-fold:

1. The edge devices used (iPhones) are far too limited in their computational capabilities. On average, a computer utilizing a dedicated graphics card can finish a single iteration of training in 11,96 seconds, as shown in Table 14. In comparison, a single iteration on the iPhone 11 takes over 20 minutes. In a real-world scenario, it is unlikely that users would want their device to train for this long, since the training utilizes roughly 100% of the CPU, causing battery usage to be at the highest level during this time. This is especially true assuming the model needs to be trained for more than a single iteration. As discussed in section 6.2, the usage of the neural engine and/or integrated GPU could be a potential solution to this problem.
2. The proposed ensemble method causes the aggregated model to explode in size. Each edge device generates a single weights file of over 200 MB, which is sent to the server to be aggregated. Since each model adds an additional 200 MB to the aggregated model, it is clear that the aggregated model (which requires several thousand iterations to be trained with sufficient accuracy) will likely take up several hundred gigabytes of storage once it has been fully trained. Additionally, because of the way the chosen ensemble method functions, each “sub-model” inside the aggregated model has to run its own predictions separately. Assuming the aggregated model contains 4000 models, this implies that the aggregated model will have to run 4000 individual predictions before averaging the output at the end. It is likely that this will take a very long time.

To answer the first research question: How can we combine the computational power of multiple mobile devices to create an efficient deepfake detection solution? The proposed solution does not provide an efficient deepfake detection solution by training on the edge. This is mainly due to a lack of support in the utilized frameworks for performing machine learning efficiently with the resources available on the edge devices. The resource utilization and the training time are simply too high for the solution to be deemed feasible.

To answer the second research question: How does the aggregation of models affect the accuracy of the final trained model in comparison to a non-aggregated model? This research concludes that the ensemble model does, in fact, improve the detection accuracy for high-confidence detections, and that it stays on par with the non-aggregated models for predictions with lower confidence.

6.1.1 Practical implications

As deepfakes become more advanced, they will likely become harder to detect (Katarya & Lal, 2020). A solution that allows potentially millions of edge devices to contribute their processing power could speed up the training of accurate and reliable deepfake detection models. The results of this research, however, demonstrate the difficulties of using edge devices, especially smartphones, to train highly accurate deepfake detection models. The choice of object detection framework and of aggregation (ensemble) method are key considerations that this research addresses. This research could contribute to the field by raising awareness of what makes a solution suitable for cooperative edge training.

6.1.2 Scientific implications

The main implication for the scientific community concerns continued research on machine learning on the edge, especially regarding deepfake detection. This research should provide a basis for understanding some of the challenges that may arise in continued research on deepfake detection on the edge. Further scientific research could possibly benefit from the proposed ensemble technique described in section 6.2.4. That technique aims to, among other things, shorten the time each edge device needs to train a deepfake detection model. If further research is able to find a more suitable method for training the models efficiently on the edge, it could become a good complement to this research.

6.2 Further research

6.2.1 Android

The edge training of machine learning models in this research was carried out primarily on iOS devices, which account for a relatively small share of the available mobile devices (Team Counterpoint, 2021). Therefore, it could be worthwhile for future researchers to repeat the study on Android devices in order to cover a larger part of the smartphone market. Android might offer a different approach to training on the edge that could make use of the AI accelerators found in Android devices, such as the neural processor in the Qualcomm Snapdragon 888 (called the “Hexagon” processor) (Qualcomm Technologies, Inc., 2021). As this research has shown, it is currently not feasible to train models on the edge using iOS devices. Although training on an Android device is likely to produce similar results, the outcome could differ depending on the Android device, making this an interesting topic to explore.

6.2.2 Apple support for object detection on the edge

Apple’s machine learning framework Core ML does not currently support training an object detection model on the iOS platform. This leaves room for future research to explore newer versions of Core ML that might support training these models. Core ML training makes use of the neural engine found in newer Apple devices, which should make training more efficient since that hardware is optimized for machine learning workloads. Additionally, Apple provides the Metal framework, which could also be used to increase the speed by utilizing the integrated GPU.

6.2.3 Federated learning

The solution for edge training presented in this research might not be suitable for all applications. For instance, if the training is done on the users’ private data, it is important that the data is actually kept private and not shared with any third party. The solution presented in this research does not take this into account. Further research could benefit from training deepfake detection models with federated learning, a decentralized approach where the users’ data is kept private. Examples of existing methods that could be relevant include Federated Stochastic Gradient Descent (FedSGD) and Federated Averaging (FedAvg).

In FedSGD, a random subset of the edge devices is chosen. Each chosen edge device then computes a gradient using the dataset that is already present on the device (hence, no training datasets are transferred between the server and the edge devices). Briefly, stochastic gradient descent can be described as an iterative process for finding a local minimum of an objective function (Ruder, 2020). In this research, the objective is to find the most accurate model for detecting whether a face is real or not (Kronovet, 2017). Once the gradient has been calculated, it is sent to a server that aggregates the gradients from all the chosen edge devices. The aggregation server then calculates the average of these gradients and uses it to take a single gradient descent step.

In FedAvg, the training process begins with an initial model on the aggregation server. The aggregation server sends this model to each edge device. Each edge device then updates the model over one or several iterations using the local dataset present on the device. Once the edge device has updated the model, the model is sent to the aggregation server. The aggregation server then averages the models sent by the edge devices and sends the aggregated model back to each edge device, after which the process repeats (Li et al., 2020).

The main difference between FedSGD and FedAvg is the data that is sent to the aggregation server. In FedSGD, a calculated gradient is sent, while in FedAvg, the updated model is sent. Additionally, in FedAvg, each edge device can update the model over several iterations, while in FedSGD, a single gradient descent step is made every time the received gradients are averaged.

The solution presented in this research is similar to FedAvg. The main difference is that, in the solution proposed in this research, the aggregated model is never sent to the edge devices. This means that each edge device must train a new model from scratch. Additionally, a subset of the training dataset needs to be sent from the aggregation server to each edge device. This means that each edge device must use centralized, non-private data for training. In contrast, FedAvg can make use of users’ private data for training.
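The difference between the two methods can be illustrated with a minimal Python sketch. It is not the method used in this research: the model is a toy linear regression instead of YOLOv2, every device takes part in every round (instead of a random subset), all devices hold the same number of samples, and the names `gradient`, `average`, `fedsgd_round` and `fedavg_round` are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]   # (x, y) pair held locally on one device
Weights = List[float]          # [slope, intercept] of a toy linear model


def gradient(w: Weights, data: Sequence[Sample]) -> Weights:
    """Gradient of the mean squared error of y ~ w[0] * x + w[1] on local data."""
    n = len(data)
    g_slope = sum(2 * (w[0] * x + w[1] - y) * x for x, y in data) / n
    g_intercept = sum(2 * (w[0] * x + w[1] - y) for x, y in data) / n
    return [g_slope, g_intercept]


def average(vectors: Sequence[Weights]) -> Weights:
    """Element-wise mean: the aggregation step performed by the server."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fedsgd_round(w: Weights, clients: Sequence[Sequence[Sample]], lr: float) -> Weights:
    """FedSGD: devices send gradients; the server averages them and takes one step."""
    g = average([gradient(w, data) for data in clients])
    return [wi - lr * gi for wi, gi in zip(w, g)]


def fedavg_round(w: Weights, clients: Sequence[Sequence[Sample]],
                 lr: float, local_steps: int) -> Weights:
    """FedAvg: devices train locally and send models; the server averages the weights."""
    local_models = []
    for data in clients:
        local = list(w)  # each device starts from the current global model
        for _ in range(local_steps):
            g = gradient(local, data)
            local = [wi - lr * gi for wi, gi in zip(local, g)]
        local_models.append(local)
    return average(local_models)


if __name__ == "__main__":
    # Three devices with private samples of y = 2x + 1 (illustrative data only).
    clients = [[(x, 2 * x + 1) for x in xs]
               for xs in ([-1.0, -0.5, 0.0], [0.0, 0.5, 1.0], [-1.0, 1.0, 0.5])]
    w_sgd, w_avg = [0.0, 0.0], [0.0, 0.0]
    for _ in range(200):
        w_sgd = fedsgd_round(w_sgd, clients, lr=0.1)
        w_avg = fedavg_round(w_avg, clients, lr=0.1, local_steps=5)
    print("FedSGD:", w_sgd)
    print("FedAvg:", w_avg)
```

In both variants only gradients or model weights leave a device, never the samples themselves. On this toy data both variants approach a slope of 2 and an intercept of 1; FedAvg performs more local computation per communication round.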

6.2.4 Another ensemble method

This research uses bagging as the ensemble method, as it is currently the most suitable method. An alternative is to combine two different ensemble methods, bagging and boosting. Boosting can be used to continue training the same model on a different device, which allows more devices to contribute to the training process without exhausting the resources of any single device. The drawback of boosting is that the training must be performed sequentially, which is not optimal with regard to time (each device has to wait for the previous device to finish). The proposed solution, shown in Figure 18, uses bagging to combine the models resulting from each boosting ensemble. Further research can explore this solution in order to speed up the training process on the edge devices.
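The structure of this combined method can be sketched in Python as follows. The sketch follows the terminology used above, where "boosting" means that a model is handed from device to device for continued training. The one-parameter toy model, the data, and the names `train_on_device`, `train_chain` and `ensemble_predict` are assumptions made purely for illustration and do not correspond to the Darknet implementation.

```python
from statistics import mean
from typing import Sequence, Tuple

Sample = Tuple[float, float]   # (x, y) pair, a stand-in for a labelled image


def train_on_device(weight: float, data: Sequence[Sample],
                    iterations: int = 20, lr: float = 0.1) -> float:
    """Continue training a toy model y ~ weight * x on one device's data."""
    for _ in range(iterations):
        grad = mean(2 * (weight * x - y) * x for x, y in data)
        weight -= lr * grad
    return weight


def train_chain(devices: Sequence[Sequence[Sample]], start: float = 0.0) -> float:
    """Sequential stage: each device picks up where the previous one stopped."""
    weight = start
    for data in devices:
        weight = train_on_device(weight, data)
    return weight


def ensemble_predict(weights: Sequence[float], x: float) -> float:
    """Bagging stage: average the predictions of the models from each chain."""
    return mean(w * x for w in weights)


if __name__ == "__main__":
    # Four devices with samples of y = 3x, grouped into two chains of two devices.
    chains = [
        [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5)]],
        [[(1.0, 3.0), (1.5, 4.5)], [(2.0, 6.0), (0.5, 1.5)]],
    ]
    models = [train_chain(chain) for chain in chains]  # chains can run in parallel
    print("sub-models:", len(models))
    print("prediction for x = 2:", ensemble_predict(models, 2.0))
```

In this sketch, four devices produce two sub-models rather than four, because only the final model of each chain enters the bagging stage. Under these assumptions, the combined method would also limit the growth of the aggregated model, at the cost of the waiting time within each chain.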

Figure 18, Alternative ensemble method

7 References

Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017). Understanding of a convolutional neural network. 2017 International Conference on Engineering and Technology (ICET). https://doi.org/10.1109/icengtechnol.2017.8308186

Afchar, D., Nozick, V., Yamagishi, J., & Echizen, I. (2018). MesoNet: a Compact Facial Video Forgery Detection Network. 2018 IEEE International Workshop on Information Forensics and Security (WIFS). https://doi.org/10.1109/wifs.2018.8630761

Bazarevsky, V., Kartynnik, Y., Vakunov, A., Raveendran, K., & Grundmann, M. (2019). BlazeFace: Sub-millisecond neural face detection on mobile GPUs. arXiv preprint arXiv:1907.05047.

Ben-Yair, S. (2020, November 11). Updating Google Photos’ storage policy to build for the future. Retrieved March 9, 2021, from https://blog.google/products/photos/storage-changes/

Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., ... & Roselander, J. (2019). Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046.

Brown, H. (2019, April 26). What are the most popular reasons why people use their smartphones every day? | Gadget Cover. Retrieved May 21, 2021, from https://www.gadget-cover.com/blog/what-are-the-most-popular-reasons-why-people-use-their-smartphones-every-day

Brownlee, J. (2019, August 12). Overfitting and Underfitting With Machine Learning Algorithms. Retrieved May 10, 2021, from https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/

Cutress, I. (2020, August 25). ‘Better Yield on 5nm than 7nm’: TSMC Update on Defect Rates for N5. Retrieved May 21, 2021, from https://www.anandtech.com/show/16028/better-yield-on-5nm-than-7nm--update-on-defect-rates-for-n5

Goodfellow, I. (2019). This Person Does Not Exist. Retrieved April 27, 2021, from https://thispersondoesnotexist.com/

Guarnera, L., Giudice, O., & Battiato, S. (2020). Deepfake detection by analyzing convolutional traces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 666-667).

Guera, D., & Delp, E. J. (2018). Deepfake Video Detection Using Recurrent Neural Networks. 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). https://doi.org/10.1109/avss.2018.8639163

Gupta, M. (2020, May 30). YOLO — You Only Look Once - Towards Data Science. Retrieved April 16, 2021, from https://towardsdatascience.com/yolo-you-only-look-once-3dbdbb608ec4

Hecht, L. E. (2019, December 19). Add It Up: How Long Does a Machine Learning Deployment Take? Retrieved May 22, 2021, from https://thenewstack.io/add-it-up-how-long-does-a-machine-learning-deployment-take/

Hui, J. (2019, August 27). Real-time Object Detection with YOLO, YOLOv2 and now YOLOv3. Retrieved May 23, 2021, from https://jonathan-hui.medium.com/real-time-object-detection-with-yolo-yolov2-28b1b93e2088

Hui, J. (2020, February 7). mAP (mean Average Precision) for Object Detection - Jonathan Hui. Retrieved May 13, 2021, from https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., ... & Zhao, S. (2019). Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.

Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4401-4410).

Katarya, R., & Lal, A. (2020). A Study on Combating Emerging Threat of Deepfake Weaponization. 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC). https://doi.org/10.1109/i-smac49090.2020.9243588

Kronovet, D. (2017, March 28). Objective Functions in Machine Learning. Retrieved June 8, 2021, from http://kronosapiens.github.io/blog/2017/03/28/objective-functions-in-machine-learning.html

Li, L., Fan, Y., Tse, M., & Lin, K. Y. (2020). A review of applications in federated learning. Computers & Industrial Engineering, 149, 106854. https://doi.org/10.1016/j.cie.2020.106854

Lu, X. (2020, February 10). 140k Real and Fake Faces. Retrieved March 15, 2021, from https://www.kaggle.com/xhlulu/140k-real-and-fake-faces

Nguyen, D. C., Ding, M., Pham, Q. V., Pathirana, P. N., Le, L. B., Seneviratne, A., Li, J., Niyato, D., & Poor, H. V. (2021). Federated Learning Meets Blockchain in Edge Computing: Opportunities and Challenges. IEEE Internet of Things Journal, 1. https://doi.org/10.1109/jiot.2021.3072611

Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of artificial intelligence research, 11, 169-198.

O'Shea, K., & Nash, R. (2015). An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458.

Pathak, A. R., Pandey, M., & Rautaray, S. (2018). Application of Deep Learning for Object Detection. Procedia Computer Science, 132, 1706–1717. https://doi.org/10.1016/j.procs.2018.05.144

Pokhrel, S. R., & Choi, J. (2020). Federated learning with blockchain for autonomous vehicles: Analysis and design challenges. IEEE Transactions on Communications, 68(8), 4734-4746.

Qualcomm Technologies, Inc. (2021, March 22). Exploring the AI capabilities of the Qualcomm Snapdragon 888 Mobile. Retrieved May 15, 2021, from https://www.qualcomm.com/news/onq/2020/12/02/exploring-ai-capabilities-qualcomm-snapdragon-888-mobile-platform

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779-788).

Redmon, J. (2016). YOLO: Real-Time Object Detection. Retrieved April 18, 2021, from https://pjreddie.com/darknet/yolov2/

Redmon, J., & Farhadi, A. (2017). YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7263-7271).

Ruder, S. (2020, March 20). An overview of gradient descent optimization algorithms. Retrieved June 8, 2021, from https://ruder.io/optimizing-gradient-descent/

Sambasivarao, K. (2021, April 30). Non-maximum Suppression (NMS) - Towards Data Science. Retrieved April 1, 2021, from https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c

Schaller, R. R. (1997). Moore's law: past, present and future. IEEE Spectrum, 34(6), 52–59.

Shi, W., Cao, J., Zhang, Q., Li, Y., & Xu, L. (2016). Edge Computing: Vision and Challenges. IEEE Internet of Things Journal, 3(5), 637–646. https://doi.org/10.1109/jiot.2016.2579198

Shirakawa, T. (2021, April 16). Apple Neural Engine in M1 SoC Shows Incredible Performance in Prediction. Retrieved May 16, 2021, from https://medium.com/macoclock/apple-neural-engine-in-m1-soc-shows-incredible-performance-in-core-ml-prediction-918de9f2ad4c

Soni, D. (2020, July 22). Supervised vs. Unsupervised Learning - Towards Data Science. Retrieved May 23, 2021, from https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Team Counterpoint. (2021, May 7). Global Smartphone Market Share: By Quarter. Retrieved May 15, 2021, from https://www.counterpointresearch.com/global-smartphone-share/

Thompson, N. C., Greenewald, K., Lee, K., & Manso, G. F. (2020). The computational limits of deep learning. arXiv preprint arXiv:2007.05558.

Vincent, J. (2021, March 5). TikTok Tom Cruise deepfake creator: public shouldn’t worry about ‘one-click fakes.’ Retrieved April 30, 2021, from https://www.theverge.com/2021/3/5/22314980/tom-cruise-deepfake-tiktok-videos-ai-impersonator-chris-ume-miles-fisher

Wang, Y., & Dantcheva, A. (2020, May). A video is worth more than 1000 lies. Comparing 3DCNN approaches for detecting deepfakes. In FG'20, 15th IEEE International Conference on Automatic Face and Gesture Recognition, May 18-22, 2020, Buenos Aires, Argentina.

Wardynski, D. J. (2019, December 19). End Of Moore’s Law - What’s Next For The Future Of Computing. Retrieved May 21, 2021, from https://www.brainspire.com/blog/end-of-moores-law-whats-next-for-the-future-of-computing#:%7E:text=Computer%20systems%20can%20still%20be,just%20at%20a%20slower%20rate.

Williams, E. D. (2004). Environmental impacts of microchip manufacture. Thin Solid Films, 461(1), 2–6. https://doi.org/10.1016/j.tsf.2004.02.049

Xu, J., Glicksberg, B. S., Su, C., Walker, P., Bian, J., & Wang, F. (2020). Federated Learning for Healthcare Informatics. Journal of Healthcare Informatics Research, 5(1), 1–19. https://doi.org/10.1007/s41666-020-00082-4

Zelener, A. (2017). allanzelener/YAD2K. Retrieved April 5, 2021, from https://github.com/allanzelener/YAD2K
