Evaluation of CNNs for Deepfake Detection

Abstract

“Deepfake” (a portmanteau of “deep learning” and “fake”) refers to synthetic content in which an individual in digital media is replaced, partly or completely, by another person using deep learning. Recent advancements have made it possible to manipulate digital media so that individuals appear to state and perform actions that never occurred. Deepfake technology has raised serious concerns, chiefly about the validity of information and the spread of misinformation and disinformation. Misinformation and disinformation have the potential to threaten national security and international diplomacy by spreading political propaganda, interfering with governmental elections, or eroding public trust in government officials and organizations. High-accuracy deep learning models can, in turn, be effective in detecting and flagging deepfake content (e.g., videos). In this research, a convolutional neural network (CNN) is trained using stochastic gradient descent (SGD) on the Deepfake Detection Challenge (DFDC) dataset to detect deepfakes. The machine learning (ML) model achieved a final validation accuracy of 80 percent after 20 epochs, indicating that CNNs can provide an efficient tool against the spread of deepfakes. Possible improvements include utilizing the full DFDC dataset, adding dropout layers, and applying transfer learning (TL).

Keywords: convolutional neural network, deepfake, deepfake detection, fake news, evaluation.


1. Introduction

“Deepfakes” are artificially created yet realistic pieces of media in which, through deep learning, individuals appear to state and perform actions that did not necessarily happen [9]. They are created using neural networks that examine and analyze datasets to learn to “mimic a person’s facial expressions, mannerisms, voice, and inflections” [9]. Although this tool may be beneficial in the film and visual-effects industry, it is problematic and even dangerous. Deepfakes are often used to spread fake news and misinformation [9]; misinformation produced with the intent of deceiving individuals is known as disinformation. Misinformation and disinformation undermine reality, leading people to believe incorrect and unreliable information that may create insecurity within a country, interfere with organizational and governmental operations, and cause distrust and division among a population. In addition, because deepfakes can misrepresent the ideas, opinions, and beliefs of an individual, they are also considered a form of identity theft by the European Association of Communications Agencies [4]. A visual depiction of a person performing an action is far more convincing than a fabricated statement within an article and, if misused, is therefore far more likely to distort another person’s perception of that individual. As deepfakes are often powerful catalysts of misinformation and disinformation, in addition to misrepresenting the identity of individuals, it is ethically and morally justified to detect and flag them. Consequently, this project is designed to evaluate the ability of a convolutional neural network trained with stochastic gradient descent to accurately and reliably detect deepfakes.

2. Approach

The solution has three main components: a dataset, a neural network model, and a machine learning algorithm.

2.1. Dataset Preprocessing

The dataset utilized, the Deepfake Detection Challenge (DFDC) dataset, was created collaboratively by Facebook AI and a number of other organizations [3, 5]. It contained 400 training videos and 400 testing videos [5]. We were only able to utilize the 400 training videos, since the labels for the 400 testing videos were not disclosed to the public. All the training videos had 300 image frames, and to balance having too much data against having too little, we decided to save 10 image frames from each video, resulting in 4,000 image frames. The conversion from video to images was necessary because CNNs operate on image-formatted data. To further prepare the data, a facial recognition algorithm provided by the Face Recognition Python library was used to find and crop each image to the appropriate face within it. If no face was recognized, the image was discarded. After the facial-recognition preprocessing, 3,989 images remained: 3,221 labeled fake and 768 labeled real, based on the labels of their source videos.
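The frame-sampling step described above can be sketched as follows. The helper name and the even-spacing strategy are assumptions, since the paper does not specify how the 10 frames per video were chosen:

```python
def frame_indices(total_frames, n_keep):
    """Pick n_keep evenly spaced frame indices out of total_frames."""
    step = total_frames // n_keep
    return [i * step for i in range(n_keep)]

# In the full pipeline, each selected frame would then be read from the
# video (e.g., with OpenCV's cv2.VideoCapture) and cropped to the face
# found by the Face Recognition library's face_locations(); frames with
# no detected face are discarded, which is how 4,000 frames became 3,989.
indices = frame_indices(300, 10)
```

Applied to all 400 videos, this yields the 4,000 frames reported above before face filtering.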

2.2. Neural Network Architecture

A convolutional neural network model was created using the TensorFlow Python library. Convolutional neural networks have three main components beyond conventional neural network layers. A feature detector, often called a kernel or filter [see Figure 1], is moved across a two-dimensional (2D) array of numbers that correspond to the image's pixel values [10]. This 2D array is called a tensor. As the kernel passes over the pixel values, the information is convolved using different operations [10]. For example, in Figure 1, the pixel values within the kernel are compared with the pixel values in the image, and the number of identical values is summed and written to the feature map.

Figure 1: Convolutional operations with a 3 × 3 kernel [10]
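The sliding-kernel operation can be illustrated with a minimal sketch. This uses the standard multiply-and-sum form of the convolution, rather than the match-counting variant shown in Figure 1:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and sum the elementwise products."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1      # output height
    ow = image.shape[1] - kw + 1      # output width
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.arange(16).reshape(4, 4)   # 4x4 input tensor
kernel = np.ones((3, 3))              # 3x3 kernel of ones
fmap = conv2d(image, kernel)          # 2x2 feature map
```

With a 3 × 3 kernel and no padding, a 4 × 4 input shrinks to a 2 × 2 feature map, matching the dimension reduction visible in Figure 1.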

The pooling layer reduces the dimensions of the feature map [10]. There are numerous methods of achieving this, the most popular being max pooling. In max pooling, a patch of pixels is examined, the pixel with the highest value is saved, and the others are discarded, resulting in a smaller tensor [see Figure 2 for an example] [10]. Within the flattening layer, the 2D tensor is converted to a one-dimensional (1D) vector of numbers; this step is performed because the subsequent neural network layers cannot take multidimensional tensors as input [10]. Our model has two convolutional layers, two max pooling layers, and one flattening layer. One fully connected hidden layer with 16 nodes is added to the architecture.

Figure 2: Max Pooling with a four-by-four (4 × 4) Input tensor & 2 × 2 output tensor (Image by Author)
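The architecture described above (two convolutional layers, two max-pooling layers, a flattening layer, and a 16-node hidden layer) can be sketched in Keras. The filter counts, kernel sizes, activations, and input resolution are assumptions, as the paper does not report them:

```python
import tensorflow as tf

IMG_SIZE = 128  # assumed resolution of the cropped face images

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),  # conv layer 1
    tf.keras.layers.MaxPooling2D((2, 2)),                   # max pooling 1
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),  # conv layer 2
    tf.keras.layers.MaxPooling2D((2, 2)),                   # max pooling 2
    tf.keras.layers.Flatten(),                              # 2D -> 1D
    tf.keras.layers.Dense(16, activation="relu"),           # 16-node hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),         # real vs. fake output
])
```

The single sigmoid output suits the binary real/fake labels in the DFDC data.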

2.3. Machine Learning Algorithm

The machine learning algorithm uses the data to optimize the CNN to detect deepfakes. More specifically, we utilized stochastic gradient descent (SGD), which in recent years has become a well-known ML algorithm for training image classification models. Within neural networks, the connections between the nodes of each layer are called weights [see Figure 3, variable "W"] [6].

Figure 3: Nodes within artificial neural networks (Image by Author)

In addition, each node has its own bias [variable "b"]. Weights and biases are numerical values that determine the final output based on the input [6]; in effect, they determine the importance of the features extracted from the image input. SGD is a set of derivative-based calculations that adjusts the weights and biases of each node over the training epochs to achieve the highest number of correct predictions [7]. The goal is to minimize the cost function [7].
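The update rule behind SGD can be sketched for a single parameter. This toy example minimizes the quadratic cost (w − 3)², so the repeated update w ← w − η · dC/dw drives the weight toward the minimizer w = 3:

```python
def sgd_step(w, grad, lr=0.1):
    """One SGD update: move the weight against the gradient of the cost."""
    return w - lr * grad

w = 0.0                   # initial weight
for _ in range(100):      # 100 update steps
    grad = 2 * (w - 3)    # derivative of the cost (w - 3)**2
    w = sgd_step(w, grad)
# w is now very close to 3, the value that minimizes the cost
```

In the real model the same rule is applied to every weight and bias, with gradients computed by backpropagation on batches of training images.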

3. Results & Analysis

The final training accuracy of the deepfake detection model, trained using 20 epochs (training rounds), was 80.63 percent, with the validation accuracy being 80.43 percent.

Figure 4: The graph represents both the training and validation accuracy of the CNN model over the period of 20 epochs.

The highest validation accuracy reported by Facebook within the Deepfake Detection Challenge was 65.18 percent. The validation dataset that we used was derived from the initial training samples: 20 percent of the total images were held out as a validation set, meaning they were never shown to the model during training and were used only to test and validate the accuracy of the model. The 65.18 percent validation accuracy reported by Facebook was calculated on a different dataset [5]. That said, considering that our validation dataset was, like Facebook's, never shown to the model except for validation, we deduce that the accuracy of our model would not drop substantially if it were validated on the same dataset Facebook used. The 80.43 percent validation accuracy therefore indicates that the model performs well relative to the highest accuracy reported by Facebook, even though the validation datasets differed.

One interesting pattern in the results was that the validation accuracy was often greater than the training accuracy, suggesting that the model was somewhat better tuned to the validation dataset than to the training dataset. Nevertheless, the largest difference between validation and training accuracy was 0.55 percentage points, which is insignificant compared to the final 80.43 percent validation accuracy.

Throughout the training period, from epoch zero to nineteen (twenty epochs in total), the accuracy followed a linear trendline with the equation

y = 0.148x + 78.1

This may indicate a constant increase in accuracy with the number of training epochs. However, that trend will presumably not continue indefinitely due to overfitting. Overfitting occurs when the validation accuracy of a model begins decreasing while the training accuracy continues to increase [2]. It happens because the model begins to identify features unique to the training data that do not generalize to the validation data; it is a result of a lack of generalization. Overfitting appears within the training period of our model: at epoch sixteen, the highest validation accuracy of 80.80 percent is achieved, and at epochs seventeen, eighteen, and nineteen the validation accuracy decreases to 80.68, 80.68, and 80.43 percent, respectively. Compared to the training accuracy of 80.19, 80.34, 80.38, and 80.63 percent at epochs sixteen through nineteen, this indicates overfitting.

4. Conclusion

Convolutional neural networks and stochastic gradient descent can be used as high-accuracy tools to detect deepfake media on social media and informational platforms. Deepfake detection models could be utilized by various organizations to detect deepfakes more efficiently, with fewer resources and less time. For confirmation and further evaluation, they could require a human reviewer to conduct an in-depth examination of the media piece and provide the final verdict on whether it is fake or real content. The repository for the code of this project can be found at this link.

5. Improvement

Certain aspects of this project can be modified or enhanced to achieve a higher final validation accuracy.

5.1. More Epochs

The ML model in this paper was trained for 20 epochs. According to the observed trendline, increasing the number of epochs should increase the accuracy of the model roughly linearly. However, overfitting has to be considered when choosing the optimal number of epochs.

5.2. Full DFDC Dataset

The Deepfake Detection Challenge (DFDC) dataset that was used contained only 400 training videos, which were broken down into image frames. Access to the full dataset requires an application process and, if permitted, provides 124,000 videos for training [3]. Using the full dataset would improve the generalization, and hence the real-world accuracy, of the final ML model. However, it would require more resources, including more time and energy to train and create the model.

5.3. Transfer Learning

Our model architecture was reached through trial and error, since the final accuracy of a model cannot be calculated theoretically: there are no deterministic principles or guidelines for predicting the output of a neural network, which is why neural networks are often referred to as “black boxes.” In machine learning, transfer learning reuses knowledge from models designed for one problem to solve related problems [8]. An example of such a reusable architecture is the Visual Geometry Group (VGG) network [8]. Pretrained architectures can serve many ML applications, potentially including ours.

5.4. Dropout Layer

Dropout layers randomly set node outputs to zero at a given rate during training to prevent overfitting [1]. This discourages the ML model from relying on any particular node and pushes it toward features that generalize.
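The dropout idea in Section 5.4 can be sketched on a vector of activations. This minimal example uses the common “inverted dropout” scaling so that the expected activation is unchanged; it illustrates the mechanism rather than TensorFlow's own Dropout layer:

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Randomly zero activations with probability `rate` during training."""
    if not training:
        return x                            # inference: dropout is a no-op
    mask = rng.random(x.shape) >= rate      # keep each unit with prob 1 - rate
    return x * mask / (1.0 - rate)          # rescale to preserve the mean

rng = np.random.default_rng(0)
activations = np.ones(1000)
dropped = dropout(activations, rate=0.5, rng=rng)
```

Roughly half of the activations are zeroed, while the survivors are doubled so the layer's expected output is preserved at inference time.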

6. Further Inquiry

The full DFDC dataset could be investigated to examine whether higher accuracy can be achieved with the same ML architecture, algorithm, and number of epochs. This would require a longer training period and more energy; however, the final model would be substantially more generalized.

References

[1] Pierre Baldi and Peter J Sadowski. “Understanding dropout”. In: Advances in neural information processing systems 26 (2013).

[2] Tom Dietterich. “Overfitting and undercomputing in machine learning”. In: ACM computing surveys (CSUR) 27.3 (1995), pp. 326–327.

[3] Brian Dolhansky et al. “The deepfake detection challenge (dfdc) dataset”. In: arXiv preprint arXiv:2006.07397 (2020).

[4] EACA. Deepfake: the fake that ‘steals’ your face (and privacy). July 2021. url: https://eaca.eu/news/deepfake-the-fake-that-steals-your-face-and-privacy/.

[5] Cristian Canton Ferrer et al. Deepfake Detection Challenge Results: An open initiative to advance AI. June 2020. url: https://ai.facebook.com/blog/deepfake-detection-challenge-results-an-open-initiative-to-advance-ai/.

[6] Harrison Kinsley and Daniel Kukieła. Neural Networks from Scratch in Python: Building Neural Networks in Raw Python. Kinsley, 2020.

[7] XuanKhanh Nguyen. Minimizing the cost function: Gradient descent. Aug. 2020. url: https://towardsdatascience.com/minimizing-the-cost-function-gradient-descent-a5dd6b5350e1.

[8] Lisa Torrey and Jude Shavlik. “Transfer learning”. In: Handbook of research on machine learning applications and trends: algorithms, methods, and techniques. IGI global, 2010, pp. 242–264.

[9] Mika Westerlund. “The emergence of deepfake technology: A review”. In: Technology Innovation Management Review 9.11 (2019).

[10] Rikiya Yamashita et al. “Convolutional neural networks: an overview and application in radiology”. In: Insights into imaging 9.4 (2018), pp. 611– 629.
