Written by Nicolò Bonettini, Divyaraj Jayvantsinh Solanki, Surya Koppisetti, and Gaurav Bharaj
Introduction
With the rapid growth of AI media generation technologies, it has become trivial to create a deepfake of someone with very few resources and no prior expertise. Detecting existing AI-generated media while remaining robust to future generation methods requires improved and generalizable detection techniques. To that end, we presented our work on Audio-Visual Feature Fusion (AVFF) at CVPR 2024.
Vision Transformers (ViTs), together with multi-modal Self-Supervised Learning (SSL) paradigms such as OpenAI’s CLIP, have shown impressive performance on a variety of computer vision tasks, but they remain underexplored for deepfake detection. Most existing detection methods either use uni-modal cues (e.g., only video or only audio) or rely on supervised training to capture the dissonance between the audio and visual modalities. While the former disregards audio-visual correspondences entirely, the latter predominantly focuses on discerning audio-visual cues within the existing deepfake training corpus, thereby potentially overlooking correspondences that can help detect unseen deepfakes.
We overcome these limitations by employing a two-stage multi-modal learning method that explicitly captures the correspondence between the audio and visual modalities of real videos and exploits the lack of such correspondence to detect deepfaked videos. We rely on the fact that real face video contains inherently rich audio-visual correspondence, since mouth articulations (visemes) are intrinsically correlated with speech units (phonemes). To capture this correspondence, we design a multi-modal self-supervised learning framework motivated by contrastive learning as well as masked image modeling, described in the following section.
Method
We design the first stage to pursue representation learning via self-supervision on real videos only. To extract rich cross-modal representations, we use contrastive learning and autoencoding objectives, and introduce a novel audio-visual complementary masking and feature fusion strategy.
In the second stage, we tune the learned representations for the downstream task of deepfake classification via supervised learning on both real and fake videos.
Stage 1
Drawing inspiration from CAV-MAE [1], we propose a dual self-supervised learning approach that incorporates multi-modal learning from both audio and video as well as autoencoding objectives. We describe the pipeline in Figure 1.
As seen in previous training recipes like CLIP [2], contrastive learning across modalities helps learn robust cross-modal features. However, we found in preliminary experiments that relying on it alone does not establish a strong correspondence between the audio and visual modalities. We therefore augment the method with an autoencoding objective and introduce a complementary masking and cross-modal fusion strategy into the autoencoding framework. Intuitively, we want the model to learn information about the audio from video features and vice versa. To achieve this, we mask mutually exclusive time-step tokens for the audio and video modalities and reconstruct each masked time step from the other modality's visible tokens. This encourages the model to learn the correspondence between the two modalities as fully as possible. We verified this by pairing a video with a random audio clip and observed that the reconstructions were noisier and less faithful than when the video was paired with its corresponding audio.
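As a concrete illustration of the contrastive part of the objective, the sketch below shows a symmetric InfoNCE-style loss between pooled audio and video embeddings of the same clip, in the spirit of CLIP [2] and CAV-MAE [1]; the tensor shapes, pooling, and temperature value are illustrative assumptions rather than the exact settings used in the paper.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: pull together the audio and video
    embeddings of the same clip, push apart embeddings of different clips.

    audio_emb, video_emb: (batch, dim) pooled unimodal embeddings.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                 # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    loss_a2v = F.cross_entropy(logits, targets)      # match each audio to its video
    loss_v2a = F.cross_entropy(logits.t(), targets)  # match each video to its audio
    return 0.5 * (loss_a2v + loss_v2a)
```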
Figure 1. Audio-Visual Representation Learning Stage. A real input sample, x ∈ Dr, with corresponding audio and visual tokens (xa, xv), is split along the temporal dimension, creating K slices (illustrated with K = 8 in the figure). The temporal slices are then encoded using unimodal transformers, Ea and Ev, to yield feature embeddings a and v. We then complementarily mask 50% of the temporal slices in (a, v) with binary masks (Ma, Mv). The visible slices of a and v are passed through A2V and V2A networks respectively, to generate cross-modal slices av and va. The masked slices of a and v are then replaced with the corresponding slices in av and va. The resulting cross-modal fusion representations, a′ and v′, are input to unimodal decoders to obtain the audio and visual reconstructions, x̂a and x̂v.
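To make the complementary masking and fusion step more concrete, here is a minimal sketch following the notation of Figure 1. The A2V and V2A modules are placeholders for the actual cross-modal networks, and for simplicity they translate every slice here, whereas the paper only translates the visible ones; treat this as an illustration of the idea, not the reference implementation.

```python
import torch
import torch.nn as nn

def complementary_fusion(a, v, a2v: nn.Module, v2a: nn.Module):
    """Complementarily mask audio/video temporal slices and fill each
    modality's masked slices with features translated from the other.

    a, v: (batch, K, dim) unimodal embeddings over K temporal slices.
    a2v, v2a: cross-modal networks mapping one modality's slices to the other.
    """
    B, K, _ = a.shape
    # Choose half of the K slice indices to mask in the audio stream (M_a);
    # the complementary half is masked in the video stream (M_v).
    perm = torch.stack([torch.randperm(K, device=a.device) for _ in range(B)])
    mask_a = torch.zeros(B, K, dtype=torch.bool, device=a.device)
    mask_a.scatter_(1, perm[:, : K // 2], True)
    mask_v = ~mask_a

    av = a2v(a)  # audio-to-video features per slice
    va = v2a(v)  # video-to-audio features per slice

    # Replace each modality's masked slices with the other modality's
    # translated features; only translations originating from visible slices
    # are used, because the two masks are mutually exclusive.
    a_fused = torch.where(mask_a.unsqueeze(-1), va, a)  # a'
    v_fused = torch.where(mask_v.unsqueeze(-1), av, v)  # v'
    return a_fused, v_fused
```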
Stage 2
The goal of this stage is to exploit the cross-modal features learned in the previous stage to detect deepfakes in which either or both of the audio and visual modalities have been manipulated. For this, we reuse the encoders and the cross-modal networks trained in the representation learning phase, and train a classifier to tell real videos and deepfakes apart in a supervised fashion. The classification pipeline is depicted in Figure 2. The classifier is a simple MLP, and the whole architecture is trained with a binary cross-entropy loss.
Since the learned representations from stage 1 have a high audio-visual correspondence for real videos, we expect the classifier to exploit the lack of audio-visual cohesion of synthesized samples in distinguishing between real and fake.
Figure 2. Deepfake Classification Stage. Given a sample x ∈ Ddf, consisting of audio and visual inputs xa and xv, we obtain the unimodal features (a,v) and the cross-modal embeddings (av, va). For each modality, the unimodal and cross-modal embeddings are concatenated to obtain (fa, fv). A classifier network is then trained to take (fa, fv) as input and predict if the input is real or fake.
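A rough sketch of the classification head described above: the unimodal and cross-modal embeddings are concatenated per modality and fed to a small MLP trained with binary cross-entropy. Layer sizes, pooling, and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeepfakeClassifier(nn.Module):
    """Toy MLP head over concatenated unimodal + cross-modal embeddings.
    Assumes inputs are already pooled to (batch, dim); hidden size is illustrative."""

    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        # f_a = [a ; va] and f_v = [v ; av] give a concatenated input of size 4 * dim.
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit: real (0) vs. fake (1)
        )

    def forward(self, a, v, av, va):
        f_a = torch.cat([a, va], dim=-1)  # audio branch features
        f_v = torch.cat([v, av], dim=-1)  # visual branch features
        return self.mlp(torch.cat([f_a, f_v], dim=-1)).squeeze(-1)

# Training step (schematic): binary cross-entropy on the logits.
# loss = nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
```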
Results
We evaluate our models both qualitatively and quantitatively. For qualitative analysis, we explore the feature space of the model after Stage 1 training on real and fake samples. For quantitative analysis, we use accuracy and AUC metrics to measure our model’s ability to detect deepfakes. For our experiments, we used the FakeAVCeleb [3] dataset to evaluate intra-dataset performance across different generation methods and the KoDF [4] dataset to evaluate inter-dataset performance. The FakeAVCeleb dataset contains media with Real Video-Real Audio (RVRA), Real Video-Fake Audio (RVFA), Fake Video-Real Audio (FVRA), and Fake Video-Fake Audio (FVFA), generated using Wav2Lip [5], FaceSwap [6], and FSGAN [7] for video and SV2TTS [8] for audio. The KoDF dataset contains real and fake videos of Korean subjects, where the fakes are generated using six different methods.
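For reference, the two reported metrics can be computed from per-video fake-probability scores as sketched below (a minimal example using scikit-learn; the decision threshold is an assumption).

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(labels: np.ndarray, scores: np.ndarray, threshold: float = 0.5):
    """labels: 0 = real, 1 = fake; scores: predicted probability of 'fake'."""
    acc = accuracy_score(labels, (scores >= threshold).astype(int))
    auc = roc_auc_score(labels, scores)  # threshold-free ranking metric
    return acc, auc
```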
Feature space
For qualitative analysis, we visualize t-SNE plots of the embeddings for random samples from each category of the FakeAVCeleb dataset in Figure 3. We do not expose Stage 1 to any deepfake samples during training, yet we still observe clear clustering of real and fake samples. Distinct clusters are evident for each deepfake category, which indicates that our representations are not only capable of distinguishing real samples from fakes, but also capture subtle cues that differentiate deepfake algorithms without encountering any of them during training.
A further analysis of the t-SNE visualizations reveals that the samples belonging to adjacent clusters are related in terms of the deepfake algorithms used to generate them. For instance, FVRA-WL and FVFA-WL, which are adjacent, both employ Wav2Lip to synthesize the deepfakes (refer to the encircled regions in Figure 3).
These findings underscore the efficacy of our novel audio-visual representation learning paradigm.
Figure 3. The t-SNE Visualization of the Embeddings at the end of the Representation Learning Stage. A clear distinction is seen between the representations of real and fake videos, as well as between different deepfake categories. Further analysis indicates that samples of adjacent clusters are generated using the same deepfake algorithm, which we encircle manually to highlight the clusters.
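A minimal sketch of how such a visualization can be produced from the Stage 1 embeddings, assuming the embeddings and their category labels have already been extracted; the t-SNE hyperparameters here are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings, category_labels):
    """embeddings: (N, dim) array of Stage 1 features.
    category_labels: length-N list of strings such as 'RVRA', 'FVFA-WL', ...,
    used only to color the scatter plot."""
    coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
    for cat in sorted(set(category_labels)):
        idx = [i for i, c in enumerate(category_labels) if c == cat]
        plt.scatter(coords[idx, 0], coords[idx, 1], s=5, label=cat)
    plt.legend()
    plt.show()
```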
Downstream Real vs. Fake Classification
As shown in Table 1, our approach demonstrates substantial improvements over the existing state of the art, in both audio-visual (AVoiD-DF [9]) and unimodal (RealForensics [10]) deepfake detection. Compared to AVoiD-DF, our method improves accuracy by 14.9% (+9.9% AUC), and compared to RealForensics it improves accuracy by 8.7% (+4.5% AUC).
Overall, the superior performance of audio-visual methods that leverage cross-modal correspondence is evident; they outperform uni-modal approaches that rely on single-modality artifacts (i.e., visual anomalies) introduced by deepfake algorithms.
Table 1. Intra-dataset performances. We evaluate our method against baselines using a 70%-30% train-test split on the FakeAVCeleb dataset, where we achieve state-of-the-art performance by significant margins. Best result is in bold, and second best per modality is underlined.
We also evaluated performance in the cross-manipulation scenario. To do so, we partition the FakeAVCeleb dataset into five categories based on the algorithms used to generate the deepfakes: RVFA, FVRA-WL, FVFA-FS, FVFA-GAN, and FVFA-WL. Using these categories, we evaluate the model by leaving one category out for testing while training on the remaining categories. Results are reported in Table 2.
Our method achieves the best performance in almost all cases (and is on par with the best in the rest) and, notably, yields consistently strong performance (AUC > 92%, AP > 93%) across all categories, while other baselines (Xception [11], LipForensics [12], FTCN [13], AV-DFD [14]) fall short on the FVFA-GAN and RVFA categories. On the KoDF dataset, our method achieves an AUC of 95.5%, compared to 93.6% for the state-of-the-art RealForensics, showing our model’s ability to generalize not only to a different dataset but also to different demographics.
Table 2. Cross-Manipulation Generalization on FakeAVCeleb. We evaluate the model’s performance by leaving out one category for testing while training on the rest. We consider the following 5 categories in FakeAVCeleb: (i) RVFA: Real Visual - Fake Audio (SV2TTS), (ii) FVRA-WL: Fake Visual - Real Audio (Wav2Lip), (iii) FVFA-FS: Fake Visual - Fake Audio (FaceSwap + Wav2Lip + SV2TTS), (iv) FVFA-GAN: Fake Visual - Fake Audio (FaceSwapGAN + Wav2Lip + SV2TTS), and (v) FVFA-WL: Fake Visual - Fake Audio (Wav2Lip + SV2TTS). The column titles correspond to the category of the test set. AVG-FV corresponds to the average metrics of the categories containing fake visuals. Best result is in bold, and the second best is underlined. Our method yields consistently high performance across all manipulation methods while yielding state-of-the-art performance for several categories.
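The cross-manipulation protocol itself is straightforward to express in code; the sketch below is schematic, with `train_and_evaluate` standing in as a hypothetical helper for the full training and evaluation pipeline.

```python
CATEGORIES = ["RVFA", "FVRA-WL", "FVFA-FS", "FVFA-GAN", "FVFA-WL"]

def cross_manipulation_eval(samples_by_category, train_and_evaluate):
    """Leave-one-category-out evaluation on FakeAVCeleb.

    samples_by_category: dict mapping category name -> list of samples.
    train_and_evaluate: hypothetical callable (train_samples, test_samples) -> metrics.
    """
    results = {}
    for held_out in CATEGORIES:
        test_set = samples_by_category[held_out]
        train_set = [s for cat in CATEGORIES if cat != held_out
                     for s in samples_by_category[cat]]
        results[held_out] = train_and_evaluate(train_set, test_set)
    return results
```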
Summary
In this work, we proposed a multi-modal self-supervised learning framework that enables a ViT-based architecture to learn features representing the cohesion between the audio and video modalities of real videos, and to utilize these learned features for the task of deepfake detection. We explored two orthogonal self-supervised learning methods, contrastive learning and autoencoding, in conjunction with our novel complementary masking and cross-modal feature fusion techniques, for more robust feature learning for downstream tasks. Our contributions lead to a new state-of-the-art model capable of detecting visual, audio, or combined manipulations. Our method shows significant improvements in both in-distribution performance and generalization to unseen manipulations.
Read more about the paper here in Proceedings of CVPR 2024.
References
[1] Gong, Yuan, et al. "Contrastive audio-visual masked autoencoder." arXiv preprint arXiv:2210.07839 (2022).
[2] Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.
[3] Khalid, Hasam, et al. "FakeAVCeleb: A novel audio-video multimodal deepfake dataset." arXiv preprint arXiv:2108.05080 (2021).
[4] Kwon, Patrick, et al. "KoDF: A large-scale Korean deepfake detection dataset." Proceedings of the IEEE/CVF international conference on computer vision. 2021.
[5] Prajwal, K. R., et al. "A lip sync expert is all you need for speech to lip generation in the wild." Proceedings of the 28th ACM international conference on multimedia. 2020.
[6] Korshunova, Iryna, et al. "Fast face-swap using convolutional neural networks." Proceedings of the IEEE international conference on computer vision. 2017.
[7] Nirkin, Yuval, Yosi Keller, and Tal Hassner. "Fsgan: Subject agnostic face swapping and reenactment." Proceedings of the IEEE/CVF international conference on computer vision. 2019.
[8] Jia, Ye, et al. "Transfer learning from speaker verification to multispeaker text-to-speech synthesis." Advances in neural information processing systems 31 (2018).
[9] Yang, Wenyuan, et al. "AVoiD-DF: Audio-visual joint learning for detecting deepfake." IEEE Transactions on Information Forensics and Security 18 (2023): 2015-2029.
[10] Haliassos, Alexandros, et al. "Leveraging real talking faces via self-supervision for robust forgery detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[11] Rossler, Andreas, et al. "Faceforensics++: Learning to detect manipulated facial images." Proceedings of the IEEE/CVF international conference on computer vision. 2019.
[12] Haliassos, Alexandros, et al. "Lips don't lie: A generalisable and robust approach to face forgery detection." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.
[13] Zheng, Yinglin, et al. "Exploring temporal coherence for more general video face forgery detection." Proceedings of the IEEE/CVF international conference on computer vision. 2021.
[14] Zhou, Yipin, and Ser-Nam Lim. "Joint audio-visual deepfake detection." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.