Technical

Aug 6, 2024

Attention-Based Contrastive Learning for Audio Spoof Detection


Written by Chirag Goel, Surya Koppisetti, and Gaurav Bharaj

Introduction

With new technologies emerging for audio deepfake generation, it has become easy to create a convincing deepfake of someone using just a few minutes of their speech. Given how readily available these technologies are, robust deepfake detection systems are more important than ever. At Reality Defender, we are at the forefront of detecting malicious audio deepfakes, and one of our research works on the topic was presented at INTERSPEECH 2023.

After dominating the field of natural language processing (NLP), attention-based models like vision transformers (ViTs) have made substantial progress in computer vision. More recently, researchers have demonstrated their abilities on speech and audio classification tasks. However, a similar advancement has not yet been made for audio deepfake detection.

At Reality Defender, we set out to explore ViTs for robust, real-world audio deepfake detection. ViTs are usually pre-trained on real-only datasets such as ImageNet [1] and AudioSet [2], which means they see no deepfake-synthesized data during pre-training. Today's deepfake synthesis methods generate data that is only subtly different from real data, and these differences are not captured by pre-training on real-only datasets. This makes it difficult to use off-the-shelf pre-trained models directly for the deepfake detection task. Pre-training typically requires millions of data points, but no audio deepfake datasets of that size are available. For example, the ASVspoof 2021 [3] audio deepfake detection dataset contains only 25,380 training samples in the Logical Access track. Adapting ViTs for audio deepfake detection is therefore a non-trivial task.

To overcome this limitation, we need a training recipe that can tune the weights of a ViT model when only a limited amount of training data is available. As a solution, we pursued contrastive learning and built a two-stage training framework, outlined next.

Modeling Approach

We introduce an audio ViT to learn efficient representations for deepfake detection. As an example model architecture, we consider the self-supervised audio spectrogram transformer (SSAST) [4]. To train the SSAST, we propose a two-stage framework based on contrastive learning. The goal of Stage I is to learn discriminative representations for the real and fake classes; the goal of Stage II is to use these representations to build an efficient classifier. For representation learning in Stage I, we adopt the popular Siamese network recipe [5] and introduce a cross-attention branch into the conventional Siamese framework to learn more discriminative representations for real and synthesized audio. For classification in Stage II, we use a simple multi-layer perceptron (MLP). Our audio ViT outperforms the baseline models in the ASVspoof 2021 challenge and competes with the other best-performing models.

Fig 1: SSAST-CL: A two-stage contrastive learning framework to train the SSAST model for audio spoof detection. In Stage I, we employ Siamese training with weight-sharing across two multi-head self-attention (MH-SA) and one multi-head cross-attention (MH-CA) branches. Model weights are learned using a contrastive loss which measures the (dis-)similarity between the self- and cross-attention representations (rSA1, rSA2, rCA12). In Stage II, an MLP classifies the learned representation as real or fake.

The overall architecture is summarized in Fig. 1 (see above). In Stage I, we adapt the SSAST architecture to a three-branch Siamese training framework, referred to from here on as SSAST-CL. It takes a pair of spectrogram inputs (x1, x2) and passes them in parallel through two self-attention branches and one novel cross-attention branch. The cross-attention branch forces the two self-attention branches to learn the discriminative features between the two inputs. Given an input pair (x1, x2), if the labels of the pair are the same, the representations from the three branches are pushed closer to each other; if the labels differ, the representations are pushed apart. This trains the model to learn discriminative representations for the real and fake classes, as sketched below.
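To make the Stage I setup concrete, here is a minimal PyTorch-style sketch of the three-branch forward pass and a pairwise contrastive loss. The module names, mean pooling, and margin-based loss form are illustrative assumptions on our part, not the exact SSAST-CL implementation.

```python
# Hypothetical sketch of the three-branch Siamese forward pass; the real
# SSAST-CL code may differ in pooling, loss formulation, and module layout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseCLHead(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int = 768, n_heads: int = 8):
        super().__init__()
        self.encoder = encoder  # shared SSAST backbone (weights shared across branches)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x1, x2):
        # Self-attention branches: the shared encoder processes each spectrogram.
        # We assume the encoder returns a (batch, tokens, dim) token sequence.
        t1 = self.encoder(x1)
        t2 = self.encoder(x2)
        r_sa1, r_sa2 = t1.mean(dim=1), t2.mean(dim=1)  # pooled representations
        # Cross-attention branch: tokens of x1 attend to tokens of x2.
        t12, _ = self.cross_attn(query=t1, key=t2, value=t2)
        r_ca12 = t12.mean(dim=1)
        return r_sa1, r_sa2, r_ca12

def pairwise_contrastive_loss(r_a, r_b, same_label, margin: float = 1.0):
    """Pull representations together for same-label pairs, push them apart otherwise.
    same_label is a float tensor of 1s (same class) and 0s (different class)."""
    d = F.pairwise_distance(r_a, r_b)
    pos = same_label * d.pow(2)
    neg = (1 - same_label) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()
```

In this sketch the loss would be applied across the branch pairs (rSA1, rSA2), (rSA1, rCA12), and (rSA2, rCA12), in line with the (dis-)similarity objective described in the Fig. 1 caption.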

In Stage II, we freeze the SSAST-CL model and train a simple MLP classifier on the representations obtained from either of the self-attention branches.
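A minimal sketch of the Stage II classifier follows, assuming a frozen Stage I backbone whose pooled representation dimension matches the MLP input; the layer sizes are illustrative, not the paper's exact configuration.

```python
# Sketch of Stage II: a small MLP on frozen Stage-I representations.
import torch
import torch.nn as nn

class SpoofClassifier(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # logits for {real, fake}
        )

    def forward(self, r_sa: torch.Tensor) -> torch.Tensor:
        return self.mlp(r_sa)

# Freeze the contrastively trained backbone before Stage II training:
# for p in siamese_model.parameters():
#     p.requires_grad = False
```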

To build a robust system, we selected on-the-fly data augmentations that help us tackle issues such as overfitting, telephony codec distortions, and speaker variability.
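As an illustration, the snippet below shows what an on-the-fly spectrogram augmentation pipeline might look like using torchaudio's SpecAugment-style masking plus additive noise. The paper's actual augmentation list (for example, codec simulation) is not reproduced here; these are common stand-in choices.

```python
# Illustrative on-the-fly spectrogram augmentations (not the paper's exact list).
import torch
import torchaudio.transforms as T

freq_mask = T.FrequencyMasking(freq_mask_param=20)  # SpecAugment-style frequency masking
time_mask = T.TimeMasking(time_mask_param=40)       # SpecAugment-style time masking

def augment(spec: torch.Tensor) -> torch.Tensor:
    """Apply random masking plus small Gaussian noise to a (freq, time) spectrogram."""
    spec = freq_mask(spec)
    spec = time_mask(spec)
    return spec + 0.01 * torch.randn_like(spec)
```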

Results

Fig 2: EER performance of SSAST-CL on the ASVspoof 2021 LA evaluation set.

Fig 2 shows the impact of our proposed architecture, quantified using the equal error rate (EER) metric. The EER is the error rate at the operating point where the false acceptance rate (FAR) and false rejection rate (FRR) of a model are equal; lower EERs indicate better performance. Simply applying the proposed augmentations to the vanilla SSAST (WCE(augs)) yields a performance improvement of 10.52%. Using the proposed SSAST-CL architecture, with the proposed augmentations on top, gives a further improvement of 4.22%.
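For reference, the EER can be computed from detection scores as sketched below using scikit-learn's ROC utilities; this is a generic implementation, not the official ASVspoof scoring toolkit.

```python
# Generic EER computation from per-utterance detection scores.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """labels: 1 = spoof (positive class); scores: higher = more likely spoof."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FAR ~= FRR
    return float((fpr[idx] + fnr[idx]) / 2)
```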

Fig 3: Examples of disentangled representations after Stage I training

Fig 3 shows the disentanglement in the feature embedding space after Stage I training. We projected the representations obtained from the SSAST-CL to 2-D using t-SNE [6]. The vanilla WCE baseline (vanilla SSAST) clearly does not produce discriminative representations for the real and fake classes. Moving to the contrastive training paradigm yields more discriminative representations, with the best separation observed when both the self- and cross-attention branches are used.
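Such a visualization can be produced along the lines of the sketch below, which assumes a matrix of pooled Stage I representations and binary labels; the t-SNE settings shown are illustrative defaults rather than the settings used for Fig 3.

```python
# Sketch of the 2-D t-SNE projection used to visualize Stage-I embeddings.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embeddings(reps: np.ndarray, labels: np.ndarray) -> None:
    """reps: (N, dim) representations from a self-attention branch; labels: 0/1."""
    emb2d = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(reps)
    plt.scatter(emb2d[:, 0], emb2d[:, 1], c=labels, cmap="coolwarm", s=5)
    plt.title("Stage-I representations (t-SNE)")
    plt.show()
```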

Summary

In this work, we set out to answer the question: can we leverage ViTs for the audio spoof detection task? We investigated our proposed SSAST-CL architecture and empirically showed that it learns more discriminative representations than the vanilla SSAST model, which in turn yields a better classifier for the deepfake detection task. The introduction of cross-attention, along with suitable augmentations, allows our system to achieve competitive performance on the ASVspoof 2021 challenge.

Read more about the paper here, in Proc. INTERSPEECH 2023

References

[1] Deng, Jia, et al. "ImageNet: A large-scale hierarchical image database." 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009. https://ieeexplore.ieee.org/document/5206848.

[2] Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017. https://ieeexplore.ieee.org/document/7952261.

[3] Yamagishi, Junichi, et al. "ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection." ASVspoof 2021 Workshop - Automatic Speaker Verification and Spoofing Countermeasures Challenge. 2021. https://arxiv.org/abs/2109.00537.

[4] Gong, Yuan, et al. "SSAST: Self-Supervised Audio Spectrogram Transformer." Proceedings of the AAAI Conference on Artificial Intelligence 36.10 (2022): 10699-10709. https://doi.org/10.1609/aaai.v36i10.21315.

[5] Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML Deep Learning Workshop. Vol. 2. No. 1. 2015.

[6]Van der Maaten, Laurens, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of machine learning research 9.11 (2008).
