Towards Attention-based Contrastive Learning for Audio Spoof Detection

Published on ISCA Archive

Summary

Vision transformers (ViTs) have made substantial progress on classification tasks in computer vision. Recently, Gong et al. '21 introduced attention-based modeling for several audio tasks. However, the use of a ViT for the audio spoof detection task remains relatively unexplored. We bridge this gap and introduce ViTs for this task. A vanilla baseline built by fine-tuning the SSAST (Gong et al. '22) audio ViT model achieves sub-optimal equal error rates (EERs). To improve performance, we propose a novel attention-based contrastive learning framework (SSAST-CL) that uses cross-attention to aid representation learning. Experiments show that our framework successfully disentangles the bonafide and spoof classes and helps learn better classifiers for the task. With an appropriate data augmentation policy, a model trained with our framework achieves competitive performance on the ASVspoof 2021 challenge. We provide comparisons and ablation studies to justify our claims.
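
For readers unfamiliar with the equal error rate (EER) metric used in the summary, the following is a minimal sketch of how it can be computed from detector scores. It assumes that higher scores mean "more likely bonafide" and that labels are 1 for bonafide and 0 for spoof. The function name `equal_error_rate` and the threshold sweep are illustrative choices, not the official ASVspoof scoring tool or the authors' evaluation code.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold.

    Assumptions (illustrative, not taken from the paper):
    - higher score = more likely bonafide
    - labels: 1 = bonafide, 0 = spoof
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    bonafide = scores[labels == 1]
    spoof = scores[labels == 0]

    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = np.concatenate([np.unique(scores), [np.inf]])

    # False acceptance rate: spoof samples scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # False rejection rate: bonafide samples scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])

    # The EER is where the two error rates cross; take the closest point.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


if __name__ == "__main__":
    # Toy scores in which one spoof sample outranks one bonafide sample.
    toy_scores = [0.9, 0.4, 0.6, 0.1]
    toy_labels = [1, 1, 0, 0]
    print(equal_error_rate(toy_scores, toy_labels))
```

A lower EER means the bonafide and spoof score distributions overlap less, which is the sense in which a better-separated representation yields a better classifier.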
