VCapAV Dataset: Audio-Visual Deepfake Detection

@inproceedings{wang25q_interspeech,
  title     = {{VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset}},
  author    = {Yuxi Wang and Yikang Wang and Qishan Zhang and Hiromitsu Nishizaki and Ming Li},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {3908--3912},
  doi       = {10.21437/Interspeech.2025-1713},
  issn      = {2958-1796},
}

Sample 1: VpZ7mkQBLrY_000065

📝 Video Caption

popcorn is being cooked

🔊 Audio

Bonafide

AudioCraft

AudioLDM1

AudioLDM2

V2A-MLP

V2A-Mapper

🎥 Video

Bonafide

Spoof (Kling)

Sample 2: Gdo_f-4odEY_000054

📝 Video Caption

two women playing racquetball

🔊 Audio

Bonafide

AudioCraft

AudioLDM1

AudioLDM2

V2A-MLP

V2A-Mapper

🎥 Video

Bonafide

Spoof (Kling)

Sample 3: Y5ZGlOUUcs8_000290

📝 Video Caption

doves are sitting on a wall

🔊 Audio

Bonafide

AudioCraft

AudioLDM1

AudioLDM2

V2A-MLP

V2A-Mapper

🎥 Video

Bonafide

Spoof (Kling)