arXiv (Cornell University)
CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs
July 2025 • Jiaming Zhang, Rui Hu, Wei Yang Bryan Lim
Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored due to unique challenges: complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints. We present CAVALRY-V (Cross-modal Language-Vision Adversarial Yielding for Videos), a novel framework that directly targets the critical interface between visual perception and language gener…