CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2507.00817
Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored due to unique challenges: complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints. We present CAVALRY-V (Cross-modal Language-Vision Adversarial Yielding for Videos), a novel framework that directly targets the critical interface between visual perception and language generation in V-MLLMs. Our approach introduces two key innovations: (1) a dual-objective semantic-visual loss function that simultaneously disrupts the model's text generation logits and visual representations to undermine cross-modal integration, and (2) a computationally efficient two-stage generator framework that combines large-scale pre-training for cross-model transferability with specialized fine-tuning for spatiotemporal coherence. Empirical evaluation on comprehensive video understanding benchmarks demonstrates that CAVALRY-V significantly outperforms existing attack methods, achieving 22.8% average improvement over the best baseline attacks on both commercial systems (GPT-4.1, Gemini 2.0) and open-source models (QwenVL-2.5, InternVL-2.5, Llava-Video, Aria, MiniCPM-o-2.6). Our framework achieves flexibility through implicit temporal coherence modeling rather than explicit regularization, enabling significant performance improvements even on image understanding (34.4% average gain). This capability demonstrates CAVALRY-V's potential as a foundational approach for adversarial research across multimodal systems.
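The abstract's first innovation, a dual-objective loss that simultaneously disrupts text-generation logits and visual representations, can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the additive combination, the weighting factor `lam`, and the use of cosine similarity for the visual term are all assumptions made here for clarity.

```python
import math

def cross_entropy(logits, target_idx):
    # Softmax cross-entropy of a single token's logits against the correct token.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[target_idx] / sum(exps))

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dual_objective_loss(adv_logits, target_idx, clean_feats, adv_feats, lam=1.0):
    # Semantic term: reward pushing the model's next-token logits away from
    # the correct token (i.e. maximize its cross-entropy).
    semantic = -cross_entropy(adv_logits, target_idx)
    # Visual term: reward pushing perturbed visual features away from the
    # clean ones (i.e. minimize their cosine similarity).
    visual = cosine_sim(clean_feats, adv_feats)
    # An attacker would minimize this combined loss w.r.t. the perturbation;
    # `lam` (hypothetical here) trades off the two objectives.
    return semantic + lam * visual
```

In this sketch, identical clean and adversarial features yield a higher (worse, for the attacker) loss than divergent ones, which is the intended direction of the visual term.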
Metadata
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2507.00817
- PDF: https://arxiv.org/pdf/2507.00817
- OA Status: green
- OpenAlex ID: https://openalex.org/W4416888608
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4416888608 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2507.00817 (Digital Object Identifier)
- Title: CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-07-01 (full publication date if available)
- Authors: Jiaming Zhang, Rui Hu, Wei Yang Bryan Lim (authors in order)
- Landing page: https://arxiv.org/abs/2507.00817 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2507.00817 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2507.00817 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W4416888608 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2507.00817 |
| ids.doi | https://doi.org/10.48550/arxiv.2507.00817 |
| ids.openalex | https://openalex.org/W4416888608 |
| fwci | |
| type | preprint |
| title | CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2507.00817 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2507.00817 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2507.00817 |
| locations[1].id | doi:10.48550/arxiv.2507.00817 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2507.00817 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5100453777 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-7611-3867 |
| authorships[0].author.display_name | Jiaming Zhang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhang, Jiaming |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5016882456 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-9284-0637 |
| authorships[1].author.display_name | Rui Hu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Hu, Rui |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5027969322 |
| authorships[2].author.orcid | https://orcid.org/0000-0003-2150-5561 |
| authorships[2].author.display_name | Wei Yang Bryan Lim |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Lim, Wei Yang Bryan |
| authorships[2].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2507.00817 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-12-02T09:50:10.106793 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2507.00817 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2507.00817 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2507.00817 |
| primary_location.id | pmh:oai:arXiv.org:2507.00817 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2507.00817 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2507.00817 |
| publication_date | 2025-07-01 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index | (word-to-positions index encoding the abstract; elided here, full abstract text given above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile |
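OpenAlex stores abstracts as an inverted index rather than plain text: each word maps to the list of zero-based positions where it occurs. The plain abstract can be rebuilt by sorting all (position, word) pairs. A minimal sketch (the `toy` dictionary below is a fabricated example in the same shape as the payload's `abstract_inverted_index` field):

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain text from an OpenAlex abstract_inverted_index,
    a dict mapping each word to its list of positions in the abstract."""
    pairs = [(pos, word)
             for word, positions in inverted_index.items()
             for pos in positions]
    return " ".join(word for _, word in sorted(pairs))

# Toy index mirroring the first words of this work's abstract
toy = {"Video": [0], "Multimodal": [1], "Large": [2],
       "Language": [3], "Models": [4]}
print(reconstruct_abstract(toy))  # Video Multimodal Large Language Models
```

Words that occur more than once (e.g. "and" at positions 13, 34, 58, ... in this payload) simply contribute multiple pairs, so the same function handles them without special casing.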