Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2510.02155
Abstract
Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g., violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces toward anomalies and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.
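The abstract outlines the core mechanism: prompts are organized into semantically coherent anomaly groups, and each group carries fine-grained guiding questions that a frozen VLM answers for a video clip. The sketch below is a minimal, hypothetical illustration of that structure; the group names follow the abstract, but the question wording, the `query_vlm` function, and the max-pooling of answers into a score are assumptions, not the authors' released code.

```python
# Minimal sketch of ASK-Hint-style fine-grained prompting (hypothetical,
# not the authors' implementation). A frozen VLM is queried with
# group-specific guiding questions and the answers are pooled into an
# anomaly score per video clip.

# Semantically coherent prompt groups with fine-grained guiding questions.
PROMPT_GROUPS = {
    "violence": [
        "Is any person striking, kicking, or grappling with another person?",
        "Is a weapon visible in anyone's hands?",
    ],
    "property_crimes": [
        "Is someone forcing open a door, window, or vehicle?",
        "Is an object being taken while its owner is absent or restrained?",
    ],
    "public_safety": [
        "Is there fire, smoke, or an explosion in the scene?",
        "Is a vehicle moving dangerously close to pedestrians?",
    ],
}

def query_vlm(frames, question: str) -> float:
    """Placeholder for a frozen VLM call; should return the probability
    that the answer to `question` is 'yes' for the given frames."""
    raise NotImplementedError("plug in your VLM backbone here")

def anomaly_score(frames) -> tuple[float, dict]:
    """Score a clip by asking every guiding question and keeping, per group,
    the strongest positive answer; the clip score is the max over groups."""
    group_scores = {}
    for group, questions in PROMPT_GROUPS.items():
        group_scores[group] = max(query_vlm(frames, q) for q in questions)
    return max(group_scores.values()), group_scores
```

Clip- or frame-level scores from a loop like this can then be ranked against ground-truth labels to compute AUC, which is the metric the abstract reports on UCF-Crime and XD-Violence.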
Related Topics
- Anomaly Detection Techniques and Applications
- Digital Media Forensic Detection
- Network Security and Intrusion Detection
Details
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2510.02155
- PDF: https://arxiv.org/pdf/2510.02155
- OA status: green
- OpenAlex ID: https://openalex.org/W4414821035
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4414821035 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2510.02155 (Digital Object Identifier)
- Title: Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025
- Publication date: 2025-10-02
- Authors: Shu Zou, Xinyu Tian, Lukas Wesemann, Fabian Waschkowski, Zhaoyuan Yang, Jing Zhang (in order)
- Landing page: https://arxiv.org/abs/2510.02155
- PDF URL: https://arxiv.org/pdf/2510.02155 (direct link to the full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2510.02155
- Cited by: 0 (total citation count in OpenAlex)
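The fields above are a human-readable summary of the record; the same JSON can be fetched directly from the public OpenAlex API. A minimal sketch, assuming only the `https://api.openalex.org/works/{id}` endpoint and the `requests` library; the printed keys mirror the payload table below:

```python
import requests

# Fetch the raw OpenAlex record for this work (public API, no key required).
OPENALEX_ID = "W4414821035"
resp = requests.get(f"https://api.openalex.org/works/{OPENALEX_ID}", timeout=30)
resp.raise_for_status()
work = resp.json()

print(work["title"])                      # work title
print(work["doi"])                        # DOI URL
print(work["open_access"]["oa_url"])      # direct open-access link
print(work["cited_by_count"])             # citation count
```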
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4414821035 |
| doi | https://doi.org/10.48550/arxiv.2510.02155 |
| ids.doi | https://doi.org/10.48550/arxiv.2510.02155 |
| ids.openalex | https://openalex.org/W4414821035 |
| fwci | |
| type | preprint |
| title | Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11512 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9976999759674072 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Anomaly Detection Techniques and Applications |
| topics[1].id | https://openalex.org/T12357 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9696000218391418 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Digital Media Forensic Detection |
| topics[2].id | https://openalex.org/T10400 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9581000208854675 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1705 |
| topics[2].subfield.display_name | Computer Networks and Communications |
| topics[2].display_name | Network Security and Intrusion Detection |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2510.02155 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://arxiv.org/pdf/2510.02155 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2510.02155 |
| locations[1].id | doi:10.48550/arxiv.2510.02155 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2510.02155 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5101292475 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Shu Zou |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zou, Shu |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5029944382 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-1247-6076 |
| authorships[1].author.display_name | Xinyu Tian |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Tian, Xinyu |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5030101968 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-9142-1342 |
| authorships[2].author.display_name | Lukas Wesemann |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Wesemann, Lukas |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5089615888 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-5427-9551 |
| authorships[3].author.display_name | Fabian Waschkowski |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Waschkowski, Fabian |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5103063628 |
| authorships[4].author.orcid | https://orcid.org/0009-0007-0294-4741 |
| authorships[4].author.display_name | Zhaoyuan Yang |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Yang, Zhaoyuan |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5080322771 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-5009-7707 |
| authorships[5].author.display_name | Jing Zhang |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Zhang, Jing |
| authorships[5].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2510.02155 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11512 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9976999759674072 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Anomaly Detection Techniques and Applications |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2510.02155 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2510.02155 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2510.02155 |
| primary_location.id | pmh:oai:arXiv.org:2510.02155 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://arxiv.org/pdf/2510.02155 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2510.02155 |
| publication_date | 2025-10-02 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index | word-to-positions map of the abstract (full text reproduced above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile | |
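OpenAlex stores the abstract as an `abstract_inverted_index`: a map from each token to the list of positions where it occurs (summarized as a single row in the table above). A minimal sketch of turning such an index back into plain text; the function and variable names are illustrative:

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Reconstruct an abstract from an OpenAlex abstract_inverted_index,
    which maps each token to the positions where it appears."""
    positions = {}
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = word
    # Join tokens in positional order to recover the original text.
    return " ".join(positions[i] for i in sorted(positions))

# Tiny example in the same format as the payload field above.
example = {"Prompting": [0], "has": [1], "emerged": [2]}
print(rebuild_abstract(example))  # -> "Prompting has emerged"
```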