OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2503.09416
The video visual relation detection (VidVRD) task is to identify objects and their relationships in videos, which is challenging due to the dynamic content, high annotation costs, and long-tailed distribution of relations. Visual language models (VLMs) help explore open-vocabulary visual relation detection tasks, yet often overlook the connections between various visual regions and their relations. Moreover, using VLMs to directly identify visual relations in videos poses significant challenges because of the large disparity between images and videos. Therefore, we propose a novel open-vocabulary VidVRD framework, termed OpenVidVRD, which transfers VLMs' rich knowledge and powerful capabilities to improve VidVRD tasks through prompt learning. Specifically, we use a VLM to extract text representations from automatically generated region captions based on the video's regions. Next, we develop a spatiotemporal refiner module to derive object-level relationship representations in the video by integrating cross-modal spatiotemporal complementary information. Furthermore, a prompt-driven strategy to align semantic spaces is employed to harness the semantic understanding of VLMs, enhancing the overall generalization ability of OpenVidVRD. Extensive experiments conducted on the VidVRD and VidOR public datasets show that the proposed model outperforms existing methods.
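The prompt-driven semantic space alignment described in the abstract amounts to a CLIP-style comparison between visual relationship features and text embeddings of relation prompts. Below is a minimal illustrative sketch of that scoring step only, not the authors' code: the function name, tensor shapes, and temperature value are assumptions, and the random tensors stand in for real encoder outputs.

```python
# Minimal sketch of prompt-driven relation scoring (illustrative, not the
# OpenVidVRD implementation). Open-vocabulary relation labels are scored by
# cosine similarity between fused subject-object features and text embeddings
# of relation prompts, as in CLIP-style zero-shot classification.
import torch
import torch.nn.functional as F


def relation_scores(pair_features: torch.Tensor,
                    prompt_embeddings: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """pair_features: (N, D) fused subject-object features from the video.
    prompt_embeddings: (C, D) text embeddings of C relation prompts,
    e.g. "a video of a person riding a bicycle"."""
    v = F.normalize(pair_features, dim=-1)       # unit-normalize visual side
    t = F.normalize(prompt_embeddings, dim=-1)   # unit-normalize text side
    logits = v @ t.T / temperature               # scaled cosine similarity
    return logits.softmax(dim=-1)                # distribution over relation labels


# Random stand-ins for real visual/text encoder outputs (e.g. CLIP features).
scores = relation_scores(torch.randn(4, 512), torch.randn(10, 512))
print(scores.shape)  # torch.Size([4, 10]): per-pair scores over 10 relations
```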
Overview
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2503.09416
- PDF: https://arxiv.org/pdf/2503.09416
- OA status: green
- OpenAlex ID: https://openalex.org/W4415102525
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415102525 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2503.09416 (Digital Object Identifier)
- Title: OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025
- Publication date: 2025-03-12
- Authors: Qian Qian, Weiying Xue, Yuxiao Wang, Zhenao Wei (in listed order)
- Landing page: https://arxiv.org/abs/2503.09416 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2503.09416 (direct link to the full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2503.09416 (direct OA link)
- Cited by: 0 (total citation count in OpenAlex)
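The fields above come straight from the OpenAlex works endpoint, so the same record can be retrieved programmatically. A minimal sketch using the public REST API (assuming the requests package is installed; the field names match the payload shown below):

```python
# Fetch this work's record from the OpenAlex REST API (https://api.openalex.org).
import requests

work = requests.get(
    "https://api.openalex.org/works/W4415102525",
    timeout=30,
).json()

print(work["display_name"])                   # work title
print(work["doi"])                            # DOI URL
print(work["open_access"]["oa_status"])       # expected: "green"
# best_oa_location can be None for some works; it is present for this record.
print(work["best_oa_location"]["pdf_url"])    # direct link to the arXiv PDF
```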
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4415102525 |
| doi | https://doi.org/10.48550/arxiv.2503.09416 |
| ids.doi | https://doi.org/10.48550/arxiv.2503.09416 |
| ids.openalex | https://openalex.org/W4415102525 |
| fwci | |
| type | preprint |
| title | OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9997000098228455 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T10812 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9965000152587891 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Human Pose and Action Recognition |
| topics[2].id | https://openalex.org/T11307 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9857000112533569 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Domain Adaptation and Few-Shot Learning |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2503.09416 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2503.09416 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2503.09416 |
| locations[1].id | doi:10.48550/arxiv.2503.09416 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2503.09416 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5027539825 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-8781-3279 |
| authorships[0].author.display_name | Qian Qian |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Liu, Qi |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100312434 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Weiying Xue |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Xue, Weiying |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5100721975 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-6640-6076 |
| authorships[2].author.display_name | Yuxiao Wang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Wang, Yuxiao |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5054252180 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-6541-1287 |
| authorships[3].author.display_name | Zhenao Wei |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Wei, Zhenao |
| authorships[3].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2503.09416 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-13T00:00:00 |
| display_name | OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9997000098228455 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2503.09416 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2503.09416 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2503.09416 |
| primary_location.id | pmh:oai:arXiv.org:2503.09416 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2503.09416 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2503.09416 |
| publication_date | 2025-03-12 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index | (word-position inverted index of the abstract; the plain-text abstract appears at the top of this page, and a reconstruction sketch follows this table) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| citation_normalized_percentile |
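OpenAlex stores the abstract only as the abstract_inverted_index field collapsed in the table above: each word maps to the list of positions at which it occurs. A minimal sketch of rebuilding the plain text from that structure (the dictionary below is only a small excerpt of the real index, not the full data):

```python
# Reconstruct a plain-text abstract from an OpenAlex abstract_inverted_index.
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    # Invert the mapping: position -> word, then join words in position order.
    positions: dict[int, str] = {}
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = word
    return " ".join(positions[i] for i in sorted(positions))


# Excerpt of the index for this work (first five positions only).
excerpt = {"The": [0], "video": [1], "visual": [2], "relation": [3], "detection": [4]}
print(rebuild_abstract(excerpt))  # "The video visual relation detection"
```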