Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2509.05751
Referring Video Object Segmentation (RVOS) aims to segment an object of interest throughout a video based on a language description. The prominent challenge lies in aligning static text with dynamic visual content, particularly when objects exhibit similar appearances but inconsistent motion and poses. However, current methods often rely on a holistic visual-language fusion that struggles with complex, compositional descriptions. In this paper, we propose PARSE-VOS, a novel, training-free framework powered by Large Language Models (LLMs) for hierarchical, coarse-to-fine reasoning across text and video domains. Our approach begins by parsing the natural language query into structured semantic commands. Next, we introduce a spatio-temporal grounding module that generates candidate trajectories for all potential target objects, guided by the parsed semantics. Finally, a hierarchical identification module selects the correct target through a two-stage reasoning process: it first performs coarse-grained motion reasoning with an LLM to narrow down candidates; if ambiguity remains, a fine-grained pose verification stage is conditionally triggered to disambiguate. The final output is an accurate segmentation mask for the target object. PARSE-VOS achieves state-of-the-art performance on three major benchmarks: Ref-YouTube-VOS, Ref-DAVIS17, and MeViS.
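The two-stage identification described in the abstract can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the `ParsedQuery` and `Trajectory` fields, the function names, the empty-shortlist fallback, and the keyword-matching stand-ins for the LLM motion reasoning and the pose verification are all hypothetical, chosen only to show how the fine-grained stage runs conditionally when the coarse stage leaves more than one candidate.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ParsedQuery:
    """Structured semantic commands parsed from the query (hypothetical fields)."""
    target: str
    motion: str
    pose: str


@dataclass
class Trajectory:
    """One candidate object track (hypothetical fields)."""
    track_id: int
    motion_summary: str
    pose_summary: str


MotionFilter = Callable[[ParsedQuery, Sequence[Trajectory]], List[Trajectory]]
PoseVerifier = Callable[[ParsedQuery, Sequence[Trajectory]], Trajectory]


def identify_target(
    query: ParsedQuery,
    candidates: Sequence[Trajectory],
    motion_filter: MotionFilter,
    pose_verifier: PoseVerifier,
) -> Trajectory:
    """Coarse motion reasoning first; pose verification only if still ambiguous."""
    if not candidates:
        raise ValueError("no candidate trajectories to choose from")
    shortlist = motion_filter(query, candidates)
    if not shortlist:
        # Assumption of this sketch: if the coarse stage rejects everything,
        # fall back to the full candidate set rather than failing.
        shortlist = list(candidates)
    if len(shortlist) == 1:
        return shortlist[0]  # unambiguous, so the fine-grained stage is skipped
    return pose_verifier(query, shortlist)


def keyword_motion_filter(
    query: ParsedQuery, candidates: Sequence[Trajectory]
) -> List[Trajectory]:
    """Toy stand-in for LLM motion reasoning: substring match on the motion cue."""
    return [c for c in candidates if query.motion in c.motion_summary]


def keyword_pose_verifier(
    query: ParsedQuery, candidates: Sequence[Trajectory]
) -> Trajectory:
    """Toy stand-in for pose verification: first candidate matching the pose cue."""
    for c in candidates:
        if query.pose in c.pose_summary:
            return c
    return candidates[0]


query = ParsedQuery(target="dog", motion="running", pose="head lowered")
tracks = [
    Trajectory(0, "running to the left", "head raised"),
    Trajectory(1, "running to the left", "head lowered"),
    Trajectory(2, "standing still", "head lowered"),
]
chosen = identify_target(query, tracks, keyword_motion_filter, keyword_pose_verifier)
print(chosen.track_id)
```

In this toy example two tracks share the same motion, so the motion stage alone cannot decide and the pose stage is triggered. Passing the two stages in as callables keeps the conditional logic separate from whatever models would implement them in practice.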
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2509.05751, https://arxiv.org/pdf/2509.05751
- OA Status: green
- OpenAlex ID: https://openalex.org/W4415059664
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415059664 (Canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2509.05751 (Digital Object Identifier)
- Title: Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation (Work title)
- Type: preprint (OpenAlex work type)
- Language: en (Primary language)
- Publication year: 2025 (Year of publication)
- Publication date: 2025-09-06 (Full publication date if available)
- Authors: Bingrui Zhao, Lin Wu, Xing-Lang Fan, Deyin Liu, Lu Zhang, Ruyi He, Jialie Shen, Ximing Li (List of authors in order)
- Landing page: https://arxiv.org/abs/2509.05751 (Publisher landing page)
- PDF URL: https://arxiv.org/pdf/2509.05751 (Direct link to full text PDF)
- Open access: Yes (Whether a free full text is available)
- OA status: green (Open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2509.05751 (Direct OA link when available)
- Cited by: 0 (Total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W4415059664 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2509.05751 |
| ids.doi | https://doi.org/10.48550/arxiv.2509.05751 |
| ids.openalex | https://openalex.org/W4415059664 |
| fwci | |
| type | preprint |
| title | Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9660000205039978 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T10181 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9623000025749207 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Natural Language Processing Techniques |
| topics[2].id | https://openalex.org/T10028 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9495000243186951 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Topic Modeling |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2509.05751 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2509.05751 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2509.05751 |
| locations[1].id | doi:10.48550/arxiv.2509.05751 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | public-domain |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/public-domain |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2509.05751 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5033721709 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-3731-5204 |
| authorships[0].author.display_name | Bingrui Zhao |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhao, Bingrui |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5083453098 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-6119-058X |
| authorships[1].author.display_name | Lin Wu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Wu, Lin Yuanbo |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5040138093 |
| authorships[2].author.orcid | https://orcid.org/0009-0005-8259-1573 |
| authorships[2].author.display_name | Xing-Lang Fan |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Fan, Xiangtian |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5086497042 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-0371-9921 |
| authorships[3].author.display_name | Deyin Liu |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Liu, Deyin |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5035359666 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-8859-5453 |
| authorships[4].author.display_name | Lu Zhang |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zhang, Lu |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5102607482 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Ruyi He |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | He, Ruyi |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5026951130 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-4560-8509 |
| authorships[6].author.display_name | Jialie Shen |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Shen, Jialie |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5100751177 |
| authorships[7].author.orcid | https://orcid.org/0000-0003-4022-1273 |
| authorships[7].author.display_name | Ximing Li |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Li, Ximing |
| authorships[7].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2509.05751 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-11T00:00:00 |
| display_name | Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9660000205039978 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2509.05751 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2509.05751 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2509.05751 |
| primary_location.id | pmh:oai:arXiv.org:2509.05751 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2509.05751 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2509.05751 |
| publication_date | 2025-09-06 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 13, 17, 49, 65, 76, 101, 121, 130, 150 |
| abstract_inverted_index.In | 59 |
| abstract_inverted_index.an | 8, 141, 164 |
| abstract_inverted_index.by | 70, 88, 116 |
| abstract_inverted_index.if | 147 |
| abstract_inverted_index.in | 24 |
| abstract_inverted_index.is | 155, 163 |
| abstract_inverted_index.it | 134 |
| abstract_inverted_index.of | 10 |
| abstract_inverted_index.on | 16, 48, 176 |
| abstract_inverted_index.to | 6, 143, 158 |
| abstract_inverted_index.we | 62, 99 |
| abstract_inverted_index.LLM | 142 |
| abstract_inverted_index.Our | 85 |
| abstract_inverted_index.The | 20, 160 |
| abstract_inverted_index.all | 107, 111 |
| abstract_inverted_index.and | 41, 82, 182 |
| abstract_inverted_index.for | 75, 110, 168 |
| abstract_inverted_index.the | 90, 117, 126, 169 |
| abstract_inverted_index.aims | 5 |
| abstract_inverted_index.down | 145 |
| abstract_inverted_index.into | 94 |
| abstract_inverted_index.lies | 23 |
| abstract_inverted_index.mask | 167 |
| abstract_inverted_index.pose | 152 |
| abstract_inverted_index.rely | 47 |
| abstract_inverted_index.text | 27, 81 |
| abstract_inverted_index.that | 53, 105 |
| abstract_inverted_index.this | 60 |
| abstract_inverted_index.when | 33 |
| abstract_inverted_index.with | 28, 38, 55, 140 |
| abstract_inverted_index.Large | 71 |
| abstract_inverted_index.Next, | 98 |
| abstract_inverted_index.Video | 1 |
| abstract_inverted_index.based | 15 |
| abstract_inverted_index.final | 161 |
| abstract_inverted_index.first | 135 |
| abstract_inverted_index.major | 178 |
| abstract_inverted_index.often | 46 |
| abstract_inverted_index.query | 93 |
| abstract_inverted_index.stage | 154 |
| abstract_inverted_index.three | 177 |
| abstract_inverted_index.video | 14, 83 |
| abstract_inverted_index.(RVOS) | 4 |
| abstract_inverted_index.MeViS. | 183 |
| abstract_inverted_index.Models | 73 |
| abstract_inverted_index.Object | 2 |
| abstract_inverted_index.across | 80 |
| abstract_inverted_index.begins | 87 |
| abstract_inverted_index.fusion | 52 |
| abstract_inverted_index.guided | 115 |
| abstract_inverted_index.module | 104, 124 |
| abstract_inverted_index.motion | 40, 138 |
| abstract_inverted_index.narrow | 144 |
| abstract_inverted_index.novel, | 66 |
| abstract_inverted_index.object | 9 |
| abstract_inverted_index.output | 162 |
| abstract_inverted_index.paper, | 61 |
| abstract_inverted_index.parsed | 118 |
| abstract_inverted_index.poses. | 42 |
| abstract_inverted_index.select | 125 |
| abstract_inverted_index.static | 26 |
| abstract_inverted_index.target | 113, 128, 170 |
| abstract_inverted_index.visual | 30 |
| abstract_inverted_index.(LLMs), | 74 |
| abstract_inverted_index.correct | 127 |
| abstract_inverted_index.current | 44 |
| abstract_inverted_index.dynamic | 29 |
| abstract_inverted_index.methods | 45 |
| abstract_inverted_index.natural | 91 |
| abstract_inverted_index.object. | 171 |
| abstract_inverted_index.objects | 34 |
| abstract_inverted_index.parsing | 89 |
| abstract_inverted_index.powered | 69 |
| abstract_inverted_index.propose | 63 |
| abstract_inverted_index.segment | 7 |
| abstract_inverted_index.similar | 36 |
| abstract_inverted_index.through | 129 |
| abstract_inverted_index.Finally, | 120 |
| abstract_inverted_index.However, | 43 |
| abstract_inverted_index.Language | 72 |
| abstract_inverted_index.accurate | 165 |
| abstract_inverted_index.achieved | 173 |
| abstract_inverted_index.aligning | 25 |
| abstract_inverted_index.approach | 86 |
| abstract_inverted_index.complex, | 56 |
| abstract_inverted_index.content, | 31 |
| abstract_inverted_index.domains. | 84 |
| abstract_inverted_index.holistic | 50 |
| abstract_inverted_index.interest | 11 |
| abstract_inverted_index.language | 18, 92 |
| abstract_inverted_index.objects, | 114 |
| abstract_inverted_index.performs | 136 |
| abstract_inverted_index.process: | 133 |
| abstract_inverted_index.remains, | 149 |
| abstract_inverted_index.semantic | 96 |
| abstract_inverted_index.Referring | 0 |
| abstract_inverted_index.ambiguity | 148 |
| abstract_inverted_index.candidate | 108 |
| abstract_inverted_index.challenge | 22 |
| abstract_inverted_index.commands. | 97 |
| abstract_inverted_index.framework | 68 |
| abstract_inverted_index.generates | 106 |
| abstract_inverted_index.grounding | 103 |
| abstract_inverted_index.introduce | 100 |
| abstract_inverted_index.potential | 112 |
| abstract_inverted_index.prominent | 21 |
| abstract_inverted_index.reasoning | 79, 132, 139 |
| abstract_inverted_index.struggles | 54 |
| abstract_inverted_index.triggered | 157 |
| abstract_inverted_index.two-stage | 131 |
| abstract_inverted_index.exhibiting | 35 |
| abstract_inverted_index.semantics. | 119 |
| abstract_inverted_index.structured | 95 |
| abstract_inverted_index.throughout | 12 |
| abstract_inverted_index.appearances | 37 |
| abstract_inverted_index.benchmarks: | 179 |
| abstract_inverted_index.candidates; | 146 |
| abstract_inverted_index.performance | 175 |
| abstract_inverted_index.Ref-DAVIS17, | 181 |
| abstract_inverted_index.Segmentation | 3 |
| abstract_inverted_index.description. | 19 |
| abstract_inverted_index.fine-grained | 151 |
| abstract_inverted_index.hierarchical | 122 |
| abstract_inverted_index.inconsistent | 39 |
| abstract_inverted_index.particularly | 32 |
| abstract_inverted_index.segmentation | 166 |
| abstract_inverted_index.trajectories | 109 |
| abstract_inverted_index.verification | 153 |
| abstract_inverted_index.compositional | 57 |
| abstract_inverted_index.conditionally | 156 |
| abstract_inverted_index.descriptions. | 58 |
| abstract_inverted_index.disambiguate. | 159 |
| abstract_inverted_index.hierarchical, | 77 |
| abstract_inverted_index.training-free | 67 |
| abstract_inverted_index.coarse-grained | 137 |
| abstract_inverted_index.coarse-to-fine | 78 |
| abstract_inverted_index.identification | 123 |
| abstract_inverted_index.spatio-temporal | 102 |
| abstract_inverted_index.visual-language | 51 |
| abstract_inverted_index.Ref-YouTube-VOS, | 180 |
| abstract_inverted_index.state-of-the-art | 174 |
| abstract_inverted_index.\textbf{PARSE-VOS} | 172 |
| abstract_inverted_index.\textbf{PARSE-VOS}, | 64 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| citation_normalized_percentile |
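
The `abstract_inverted_index.*` rows in the payload above map each token of the abstract to the word positions where it occurs. A minimal sketch of turning such an index back into running text follows. The function name `rebuild_abstract` and the `excerpt` variable are illustrative, and the excerpt is restricted to positions 0 through 9 of the rows above, so tokens that also occur later (such as `to` and `an`) list only their first position here.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Place each token at every position listed for it, then join in order."""
    positions: Dict[int, str] = {}
    for token, indexes in inverted_index.items():
        for i in indexes:
            positions[i] = token
    return " ".join(positions[i] for i in sorted(positions))


# Excerpt of the payload rows, limited to word positions 0-9.
excerpt = {
    "Referring": [0],
    "Video": [1],
    "Object": [2],
    "Segmentation": [3],
    "(RVOS)": [4],
    "aims": [5],
    "to": [6],
    "segment": [7],
    "an": [8],
    "object": [9],
}

print(rebuild_abstract(excerpt))
# prints: Referring Video Object Segmentation (RVOS) aims to segment an object
```

Applied to the full index, the same function would reproduce the abstract as stored in the record, including the literal `\textbf{PARSE-VOS}` tokens.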