VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2507.13348
Recent advances in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, so visual token sequences are often far longer than the accompanying text. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens: at only 1/4 of the original resolution, models remain accurate on most general VQA tasks, and performance drops significantly only on a small subset of OCR-related tasks. We therefore propose to process different samples at different resolutions dynamically, and present a new paradigm for visual token compression, namely VisionThink. It starts from a downsampled image and decides whether that is sufficient to solve the problem; if not, the model outputs a special token to request the higher-resolution image. Unlike existing efficient-VLM methods that compress tokens with fixed pruning ratios or thresholds, VisionThink decides autonomously, case by case, whether to compress. As a result, it retains strong fine-grained visual understanding on OCR-related tasks while saving substantial visual tokens on simpler ones. We adopt reinforcement learning and propose an LLM-as-Judge strategy to apply RL successfully to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism that yield a stable and reasonable image-resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
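The pipeline described in the abstract (answer from a downsampled image, escalate to full resolution only when the model asks for it, and shape the RL signal with an LLM-as-Judge verdict plus a penalty on resize calls) can be illustrated with a minimal sketch. Everything below, including the `<resize>` token string, the `downsample` stub, the `vlm_generate` callback, and the `toy_reward` values, is a hypothetical illustration rather than the authors' implementation; the real code lives in the linked repository.

```python
# Minimal sketch (not the authors' code) of VisionThink-style
# dynamic-resolution inference and a toy reward for RL training.

from typing import Callable, Tuple

RESIZE_TOKEN = "<resize>"  # hypothetical token meaning "I need the full-resolution image"


def downsample(image, factor: int = 2):
    """Placeholder for resizing to 1/factor of each side length,
    i.e. roughly 1/(factor**2) of the visual tokens."""
    return image  # real code would actually resize the image


def answer_with_dynamic_resolution(
    image,
    question: str,
    vlm_generate: Callable[[object, str], str],
) -> Tuple[str, bool]:
    """First try the downsampled image; escalate to the original
    image only if the model asks for it by emitting RESIZE_TOKEN."""
    low_res = downsample(image, factor=2)
    first_pass = vlm_generate(low_res, question)

    if RESIZE_TOKEN in first_pass:
        # The model judged the low-resolution input insufficient
        # (typical for OCR-heavy questions); rerun with the full image.
        return vlm_generate(image, question), True

    # Low resolution was enough: most general VQA cases end here,
    # saving the visual tokens a high-resolution pass would cost.
    return first_pass, False


def toy_reward(answer_correct: bool, used_high_res: bool, penalty: float = 0.1) -> float:
    """Toy shaping of the RL signal: correctness (as judged by an external
    LLM, the "LLM-as-Judge") minus a small penalty when the model escalated
    to full resolution. The paper's actual reward and penalty design is more
    involved; the 0.1 value here is arbitrary."""
    return (1.0 if answer_correct else 0.0) - (penalty if used_high_res else 0.0)


if __name__ == "__main__":
    # Stub VLM that always answers from the low-resolution image.
    stub = lambda img, q: "The sign says 'EXIT'."
    print(answer_with_dynamic_resolution("image.png", "What does the sign say?", stub))
```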
Related Topics
- Multimodal Machine Learning Applications (subfield: Computer Vision and Pattern Recognition)
Details
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2507.13348
- PDF: https://arxiv.org/pdf/2507.13348
- OA status: green
- OpenAlex ID: https://openalex.org/W4415310642
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415310642 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2507.13348 (Digital Object Identifier)
- Title: VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025
- Publication date: 2025-07-17
- Authors: Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia (in order)
- Landing page: https://arxiv.org/abs/2507.13348 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2507.13348 (direct link to full-text PDF)
- Open access: Yes
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2507.13348
- Cited by: 0 (total citation count in OpenAlex)
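For reference, the record summarized above can be retrieved from the public OpenAlex API. A minimal sketch, assuming the third-party `requests` package is installed; the field names match the payload table below.

```python
# Fetch this work's OpenAlex record and print a few of the fields
# summarized above. Requires the third-party `requests` package.
import requests

WORK_ID = "W4415310642"
resp = requests.get(f"https://api.openalex.org/works/{WORK_ID}", timeout=30)
resp.raise_for_status()
work = resp.json()

print(work["display_name"])                    # title
print(work["publication_date"])                # 2025-07-17
print(work["open_access"]["oa_url"])           # direct OA link
print([a["author"]["display_name"] for a in work["authorships"]])  # author list
```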
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4415310642 |
| doi | https://doi.org/10.48550/arxiv.2507.13348 |
| ids.doi | https://doi.org/10.48550/arxiv.2507.13348 |
| ids.openalex | https://openalex.org/W4415310642 |
| fwci | |
| type | preprint |
| title | VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.972599983215332 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2507.13348 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2507.13348 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2507.13348 |
| locations[1].id | doi:10.48550/arxiv.2507.13348 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2507.13348 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5066489766 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Senqiao Yang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Yang, Senqiao |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100363212 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-8267-9939 |
| authorships[1].author.display_name | Junyi Li |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Li, Junyi |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5010224302 |
| authorships[2].author.orcid | https://orcid.org/0000-0003-4913-5822 |
| authorships[2].author.display_name | Xin Lai |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Lai, Xin |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5051340429 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-6406-4810 |
| authorships[3].author.display_name | Bei Yu |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Yu, Bei |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5078109015 |
| authorships[4].author.orcid | https://orcid.org/0000-0001-8277-2706 |
| authorships[4].author.display_name | Hengshuang Zhao |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zhao, Hengshuang |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5052856441 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-1246-553X |
| authorships[5].author.display_name | Jiaya Jia |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Jia, Jiaya |
| authorships[5].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2507.13348 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-18T00:00:00 |
| display_name | VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.972599983215332 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2507.13348 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2507.13348 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2507.13348 |
| primary_location.id | pmh:oai:arXiv.org:2507.13348 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2507.13348 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2507.13348 |
| publication_date | 2025-07-17 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index | (inverted index of the abstract; the full abstract appears above, and a reconstruction sketch follows the table) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile | |
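OpenAlex stores the abstract as an `abstract_inverted_index`, a map from each word to the positions at which it occurs (the format summarized in the table row above). A minimal sketch of turning such an index back into plain text; the helper name is ours, not part of OpenAlex.

```python
# Rebuild plain text from an OpenAlex-style abstract_inverted_index,
# which maps each word to the list of positions where it appears.
from typing import Dict, List


def invert_abstract(index: Dict[str, List[int]]) -> str:
    positions = {}
    for word, spots in index.items():
        for pos in spots:
            positions[pos] = word
    return " ".join(positions[i] for i in sorted(positions))


# Tiny example in the same format as the payload field above.
example = {"Recent": [0], "advancements": [1], "in": [2], "vision-language": [3], "models": [4]}
print(invert_abstract(example))  # -> Recent advancements in vision-language models
```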