Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
2025 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2504.10465
Multimodal Large Language Models (MLLMs) achieve remarkable performance on fine-grained pixel-level understanding tasks. However, these works rely heavily on extra components, such as vision encoders (e.g., CLIP) and segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by recent work on the Single trAnsformer as a unified vIsion-Language Model (SAIL) design, which jointly learns vision tokens and text tokens within one transformer. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements over the plain baseline. First, we design a learnable upsampling module to refine visual token features. Second, we propose a novel visual prompt injection strategy that enables the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Third, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel-understanding benchmark (PerBench), verified by manual checking. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at https://github.com/magic-research/Sa2VA.
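To make the "early fusion" idea concrete: the abstract describes injecting visual prompt embeddings directly into the vision token sequence before the transformer runs, rather than attaching a separate prompt encoder. The sketch below is an assumption-laden illustration of that idea, not the paper's implementation: it assumes the prompt is a boolean mask over the flattened token grid and that fusion is a simple additive update of the masked tokens.

```python
def inject_visual_prompt(vision_tokens, prompt_mask, prompt_embedding):
    """Early-fuse a visual prompt into vision token features.

    vision_tokens: list of D-dim feature vectors, one per vision token.
    prompt_mask: list of bools marking tokens inside the prompt region.
    prompt_embedding: one D-dim (notionally learnable) prompt vector.

    Tokens covered by the prompt receive the prompt embedding additively,
    so prompt information is visible from the first transformer layer on.
    """
    fused = []
    for token, inside in zip(vision_tokens, prompt_mask):
        if inside:
            fused.append([t + p for t, p in zip(token, prompt_embedding)])
        else:
            fused.append(list(token))
    return fused


# Toy example: 4 tokens of dimension 3, prompt covers tokens 1 and 2.
tokens = [[0.0, 0.0, 0.0] for _ in range(4)]
mask = [False, True, True, False]
prompt = [1.0, 1.0, 1.0]
fused = inject_visual_prompt(tokens, mask, prompt)
```

Because the fused tokens enter the same transformer as the text tokens, no extra prompt encoder or cross-attention module is needed, which matches the paper's single-transformer goal.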
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2504.10465
- PDF: https://arxiv.org/pdf/2504.10465
- OA Status: green
- OpenAlex ID: https://openalex.org/W4415161148
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415161148 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2504.10465 (Digital Object Identifier)
- Title: Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025
- Publication date: 2025-04-14
- Authors (in order): Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng
- Landing page: https://arxiv.org/abs/2504.10465
- PDF URL: https://arxiv.org/pdf/2504.10465
- Open access: Yes (a free full text is available)
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2504.10465
- Cited by: 0 (total citation count in OpenAlex)
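The fields above come from the OpenAlex works endpoint, which serves each record as JSON at `https://api.openalex.org/works/{id}`. A small stdlib-only helper for retrieving this record (the helper names are illustrative, not from any library):

```python
import json
from urllib.request import urlopen

OPENALEX_API = "https://api.openalex.org/works/{}"


def openalex_work_url(work_id: str) -> str:
    """Build the API URL from either a full OpenAlex URL or a bare ID."""
    bare = work_id.rstrip("/").rsplit("/", 1)[-1]
    return OPENALEX_API.format(bare)


def fetch_work(work_id: str) -> dict:
    """Fetch the JSON record for a work (performs a live HTTP request)."""
    with urlopen(openalex_work_url(work_id)) as resp:
        return json.load(resp)


url = openalex_work_url("https://openalex.org/W4415161148")
# fetch_work("W4415161148") would return the payload shown below;
# call it only when online.
```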
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4415161148 |
| doi | https://doi.org/10.48550/arxiv.2504.10465 |
| ids.doi | https://doi.org/10.48550/arxiv.2504.10465 |
| ids.openalex | https://openalex.org/W4415161148 |
| fwci | |
| type | preprint |
| title | Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T13114 |
| topics[0].field.id | https://openalex.org/fields/22 |
| topics[0].field.display_name | Engineering |
| topics[0].score | 0.8991000056266785 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/2214 |
| topics[0].subfield.display_name | Media Technology |
| topics[0].display_name | Image Processing Techniques and Applications |
| topics[1].id | https://openalex.org/T11775 |
| topics[1].field.id | https://openalex.org/fields/27 |
| topics[1].field.display_name | Medicine |
| topics[1].score | 0.8360000252723694 |
| topics[1].domain.id | https://openalex.org/domains/4 |
| topics[1].domain.display_name | Health Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/2741 |
| topics[1].subfield.display_name | Radiology, Nuclear Medicine and Imaging |
| topics[1].display_name | COVID-19 diagnosis using AI |
| topics[2].id | https://openalex.org/T12702 |
| topics[2].field.id | https://openalex.org/fields/28 |
| topics[2].field.display_name | Neuroscience |
| topics[2].score | 0.8141999840736389 |
| topics[2].domain.id | https://openalex.org/domains/1 |
| topics[2].domain.display_name | Life Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/2808 |
| topics[2].subfield.display_name | Neurology |
| topics[2].display_name | Brain Tumor Detection and Classification |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2504.10465 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2504.10465 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2504.10465 |
| locations[1].id | doi:10.48550/arxiv.2504.10465 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2504.10465 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5100375710 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-6643-4686 |
| authorships[0].author.display_name | Tao Zhang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhang, Tao |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5049449805 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Xiangtai Li |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Li, Xiangtai |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5101358906 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Zilong Huang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Huang, Zilong |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5100727821 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-5727-4772 |
| authorships[3].author.display_name | Yanwei Li |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Li, Yanwei |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5039070189 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Weixian Lei |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Lei, Weixian |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5102016553 |
| authorships[5].author.orcid | https://orcid.org/0000-0003-0991-6966 |
| authorships[5].author.display_name | Xueqing Deng |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Deng, Xueqing |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5055000437 |
| authorships[6].author.orcid | https://orcid.org/0000-0001-7646-8003 |
| authorships[6].author.display_name | Shihao Chen |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Chen, Shihao |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5031588692 |
| authorships[7].author.orcid | https://orcid.org/0000-0002-3088-1481 |
| authorships[7].author.display_name | Shunping Ji |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Ji, Shunping |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5100668696 |
| authorships[8].author.orcid | https://orcid.org/0000-0001-6843-0064 |
| authorships[8].author.display_name | Jiashi Feng |
| authorships[8].author_position | last |
| authorships[8].raw_author_name | Feng, Jiashi |
| authorships[8].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2504.10465 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-14T00:00:00 |
| display_name | Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T13114 |
| primary_topic.field.id | https://openalex.org/fields/22 |
| primary_topic.field.display_name | Engineering |
| primary_topic.score | 0.8991000056266785 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/2214 |
| primary_topic.subfield.display_name | Media Technology |
| primary_topic.display_name | Image Processing Techniques and Applications |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2504.10465 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2504.10465 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2504.10465 |
| primary_location.id | pmh:oai:arXiv.org:2504.10465 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2504.10465 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2504.10465 |
| publication_date | 2025-04-14 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index | (omitted: inverted-index encoding of the abstract shown in full above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 9 |
| citation_normalized_percentile | |
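OpenAlex ships abstracts not as plain text but as an inverted index mapping each word to the list of positions where it occurs (the `abstract_inverted_index` field in the payload). The original text can be reconstructed by sorting words by position:

```python
def abstract_from_inverted_index(inv_idx):
    """Rebuild abstract text from an OpenAlex-style inverted index.

    inv_idx: dict mapping each word to a list of its positions
    in the abstract, e.g. {"Multimodal": [0], "Large": [1], ...}.
    """
    positions = []
    for word, indices in inv_idx.items():
        for i in indices:
            positions.append((i, word))
    return " ".join(word for _, word in sorted(positions))


# Tiny example using the first words of this paper's abstract.
sample = {"Multimodal": [0], "Large": [1], "Language": [2], "Models": [3]}
text = abstract_from_inverted_index(sample)
# text == "Multimodal Large Language Models"
```

Applied to the full `abstract_inverted_index` of this record, the function yields exactly the abstract quoted near the top of this page.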