FastVLM: Efficient Vision Encoding for Vision Language Models

- 2024
- Open Access
- DOI: https://doi.org/10.48550/arxiv.2412.13303
Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and LLM size, we introduce FastVLM, a model that achieves an optimized trade-off between latency, model size, and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves a 3.2$\times$ improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. Compared to LLaVA-OneVision at the highest resolution (1152$\times$1152), FastVLM achieves better performance on key benchmarks like SeedBench, MMMU, and DocVQA using the same 0.5B LLM, but with 85$\times$ faster TTFT and a vision encoder that is 3.4$\times$ smaller. Code and models are available at https://github.com/apple/ml-fastvlm.
Related Topics

- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2412.13303, https://arxiv.org/pdf/2412.13303
- OA Status: green
- Cited By: 1
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4405875854
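The metadata above is drawn from OpenAlex. As a minimal sketch (assuming the documented `GET /works/{id}` route of the public OpenAlex REST API at `api.openalex.org`), the same record can be fetched directly:

```python
import json
import urllib.request

OPENALEX_API = "https://api.openalex.org/works/"

def work_url(openalex_id: str) -> str:
    """Build the API URL for a work from its short ID or full OpenAlex ID URL."""
    short_id = openalex_id.rsplit("/", 1)[-1]  # e.g. "W4405875854"
    return OPENALEX_API + short_id

def fetch_work(openalex_id: str) -> dict:
    """Fetch the raw work record (the payload shown below) as a dict."""
    with urllib.request.urlopen(work_url(openalex_id)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    work = fetch_work("https://openalex.org/W4405875854")
    print(work["display_name"])
```

The helper accepts either the bare ID or the canonical ID URL, since OpenAlex uses both forms interchangeably in its records.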
Raw OpenAlex JSON

- OpenAlex ID: https://openalex.org/W4405875854 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2412.13303 (Digital Object Identifier)
- Title: FastVLM: Efficient Vision Encoding for Vision Language Models
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024
- Publication date: 2024-12-17
- Authors: Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chunliang Li, Cem Koc, Nate True, Albert Antony, G. Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari (in order)
- Landing page: https://arxiv.org/abs/2412.13303
- PDF URL: https://arxiv.org/pdf/2412.13303
- Open access: Yes
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2412.13303
- Concepts: Encoding (memory), Computer science, Artificial intelligence, Computer vision (top concepts attached by OpenAlex)
- Cited by: 1 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 1
- Related works (count): 10 (works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4405875854 |
| doi | https://doi.org/10.48550/arxiv.2412.13303 |
| ids.doi | https://doi.org/10.48550/arxiv.2412.13303 |
| ids.openalex | https://openalex.org/W4405875854 |
| fwci | |
| type | preprint |
| title | FastVLM: Efficient Vision Encoding for Vision Language Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9930999875068665 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T10627 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9785000085830688 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Advanced Image and Video Retrieval Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C125411270 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7158800363540649 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q18653 |
| concepts[0].display_name | Encoding (memory) |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.6017187833786011 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C154945302 |
| concepts[2].level | 1 |
| concepts[2].score | 0.4692608714103699 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[2].display_name | Artificial intelligence |
| concepts[3].id | https://openalex.org/C31972630 |
| concepts[3].level | 1 |
| concepts[3].score | 0.4582805037498474 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q844240 |
| concepts[3].display_name | Computer vision |
| keywords[0].id | https://openalex.org/keywords/encoding |
| keywords[0].score | 0.7158800363540649 |
| keywords[0].display_name | Encoding (memory) |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.6017187833786011 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[2].score | 0.4692608714103699 |
| keywords[2].display_name | Artificial intelligence |
| keywords[3].id | https://openalex.org/keywords/computer-vision |
| keywords[3].score | 0.4582805037498474 |
| keywords[3].display_name | Computer vision |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2412.13303 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2412.13303 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2412.13303 |
| locations[1].id | doi:10.48550/arxiv.2412.13303 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2412.13303 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5013120724 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Pavan Kumar Anasosalu Vasu |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Vasu, Pavan Kumar Anasosalu |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5036601505 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-5975-5158 |
| authorships[1].author.display_name | Fartash Faghri |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Faghri, Fartash |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5102972645 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-5938-5510 |
| authorships[2].author.display_name | Chunliang Li |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Li, Chun-Liang |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5104313177 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Cem Koc |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Koc, Cem |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5115691056 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Nate True |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | True, Nate |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5026835553 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Albert Antony |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Antony, Albert |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5083007242 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | G. Santhanam |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Santhanam, Gokul |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5017283719 |
| authorships[7].author.orcid | |
| authorships[7].author.display_name | James Gabriel |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Gabriel, James |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5094170434 |
| authorships[8].author.orcid | |
| authorships[8].author.display_name | Peter Grasch |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Grasch, Peter |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5028613002 |
| authorships[9].author.orcid | |
| authorships[9].author.display_name | Oncel Tuzel |
| authorships[9].author_position | middle |
| authorships[9].raw_author_name | Tuzel, Oncel |
| authorships[9].is_corresponding | False |
| authorships[10].author.id | https://openalex.org/A5059295598 |
| authorships[10].author.orcid | |
| authorships[10].author.display_name | Hadi Pouransari |
| authorships[10].author_position | last |
| authorships[10].raw_author_name | Pouransari, Hadi |
| authorships[10].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2412.13303 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | FastVLM: Efficient Vision Encoding for Vision Language Models |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9930999875068665 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| related_works | https://openalex.org/W2772917594, https://openalex.org/W2036807459, https://openalex.org/W2058170566, https://openalex.org/W2755342338, https://openalex.org/W2166024367, https://openalex.org/W3116076068, https://openalex.org/W2229312674, https://openalex.org/W2951359407, https://openalex.org/W2079911747, https://openalex.org/W1969923398 |
| cited_by_count | 1 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2412.13303 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2412.13303 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2412.13303 |
| primary_location.id | pmh:oai:arXiv.org:2412.13303 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2412.13303 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2412.13303 |
| publication_date | 2024-12-17 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index | (machine-readable inverted index of the abstract; omitted here — the full abstract appears above) |
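OpenAlex does not store abstracts as plain text; the `abstract_inverted_index` field maps each word to the list of positions where it occurs. A minimal sketch of turning that structure back into readable text (the sample dict below is a hypothetical four-word excerpt in the same shape as the payload rows):

```python
def reconstruct_abstract(inverted_index: dict) -> str:
    """Rebuild plain abstract text from an OpenAlex abstract_inverted_index,
    which maps each word to the list of positions where it occurs."""
    positions = []
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions.append((i, word))
    # Sort by position, then join the words back into a single string.
    return " ".join(word for _, word in sorted(positions))

# Tiny sample in the same shape as the payload rows (positions 0-3):
sample = {"Scaling": [0], "the": [1], "input": [2], "image": [3]}
print(reconstruct_abstract(sample))  # -> "Scaling the input image"
```

Sorting the (position, word) pairs is enough because each position index appears exactly once across the whole mapping.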
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 11 |
| citation_normalized_percentile |