arXiv (Cornell University)
FastVLM: Efficient Vision Encoding for Vision Language Models
December 2024 • Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, G. Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, …
Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens …
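To make the scaling argument concrete, here is a small illustrative calculation (not from the paper) of how a ViT-style encoder's token count, and hence its per-layer self-attention cost, grows with input resolution. The patch size of 14 is an assumption (ViT-L/14-style); the key point is that tokens grow quadratically with resolution and attention cost grows quadratically with tokens:

```python
def vit_tokens(resolution: int, patch: int = 14) -> int:
    """Number of patch tokens for a square input image (assumed patch size)."""
    return (resolution // patch) ** 2

def attention_pairs(tokens: int) -> int:
    """Pairwise token interactions per self-attention layer: O(tokens^2)."""
    return tokens * tokens

for res in (224, 448, 896):
    n = vit_tokens(res)
    print(f"{res}px -> {n} tokens, {attention_pairs(n):,} pairwise interactions")
# 224px -> 256 tokens, 65,536 pairwise interactions
# 448px -> 1,024 tokens, 1,048,576 pairwise interactions
# 896px -> 4,096 tokens, 16,777,216 pairwise interactions
```

Doubling the resolution quadruples the token count and multiplies the attention work by sixteen, which is why both axes the abstract names, encoding latency and visual token count, become bottlenecks at high resolution.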