arXiv (Cornell University)
Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
November 2023 • Yujie Lu, Xiujun Li, William Yang Wang, Yejin Choi
Recent multimodal large language models (MLLMs) have shown promising instruction following capabilities on vision-language tasks. In this work, we introduce VISUAL MODALITY INSTRUCTION (VIM), and investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning. We adapt VIM to eight benchmarks, including OKVQA, MM-Vet, MathVista, MMMU, and probe diverse MLLMs in both the text-modality instruction (TEM)…
Computer Science
Human–Computer Interaction
Artificial Intelligence