arXiv (Cornell University)
Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
November 2023 • Yujie Lu, Xiujun Li, William Yang Wang, Yejin Choi
Recent multimodal large language models (MLLMs) have shown promising instruction following capabilities on vision-language tasks. In this work, we introduce VISUAL MODALITY INSTRUCTION (VIM), and investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning. We adapt VIM to eight benchmarks, including OKVQA, MM-Vet, MathVista, MMMU, and probe diverse MLLMs in both the text-modality instruction (TEM)…
Computer Science
Human–Computer Interaction
Artificial Intelligence