Enhancing Large Vision Language Models with Self-Training on Image Comprehension

arXiv (Cornell University), May 2024
Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai‐Wei Chang, Wei Wang

Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the model's perception capability so that it can understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings at alleviating the need for labeled data by leveraging the model's own generation. How…
