Enhancing Large Vision Language Models with Self-Training on Image Comprehension
2024 · Open Access
DOI: https://doi.org/10.48550/arxiv.2405.19716
Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the model's perception capability to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings at alleviating the need for labeled data by leveraging the model's own generations. However, effective self-training remains a challenge for the unique visual perception and reasoning capabilities of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), a self-training approach tailored specifically to image comprehension. First, the model self-constructs a preference dataset for image descriptions from unlabeled images: preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts. We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies investigate various components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training. Code and data are made publicly available.
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2405.19716
- PDF: https://arxiv.org/pdf/2405.19716
- OA Status: green
- Cited By: 1
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4399252119
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4399252119 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2405.19716 (Digital Object Identifier)
- Title: Enhancing Large Vision Language Models with Self-Training on Image Comprehension (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-05-30 (full publication date if available)
- Authors: Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai‐Wei Chang, Wei Wang (list of authors in order)
- Landing page: https://arxiv.org/abs/2405.19716 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2405.19716 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2405.19716 (direct OA link when available)
- Concepts: Comprehension, Training (meteorology), Image (mathematics), Computer science, Artificial intelligence, Computer vision, Natural language processing, Psychology, Cognitive psychology, Geography, Programming language, Meteorology (top concepts attached by OpenAlex)
- Cited by: 1 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 1 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| field | value |
|---|---|
| id | https://openalex.org/W4399252119 |
| doi | https://doi.org/10.48550/arxiv.2405.19716 |
| ids.doi | https://doi.org/10.48550/arxiv.2405.19716 |
| ids.openalex | https://openalex.org/W4399252119 |
| fwci | |
| type | preprint |
| title | Enhancing Large Vision Language Models with Self-Training on Image Comprehension |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9545000195503235 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C511192102 |
| concepts[0].level | 2 |
| concepts[0].score | 0.6694580912590027 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q5156948 |
| concepts[0].display_name | Comprehension |
| concepts[1].id | https://openalex.org/C2777211547 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6065587401390076 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q17141490 |
| concepts[1].display_name | Training (meteorology) |
| concepts[2].id | https://openalex.org/C115961682 |
| concepts[2].level | 2 |
| concepts[2].score | 0.5868142247200012 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q860623 |
| concepts[2].display_name | Image (mathematics) |
| concepts[3].id | https://openalex.org/C41008148 |
| concepts[3].level | 0 |
| concepts[3].score | 0.5585402846336365 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[3].display_name | Computer science |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.42888548970222473 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| concepts[5].id | https://openalex.org/C31972630 |
| concepts[5].level | 1 |
| concepts[5].score | 0.3863084614276886 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q844240 |
| concepts[5].display_name | Computer vision |
| concepts[6].id | https://openalex.org/C204321447 |
| concepts[6].level | 1 |
| concepts[6].score | 0.36513984203338623 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[6].display_name | Natural language processing |
| concepts[7].id | https://openalex.org/C15744967 |
| concepts[7].level | 0 |
| concepts[7].score | 0.35456883907318115 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[7].display_name | Psychology |
| concepts[8].id | https://openalex.org/C180747234 |
| concepts[8].level | 1 |
| concepts[8].score | 0.34942173957824707 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q23373 |
| concepts[8].display_name | Cognitive psychology |
| concepts[9].id | https://openalex.org/C205649164 |
| concepts[9].level | 0 |
| concepts[9].score | 0.09690457582473755 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q1071 |
| concepts[9].display_name | Geography |
| concepts[10].id | https://openalex.org/C199360897 |
| concepts[10].level | 1 |
| concepts[10].score | 0.06923693418502808 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[10].display_name | Programming language |
| concepts[11].id | https://openalex.org/C153294291 |
| concepts[11].level | 1 |
| concepts[11].score | 0.0 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q25261 |
| concepts[11].display_name | Meteorology |
| keywords[0].id | https://openalex.org/keywords/comprehension |
| keywords[0].score | 0.6694580912590027 |
| keywords[0].display_name | Comprehension |
| keywords[1].id | https://openalex.org/keywords/training |
| keywords[1].score | 0.6065587401390076 |
| keywords[1].display_name | Training (meteorology) |
| keywords[2].id | https://openalex.org/keywords/image |
| keywords[2].score | 0.5868142247200012 |
| keywords[2].display_name | Image (mathematics) |
| keywords[3].id | https://openalex.org/keywords/computer-science |
| keywords[3].score | 0.5585402846336365 |
| keywords[3].display_name | Computer science |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.42888548970222473 |
| keywords[4].display_name | Artificial intelligence |
| keywords[5].id | https://openalex.org/keywords/computer-vision |
| keywords[5].score | 0.3863084614276886 |
| keywords[5].display_name | Computer vision |
| keywords[6].id | https://openalex.org/keywords/natural-language-processing |
| keywords[6].score | 0.36513984203338623 |
| keywords[6].display_name | Natural language processing |
| keywords[7].id | https://openalex.org/keywords/psychology |
| keywords[7].score | 0.35456883907318115 |
| keywords[7].display_name | Psychology |
| keywords[8].id | https://openalex.org/keywords/cognitive-psychology |
| keywords[8].score | 0.34942173957824707 |
| keywords[8].display_name | Cognitive psychology |
| keywords[9].id | https://openalex.org/keywords/geography |
| keywords[9].score | 0.09690457582473755 |
| keywords[9].display_name | Geography |
| keywords[10].id | https://openalex.org/keywords/programming-language |
| keywords[10].score | 0.06923693418502808 |
| keywords[10].display_name | Programming language |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2405.19716 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2405.19716 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2405.19716 |
| locations[1].id | doi:10.48550/arxiv.2405.19716 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2405.19716 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5109683001 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Yihe Deng |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Deng, Yihe |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100567599 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-2193-8415 |
| authorships[1].author.display_name | Pan Lu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Lu, Pan |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5053347296 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-8028-2217 |
| authorships[2].author.display_name | Fan Yin |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Yin, Fan |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5114973490 |
| authorships[3].author.orcid | https://orcid.org/0000-0003-4663-0166 |
| authorships[3].author.display_name | Ziniu Hu |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Hu, Ziniu |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5100784815 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-8773-6365 |
| authorships[4].author.display_name | Sheng Shen |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Shen, Sheng |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5005779176 |
| authorships[5].author.orcid | https://orcid.org/0000-0001-8880-4764 |
| authorships[5].author.display_name | James Zou |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Zou, James |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5073201681 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-4991-5274 |
| authorships[6].author.display_name | Kai‐Wei Chang |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Chang, Kai-Wei |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5100335759 |
| authorships[7].author.orcid | https://orcid.org/0000-0002-3733-3939 |
| authorships[7].author.display_name | Wei Wang |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Wang, Wei |
| authorships[7].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2405.19716 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-06-01T00:00:00 |
| display_name | Enhancing Large Vision Language Models with Self-Training on Image Comprehension |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9545000195503235 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| related_works | https://openalex.org/W230091440, https://openalex.org/W2233261550, https://openalex.org/W2810751659, https://openalex.org/W258997015, https://openalex.org/W2997094352, https://openalex.org/W3216976533, https://openalex.org/W100620283, https://openalex.org/W2495260952, https://openalex.org/W4366179611, https://openalex.org/W2996078371 |
| cited_by_count | 1 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2405.19716 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2405.19716 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2405.19716 |
| primary_location.id | pmh:oai:arXiv.org:2405.19716 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2405.19716 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2405.19716 |
| publication_date | 2024-05-30 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index | (word-position index of the abstract; the prose abstract is reproduced above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| citation_normalized_percentile | |