MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2406.17770
Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Being trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B, to evaluate the model's performance comprehensively. Extensive evaluations across multiple benchmarks demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code will be available at https://github.com/PhoenixZ810/MG-LLaVA.
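The abstract sketches a three-stream design: a base low-resolution encoder, a high-resolution encoder whose features are merged through a Conv-Gate fusion network, and object-level features taken from offline detector boxes. As a rough illustration only, here is a minimal PyTorch-style sketch of such a conv-gated fusion; the channel width, 1x1 kernel, sigmoid gate, and residual form are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ConvGateFusion(nn.Module):
    """Illustrative conv-gated fusion of low- and high-resolution visual
    features (hypothetical shapes; not the released MG-LLaVA code)."""

    def __init__(self, channels: int = 1024):
        super().__init__()
        # Predict a per-location, per-channel gate from both feature maps.
        self.gate_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, channels, H, W); the high-resolution branch is
        # assumed to have been resized to the low-resolution grid already.
        gate = torch.sigmoid(self.gate_conv(torch.cat([low_feat, high_feat], dim=1)))
        # The gate controls how much high-resolution detail is injected.
        return low_feat + gate * high_feat

# Toy usage: fused features would then be flattened into visual tokens and
# combined with object-level features before entering the language model.
low = torch.randn(1, 1024, 24, 24)
high = torch.randn(1, 1024, 24, 24)
fused = ConvGateFusion()(low, high)   # shape: (1, 1024, 24, 24)
```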
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2406.17770 · https://arxiv.org/pdf/2406.17770
- OA Status: green
- Cited By: 1
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4400065285
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4400065285 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2406.17770 (Digital Object Identifier)
- Title: MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-06-25 (full publication date if available)
- Authors: Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang (list of authors in order)
- Landing page: https://arxiv.org/abs/2406.17770 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2406.17770 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2406.17770 (direct OA link when available)
- Concepts: Granularity, Computer science, Artificial intelligence, Programming language (top concepts/fields attached by OpenAlex)
- Cited by: 1 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 1 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
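The summary above mirrors fields of the OpenAlex work object. As a minimal sketch (assuming only the public OpenAlex REST endpoint and fields shown in the payload below), the same record can be fetched directly:

```python
import json
import urllib.request

# Public OpenAlex API; no key required. OpenAlex recommends, but does not
# require, identifying yourself via a mailto in the User-Agent header.
URL = "https://api.openalex.org/works/W4400065285"

with urllib.request.urlopen(URL) as resp:
    work = json.load(resp)

print(work["display_name"])            # MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
print(work["publication_date"])        # 2024-06-25
print(work["open_access"]["oa_url"])   # https://arxiv.org/pdf/2406.17770
print(work["cited_by_count"])          # citation count at query time
```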
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4400065285 |
| doi | https://doi.org/10.48550/arxiv.2406.17770 |
| ids.doi | https://doi.org/10.48550/arxiv.2406.17770 |
| ids.openalex | https://openalex.org/W4400065285 |
| fwci | |
| type | preprint |
| title | MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11439 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9182999730110168 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Video Analysis and Summarization |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C177774035 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8803043365478516 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1246948 |
| concepts[0].display_name | Granularity |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.6408911347389221 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C154945302 |
| concepts[2].level | 1 |
| concepts[2].score | 0.3327629566192627 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[2].display_name | Artificial intelligence |
| concepts[3].id | https://openalex.org/C199360897 |
| concepts[3].level | 1 |
| concepts[3].score | 0.23667901754379272 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[3].display_name | Programming language |
| keywords[0].id | https://openalex.org/keywords/granularity |
| keywords[0].score | 0.8803043365478516 |
| keywords[0].display_name | Granularity |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.6408911347389221 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[2].score | 0.3327629566192627 |
| keywords[2].display_name | Artificial intelligence |
| keywords[3].id | https://openalex.org/keywords/programming-language |
| keywords[3].score | 0.23667901754379272 |
| keywords[3].display_name | Programming language |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2406.17770 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2406.17770 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2406.17770 |
| locations[1].id | doi:10.48550/arxiv.2406.17770 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2406.17770 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5100645854 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-2926-4416 |
| authorships[0].author.display_name | Xiangyu Zhao |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhao, Xiangyu |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5089900108 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-0550-8247 |
| authorships[1].author.display_name | Xiangtai Li |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Li, Xiangtai |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5028468431 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-3052-4177 |
| authorships[2].author.display_name | Haodong Duan |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Duan, Haodong |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5058287590 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Haian Huang |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Huang, Haian |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5101546250 |
| authorships[4].author.orcid | https://orcid.org/0000-0003-4761-2293 |
| authorships[4].author.display_name | Yining Li |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Li, Yining |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5048500768 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-3930-8294 |
| authorships[5].author.display_name | Kai Chen |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Chen, Kai |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5113387106 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-6229-8944 |
| authorships[6].author.display_name | Hua Yang |
| authorships[6].author_position | last |
| authorships[6].raw_author_name | Yang, Hua |
| authorships[6].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2406.17770 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11439 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9182999730110168 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Video Analysis and Summarization |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W2931688134, https://openalex.org/W2377919138, https://openalex.org/W2378857091, https://openalex.org/W2999756192, https://openalex.org/W103652678, https://openalex.org/W4226090359, https://openalex.org/W2059697060, https://openalex.org/W936373746 |
| cited_by_count | 1 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2406.17770 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2406.17770 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2406.17770 |
| primary_location.id | pmh:oai:arXiv.org:2406.17770 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2406.17770 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2406.17770 |
| publication_date | 2024-06-25 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index | (word-to-position index of the abstract; the full abstract text is reproduced above; see the reconstruction sketch after this table) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 7 |
| citation_normalized_percentile | |
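OpenAlex ships abstracts as an inverted index (each token mapped to its word positions), which is why the abstract_inverted_index field above is summarized rather than listed row by row. Below is a minimal sketch of rebuilding the plain text from such a mapping; the sample dictionary is a hypothetical truncation of the real field.

```python
def reconstruct_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild abstract text from an OpenAlex-style inverted index."""
    positions: dict[int, str] = {}
    for token, idxs in inverted_index.items():
        for i in idxs:          # a token may occur at several positions
            positions[i] = token
    return " ".join(positions[i] for i in sorted(positions))

# Truncated sample mirroring the start of the abstract_inverted_index field.
sample = {"Multi-modal": [0], "large": [1], "language": [2], "models": [3], "(MLLMs)": [4]}
print(reconstruct_abstract(sample))  # Multi-modal large language models (MLLMs)
```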