ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability

2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2503.06542
Unified multimodal understanding and generation have recently received much attention in vision and language research. Existing unified models (UniMs) are designed to learn multimodal understanding and generation simultaneously, which demands substantial computational resources, and they often struggle to generate interleaved text-image content. We present ARMOR, a resource-efficient, purely autoregressive framework that achieves both understanding and generation by fine-tuning existing multimodal large language models (MLLMs). Specifically, ARMOR extends existing MLLMs from three perspectives: (1) Model architecture: an asymmetric encoder-decoder architecture with a forward-switching mechanism unifies the embedding space of textual and visual modalities, enabling natural text-image interleaved generation with minimal computational overhead. (2) Training data: a meticulously curated, high-quality interleaved dataset is collected for fine-tuning MLLMs. (3) Training algorithm: a ``what or how to generate'' algorithm empowers existing MLLMs with multimodal generation capabilities while preserving their multimodal understanding, through three progressive training stages on the collected dataset. Experimental results demonstrate that ARMOR upgrades existing MLLMs into UniMs with promising image generation capability using limited training resources. Our code will be released soon at https://github.com/finyorko/armor.
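The forward-switching mechanism and the ``what or how to generate'' decision described in the abstract can be pictured as an autoregressive loop that decides at each step whether to keep emitting text or to switch into (and back out of) image-token generation. The sketch below is purely illustrative: the marker tokens `<img>`/`<eoi>`, the `decode` helper, and the `step_fn` callback are hypothetical names, not ARMOR's actual interface.

```python
def decode(step_fn, max_steps=32):
    """Autoregressively decode, switching modality on <img> / <eoi> markers.

    step_fn(t, mode) stands in for one forward pass of the model: it returns
    the next token given the step index and the current decoding mode.
    """
    out, mode = [], "text"
    for t in range(max_steps):
        tok = step_fn(t, mode)
        if mode == "text":
            if tok == "<eos>":
                break            # "what to generate": nothing more
            if tok == "<img>":
                mode = "image"   # forward switch: next tokens are image codes
                out.append(tok)
                continue
            out.append(tok)
        else:
            if tok == "<eoi>":
                mode = "text"    # switch back after the image span
                out.append(tok)
                continue
            out.append(f"img_{tok}")  # discrete image-codebook token
    return out

# Toy "model": emits some text, an image span, then more text.
script = ["A", "cat:", "<img>", 7, 3, "<eoi>", "done", "<eos>"]
print(decode(lambda t, mode: script[t]))
# → ['A', 'cat:', '<img>', 'img_7', 'img_3', '<eoi>', 'done']
```

The point of the toy is only that a single autoregressive stream can carry interleaved text and image tokens when special markers flip the decoding mode.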
Metadata
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2503.06542
- PDF: https://arxiv.org/pdf/2503.06542
- OA Status: green
- OpenAlex ID: https://openalex.org/W4417094989
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4417094989 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2503.06542 (Digital Object Identifier)
- Title: ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025
- Publication date: 2025-03-09
- Authors: Jianwen Sun, Yulong Feng, Fanrui Zhang, Jingjing Ai, S. Kevin Zhou, Shenglin Zhang, Kaipeng Zhang (in order)
- Landing page: https://arxiv.org/abs/2503.06542
- PDF URL: https://arxiv.org/pdf/2503.06542
- Open access: Yes
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2503.06542
- Cited by: 0 (total citation count in OpenAlex)
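The raw record shown below can be retrieved directly from the OpenAlex REST API, whose documented endpoint pattern for works is `https://api.openalex.org/works/<id>` and which requires no authentication for basic use. A minimal sketch:

```python
import json
from urllib.request import urlopen

WORK_ID = "W4417094989"  # this work's short OpenAlex ID

def openalex_url(work_id: str) -> str:
    """Build the OpenAlex REST endpoint for a work from its short ID."""
    return f"https://api.openalex.org/works/{work_id}"

if __name__ == "__main__":
    # Network call kept under __main__ so importing this module stays side-effect free.
    with urlopen(openalex_url(WORK_ID)) as resp:
        work = json.load(resp)
    print(work["display_name"], work["open_access"]["oa_status"])
```

The same fields tabulated below (`display_name`, `open_access.oa_status`, `cited_by_count`, and so on) appear as keys in the returned JSON.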
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4417094989 |
| doi | https://doi.org/10.48550/arxiv.2503.06542 |
| ids.doi | https://doi.org/10.48550/arxiv.2503.06542 |
| ids.openalex | https://openalex.org/W4417094989 |
| fwci | |
| type | preprint |
| title | ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2503.06542 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2503.06542 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2503.06542 |
| locations[1].id | doi:10.48550/arxiv.2503.06542 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2503.06542 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5005771841 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Jianwen Sun |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Sun, Jianwen |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5050688781 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-8877-5027 |
| authorships[1].author.display_name | Yulong Feng |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Feng, Yukang |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5037011537 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Fanrui Zhang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Zhang, Fanrui |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5038964194 |
| authorships[3].author.orcid | https://orcid.org/0009-0005-1780-8553 |
| authorships[3].author.display_name | Jingjing Ai |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Ai, Jiaxin |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5028465673 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-6881-4444 |
| authorships[4].author.display_name | S. Kevin Zhou |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Zhou, Sizhuo |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5115595269 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Shenglin Zhang |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Zhang, Shenglin |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5101647736 |
| authorships[6].author.orcid | https://orcid.org/0009-0006-3937-1441 |
| authorships[6].author.display_name | Kaipeng Zhang |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Zhang, Kaipeng |
| authorships[6].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2503.06542 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-12-07T17:40:28.077893 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2503.06542 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2503.06542 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2503.06542 |
| primary_location.id | pmh:oai:arXiv.org:2503.06542 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2503.06542 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2503.06542 |
| publication_date | 2025-03-09 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 7 |
| citation_normalized_percentile |
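OpenAlex does not distribute abstracts as plain text; the API ships an `abstract_inverted_index` field mapping each word to the positions where it occurs. A minimal sketch reconstructing readable text from that structure (the `sample` dict is a toy fragment, not the full ARMOR abstract):

```python
def reconstruct_abstract(inverted_index: dict) -> str:
    """Place each word at its recorded positions, then join in position order."""
    positions = {}
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = word
    return " ".join(positions[i] for i in sorted(positions))

# Toy fragment of an inverted index: word -> list of positions.
sample = {"Unified": [0], "multimodal": [1], "understanding": [2]}
print(reconstruct_abstract(sample))  # → Unified multimodal understanding
```

This is the standard way to recover a display abstract from an OpenAlex work record.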