Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2311.17647
Recent multimodal large language models (MLLMs) have shown promising instruction-following capabilities on vision-language tasks. In this work, we introduce VISUAL MODALITY INSTRUCTION (VIM) and investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning. We adapt VIM to eight benchmarks, including OKVQA, MM-Vet, MathVista, and MMMU, and probe diverse MLLMs in both the text-modality instruction (TEM) setting and the VIM setting. Notably, we observe a significant performance disparity between the original TEM and VIM settings for open-source MLLMs, indicating that open-source MLLMs face greater challenges when text instructions are presented solely in image form. To address this issue, we train v-MLLM, a generalizable model capable of robust instruction following under both text-modality and visual-modality instructions.
Related Topics
- Type
- preprint
- Language
- en
- Landing Page
- http://arxiv.org/abs/2311.17647
- https://arxiv.org/pdf/2311.17647
- OA Status
- green
- Related Works
- 10
- OpenAlex ID
- https://openalex.org/W4389217785
Raw OpenAlex JSON
- OpenAlex ID
- https://openalex.org/W4389217785 (Canonical identifier for this work in OpenAlex)
- DOI
- https://doi.org/10.48550/arxiv.2311.17647 (Digital Object Identifier)
- Title
- Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? (Work title)
- Type
- preprint (OpenAlex work type)
- Language
- en (Primary language)
- Publication year
- 2023
- Publication date
- 2023-11-29 (Full publication date if available)
- Authors
- Yujie Lu, Xiujun Li, William Yang Wang, Yejin Choi (List of authors in order)
- Landing page
- https://arxiv.org/abs/2311.17647 (Publisher landing page)
- PDF URL
- https://arxiv.org/pdf/2311.17647 (Direct link to full-text PDF)
- Open access
- Yes (Whether a free full text is available)
- OA status
- green (Open access status per OpenAlex)
- OA URL
- https://arxiv.org/pdf/2311.17647 (Direct OA link when available)
- Concepts
- Computer science, Embedding, Context (archaeology), Comprehension, Human–computer interaction, Artificial intelligence, Programming language, Paleontology, Biology (Top concepts attached by OpenAlex)
- Cited by
- 0 (Total citation count in OpenAlex)
- Related works (count)
- 10 (Other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4389217785 |
| doi | https://doi.org/10.48550/arxiv.2311.17647 |
| ids.doi | https://doi.org/10.48550/arxiv.2311.17647 |
| ids.openalex | https://openalex.org/W4389217785 |
| fwci | |
| type | preprint |
| title | Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9944999814033508 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T13310 |
| topics[1].field.id | https://openalex.org/fields/12 |
| topics[1].field.display_name | Arts and Humanities |
| topics[1].score | 0.9456999897956848 |
| topics[1].domain.id | https://openalex.org/domains/2 |
| topics[1].domain.display_name | Social Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1203 |
| topics[1].subfield.display_name | Language and Linguistics |
| topics[1].display_name | Subtitles and Audiovisual Media |
| topics[2].id | https://openalex.org/T13629 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9072999954223633 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Text Readability and Simplification |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.7457652688026428 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C41608201 |
| concepts[1].level | 2 |
| concepts[1].score | 0.49058642983436584 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q980509 |
| concepts[1].display_name | Embedding |
| concepts[2].id | https://openalex.org/C2779343474 |
| concepts[2].level | 2 |
| concepts[2].score | 0.4760068356990814 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q3109175 |
| concepts[2].display_name | Context (archaeology) |
| concepts[3].id | https://openalex.org/C511192102 |
| concepts[3].level | 2 |
| concepts[3].score | 0.420653760433197 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q5156948 |
| concepts[3].display_name | Comprehension |
| concepts[4].id | https://openalex.org/C107457646 |
| concepts[4].level | 1 |
| concepts[4].score | 0.34468939900398254 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q207434 |
| concepts[4].display_name | Human–computer interaction |
| concepts[5].id | https://openalex.org/C154945302 |
| concepts[5].level | 1 |
| concepts[5].score | 0.33177614212036133 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[5].display_name | Artificial intelligence |
| concepts[6].id | https://openalex.org/C199360897 |
| concepts[6].level | 1 |
| concepts[6].score | 0.11956566572189331 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[6].display_name | Programming language |
| concepts[7].id | https://openalex.org/C151730666 |
| concepts[7].level | 1 |
| concepts[7].score | 0.0 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q7205 |
| concepts[7].display_name | Paleontology |
| concepts[8].id | https://openalex.org/C86803240 |
| concepts[8].level | 0 |
| concepts[8].score | 0.0 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q420 |
| concepts[8].display_name | Biology |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.7457652688026428 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/embedding |
| keywords[1].score | 0.49058642983436584 |
| keywords[1].display_name | Embedding |
| keywords[2].id | https://openalex.org/keywords/context |
| keywords[2].score | 0.4760068356990814 |
| keywords[2].display_name | Context (archaeology) |
| keywords[3].id | https://openalex.org/keywords/comprehension |
| keywords[3].score | 0.420653760433197 |
| keywords[3].display_name | Comprehension |
| keywords[4].id | https://openalex.org/keywords/human–computer-interaction |
| keywords[4].score | 0.34468939900398254 |
| keywords[4].display_name | Human–computer interaction |
| keywords[5].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[5].score | 0.33177614212036133 |
| keywords[5].display_name | Artificial intelligence |
| keywords[6].id | https://openalex.org/keywords/programming-language |
| keywords[6].score | 0.11956566572189331 |
| keywords[6].display_name | Programming language |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2311.17647 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2311.17647 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2311.17647 |
| locations[1].id | doi:10.48550/arxiv.2311.17647 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2311.17647 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5036525093 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-0691-2129 |
| authorships[0].author.display_name | Yujie Lu |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Lu, Yujie |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5021140826 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-7771-2725 |
| authorships[1].author.display_name | Xiujun Li |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Li, Xiujun |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5100702485 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-6153-8240 |
| authorships[2].author.display_name | William Yang Wang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Wang, William Yang |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5102992157 |
| authorships[3].author.orcid | https://orcid.org/0000-0003-3032-5378 |
| authorships[3].author.display_name | Yejin Choi |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Choi, Yejin |
| authorships[3].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2311.17647 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2023-12-01T00:00:00 |
| display_name | Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9944999814033508 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| related_works | https://openalex.org/W2081900870, https://openalex.org/W2345479200, https://openalex.org/W2183306018, https://openalex.org/W2849310602, https://openalex.org/W3006008237, https://openalex.org/W2616627668, https://openalex.org/W2419146053, https://openalex.org/W4388890789, https://openalex.org/W2088247287, https://openalex.org/W3137121595 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2311.17647 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2311.17647 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2311.17647 |
| primary_location.id | pmh:oai:arXiv.org:2311.17647 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2311.17647 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2311.17647 |
| publication_date | 2023-11-29 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| abstract_inverted_index | (token-to-positions index of the abstract; the abstract is reproduced in full above) |
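OpenAlex stores abstracts not as plain text but as an inverted index: each token is mapped to the list of word positions where it occurs. Placing every token at its positions and joining in order recovers the abstract quoted at the top of this page. A minimal sketch of that reconstruction (the function name is ours):

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain abstract text from an OpenAlex abstract_inverted_index.

    inverted_index maps token -> list of integer word positions.
    """
    positions = {}
    for token, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = token
    # Joining tokens in position order yields the original text.
    return " ".join(positions[i] for i in sorted(positions))

# A small fragment of the index above, for illustration.
sample = {"Recent": [0], "multimodal": [1], "large": [2],
          "language": [3], "models": [4]}
print(reconstruct_abstract(sample))
# Recent multimodal large language models
```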
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.8500000238418579 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile | |
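The dot-path keys in the payload table (e.g. `topics[0].field.id`, `open_access.oa_url`) are a conventional flattening of the nested JSON record: object keys are joined with `.` and list elements get `[i]` suffixes. A small sketch of that flattening, assuming only standard JSON types:

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into dot-path keys like topics[0].field.id."""
    rows = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            rows.update(flatten(value, path))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            rows.update(flatten(value, f"{prefix}[{i}]"))
    else:
        rows[prefix] = obj  # leaf value: one table row
    return rows

record = {"topics": [{"field": {"id": "https://openalex.org/fields/17"}}]}
print(flatten(record))
# {'topics[0].field.id': 'https://openalex.org/fields/17'}
```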