Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2402.12728
Knowledge-based visual question answering (KVQA) has been extensively studied to answer visual questions with external knowledge, e.g., knowledge graphs (KGs). While several attempts have been made to leverage large language models (LLMs) as an implicit knowledge source, this remains challenging because LLMs may generate hallucinations. Moreover, multiple knowledge sources, e.g., images, KGs, and LLMs, cannot be readily aligned in complex scenarios. To tackle these issues, we present MAIL, a novel modality-aware integration with LLMs for KVQA. It carefully leverages multimodal knowledge for both image understanding and knowledge reasoning. Specifically, (i) we propose a two-stage prompting strategy with LLMs to densely embody the image into a scene graph with detailed visual features; (ii) we construct a coupled concept graph by linking the mentioned entities with external facts; (iii) we design a tailored pseudo-siamese graph medium fusion for sufficient multimodal fusion. We use the mentioned entities shared by the two graphs as mediums to bridge a tight inter-modal exchange, while maximally preserving insightful intra-modal learning by constraining the fusion within the mediums. Extensive experiments on two benchmark datasets show the superiority of MAIL with 24× fewer computational resources.
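The medium-based fusion the abstract describes can be illustrated with a toy sketch. This is not the authors' implementation: the graphs, node names, embedding dimension, and the simple averaging rules below are all illustrative assumptions. The point it shows is structural: each graph propagates information internally (intra-modal learning), while cross-graph exchange happens only at the shared "medium" entities.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding size (assumption)

# Two modality-specific graphs: a scene graph (visual side) and a concept
# graph (factual side). Nodes and edges are made up for illustration.
scene_edges = {"man": ["frisbee", "park"], "frisbee": ["man"], "park": ["man"]}
concept_edges = {"frisbee": ["toy", "sport"], "toy": ["frisbee"],
                 "sport": ["frisbee"], "park": ["outdoors"], "outdoors": ["park"]}
mediums = {"frisbee", "park"}  # entities mentioned in both graphs

def init(graph):
    """Random initial node embeddings."""
    return {n: rng.normal(size=DIM) for n in graph}

def intra_step(graph, h):
    """One intra-modal step: each node averages itself with its neighbors."""
    return {n: np.mean([h[n]] + [h[m] for m in graph[n]], axis=0) for n in graph}

h_scene, h_concept = init(scene_edges), init(concept_edges)
for _ in range(2):
    # Intra-modal learning stays inside each graph...
    h_scene = intra_step(scene_edges, h_scene)
    h_concept = intra_step(concept_edges, h_concept)
    # ...while inter-modal exchange is constrained to the shared mediums.
    for m in mediums:
        fused = (h_scene[m] + h_concept[m]) / 2
        h_scene[m] = h_concept[m] = fused

# After fusion, only the medium embeddings agree across the two graphs.
assert all(np.allclose(h_scene[m], h_concept[m]) for m in mediums)
```

Constraining the exchange to the mediums is what lets each graph keep its own modality-specific structure while still sharing evidence about the entities they have in common.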
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2402.12728 · https://arxiv.org/pdf/2402.12728
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4392019961
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4392019961 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2402.12728 (Digital Object Identifier)
- Title: Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024
- Publication date: 2024-02-20
- Authors: Junnan Dong, Qinggang Zhang, Huachi Zhou, Daochen Zha, Pai Zheng, Xiao Huang (in order)
- Landing page: https://arxiv.org/abs/2402.12728 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2402.12728 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2402.12728 (direct OA link when available)
- Concepts: Question answering; Modality (human–computer interaction); Computer science; Natural language processing; Artificial intelligence; Information retrieval (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4392019961 |
| doi | https://doi.org/10.48550/arxiv.2402.12728 |
| ids.doi | https://doi.org/10.48550/arxiv.2402.12728 |
| ids.openalex | https://openalex.org/W4392019961 |
| fwci | |
| type | preprint |
| title | Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9975000023841858 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T10757 |
| topics[1].field.id | https://openalex.org/fields/33 |
| topics[1].field.display_name | Social Sciences |
| topics[1].score | 0.9607999920845032 |
| topics[1].domain.id | https://openalex.org/domains/2 |
| topics[1].domain.display_name | Social Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/3305 |
| topics[1].subfield.display_name | Geography, Planning and Development |
| topics[1].display_name | Geographic Information Systems Studies |
| topics[2].id | https://openalex.org/T12031 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9564999938011169 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Speech and dialogue systems |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C44291984 |
| concepts[0].level | 2 |
| concepts[0].score | 0.867228627204895 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1074173 |
| concepts[0].display_name | Question answering |
| concepts[1].id | https://openalex.org/C2780226545 |
| concepts[1].level | 2 |
| concepts[1].score | 0.7313520908355713 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q6888030 |
| concepts[1].display_name | Modality (human–computer interaction) |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.6808149814605713 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C204321447 |
| concepts[3].level | 1 |
| concepts[3].score | 0.5263082981109619 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[3].display_name | Natural language processing |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.3771756589412689 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| concepts[5].id | https://openalex.org/C23123220 |
| concepts[5].level | 1 |
| concepts[5].score | 0.32786625623703003 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q816826 |
| concepts[5].display_name | Information retrieval |
| keywords[0].id | https://openalex.org/keywords/question-answering |
| keywords[0].score | 0.867228627204895 |
| keywords[0].display_name | Question answering |
| keywords[1].id | https://openalex.org/keywords/modality |
| keywords[1].score | 0.7313520908355713 |
| keywords[1].display_name | Modality (human–computer interaction) |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.6808149814605713 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/natural-language-processing |
| keywords[3].score | 0.5263082981109619 |
| keywords[3].display_name | Natural language processing |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.3771756589412689 |
| keywords[4].display_name | Artificial intelligence |
| keywords[5].id | https://openalex.org/keywords/information-retrieval |
| keywords[5].score | 0.32786625623703003 |
| keywords[5].display_name | Information retrieval |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2402.12728 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2402.12728 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2402.12728 |
| locations[1].id | doi:10.48550/arxiv.2402.12728 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2402.12728 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5038326486 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-2117-6083 |
| authorships[0].author.display_name | Junnan Dong |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Dong, Junnan |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5058417704 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-1536-6529 |
| authorships[1].author.display_name | Qinggang Zhang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Zhang, Qinggang |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5049663232 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-8301-8470 |
| authorships[2].author.display_name | Huachi Zhou |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Zhou, Huachi |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5058071176 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-6677-7504 |
| authorships[3].author.display_name | Daochen Zha |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Zha, Daochen |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5079101040 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-2329-8634 |
| authorships[4].author.display_name | Pai Zheng |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zheng, Pai |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5100857259 |
| authorships[5].author.orcid | https://orcid.org/0000-0003-0686-9113 |
| authorships[5].author.display_name | Xiao Huang |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Huang, Xiao |
| authorships[5].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2402.12728 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9975000023841858 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| related_works | https://openalex.org/W2384605597, https://openalex.org/W2387743295, https://openalex.org/W2115758952, https://openalex.org/W3082787378, https://openalex.org/W2136007095, https://openalex.org/W2366230879, https://openalex.org/W3208425359, https://openalex.org/W2349927912, https://openalex.org/W3159777597, https://openalex.org/W3204019825 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2402.12728 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2402.12728 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2402.12728 |
| primary_location.id | pmh:oai:arXiv.org:2402.12728 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2402.12728 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2402.12728 |
| publication_date | 2024-02-20 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index | word → positions mapping of the abstract (reproduced in full as the abstract above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile |
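The `abstract_inverted_index` field in the payload stores the abstract as a mapping from each word to the list of positions where it occurs, rather than as plain text. It can be inverted back into the readable abstract with a short helper; the tiny example index below uses the first four words of this work's actual abstract.

```python
def rebuild_abstract(inverted_index: dict) -> str:
    """Invert an OpenAlex abstract_inverted_index back into plain text.

    The index maps word -> list of 0-based positions; flattening to
    (position, word) pairs and sorting by position restores word order.
    """
    positions = [(pos, word)
                 for word, idxs in inverted_index.items()
                 for pos in idxs]
    return " ".join(word for _, word in sorted(positions))

# Tiny example with the index's actual shape (first words of this abstract):
idx = {"Knowledge-based": [0], "visual": [1], "question": [2], "answering": [3]}
print(rebuild_abstract(idx))  # Knowledge-based visual question answering
```

OpenAlex stores abstracts in this inverted form; applying the helper to the full index for this work reproduces the abstract shown at the top of the page.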