Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models
· 2025
· Open Access
· DOI: https://doi.org/10.1609/aaai.v39i4.32471
Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework to enhance LVLMs' ability in performing grounded compositional visual reasoning tasks. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation process that creates a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental results demonstrate that our PromViL framework significantly outperforms baselines on various visual grounding and compositional question answering tasks.
Related Topics
- Type: article
- Language: en
- Landing Page: https://doi.org/10.1609/aaai.v39i4.32471
- PDF: https://ojs.aaai.org/index.php/AAAI/article/download/32471/34626
- OA Status: diamond
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4409368351
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4409368351 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.1609/aaai.v39i4.32471 (Digital Object Identifier)
- Title: Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models (work title)
- Type: article (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-04-11 (full publication date if available)
- Authors: Quang-Hung Le, Long Hoang Dang, Ngan Le, Truyen Tran, Thao Minh Le (list of authors in order)
- Landing page: https://doi.org/10.1609/aaai.v39i4.32471 (publisher landing page)
- PDF URL: https://ojs.aaai.org/index.php/AAAI/article/download/32471/34626 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: diamond (open access status per OpenAlex)
- OA URL: https://ojs.aaai.org/index.php/AAAI/article/download/32471/34626 (direct OA link when available)
- Concepts: Computer science, Grounded theory, Natural language processing, Artificial intelligence, Linguistics, Cognitive science, Psychology, Sociology, Qualitative research, Philosophy, Anthropology (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4409368351 |
|---|---|
| doi | https://doi.org/10.1609/aaai.v39i4.32471 |
| ids.doi | https://doi.org/10.1609/aaai.v39i4.32471 |
| ids.openalex | https://openalex.org/W4409368351 |
| fwci | 0.0 |
| type | article |
| title | Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models |
| biblio.issue | 4 |
| biblio.volume | 39 |
| biblio.last_page | 4481 |
| biblio.first_page | 4473 |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9962999820709229 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T10181 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9693999886512756 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Natural Language Processing Techniques |
| topics[2].id | https://openalex.org/T10028 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9517999887466431 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Topic Modeling |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.4823407828807831 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C156325361 |
| concepts[1].level | 3 |
| concepts[1].score | 0.4460603892803192 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q1152864 |
| concepts[1].display_name | Grounded theory |
| concepts[2].id | https://openalex.org/C204321447 |
| concepts[2].level | 1 |
| concepts[2].score | 0.3976617455482483 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[2].display_name | Natural language processing |
| concepts[3].id | https://openalex.org/C154945302 |
| concepts[3].level | 1 |
| concepts[3].score | 0.3965815603733063 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[3].display_name | Artificial intelligence |
| concepts[4].id | https://openalex.org/C41895202 |
| concepts[4].level | 1 |
| concepts[4].score | 0.35190245509147644 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q8162 |
| concepts[4].display_name | Linguistics |
| concepts[5].id | https://openalex.org/C188147891 |
| concepts[5].level | 1 |
| concepts[5].score | 0.32808929681777954 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q147638 |
| concepts[5].display_name | Cognitive science |
| concepts[6].id | https://openalex.org/C15744967 |
| concepts[6].level | 0 |
| concepts[6].score | 0.295815646648407 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[6].display_name | Psychology |
| concepts[7].id | https://openalex.org/C144024400 |
| concepts[7].level | 0 |
| concepts[7].score | 0.15677541494369507 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q21201 |
| concepts[7].display_name | Sociology |
| concepts[8].id | https://openalex.org/C190248442 |
| concepts[8].level | 2 |
| concepts[8].score | 0.1000903844833374 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q839486 |
| concepts[8].display_name | Qualitative research |
| concepts[9].id | https://openalex.org/C138885662 |
| concepts[9].level | 0 |
| concepts[9].score | 0.09033221006393433 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q5891 |
| concepts[9].display_name | Philosophy |
| concepts[10].id | https://openalex.org/C19165224 |
| concepts[10].level | 1 |
| concepts[10].score | 0.06004035472869873 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q23404 |
| concepts[10].display_name | Anthropology |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.4823407828807831 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/grounded-theory |
| keywords[1].score | 0.4460603892803192 |
| keywords[1].display_name | Grounded theory |
| keywords[2].id | https://openalex.org/keywords/natural-language-processing |
| keywords[2].score | 0.3976617455482483 |
| keywords[2].display_name | Natural language processing |
| keywords[3].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[3].score | 0.3965815603733063 |
| keywords[3].display_name | Artificial intelligence |
| keywords[4].id | https://openalex.org/keywords/linguistics |
| keywords[4].score | 0.35190245509147644 |
| keywords[4].display_name | Linguistics |
| keywords[5].id | https://openalex.org/keywords/cognitive-science |
| keywords[5].score | 0.32808929681777954 |
| keywords[5].display_name | Cognitive science |
| keywords[6].id | https://openalex.org/keywords/psychology |
| keywords[6].score | 0.295815646648407 |
| keywords[6].display_name | Psychology |
| keywords[7].id | https://openalex.org/keywords/sociology |
| keywords[7].score | 0.15677541494369507 |
| keywords[7].display_name | Sociology |
| keywords[8].id | https://openalex.org/keywords/qualitative-research |
| keywords[8].score | 0.1000903844833374 |
| keywords[8].display_name | Qualitative research |
| keywords[9].id | https://openalex.org/keywords/philosophy |
| keywords[9].score | 0.09033221006393433 |
| keywords[9].display_name | Philosophy |
| keywords[10].id | https://openalex.org/keywords/anthropology |
| keywords[10].score | 0.06004035472869873 |
| keywords[10].display_name | Anthropology |
| language | en |
| locations[0].id | doi:10.1609/aaai.v39i4.32471 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4210191458 |
| locations[0].source.issn | 2159-5399, 2374-3468 |
| locations[0].source.type | conference |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | 2159-5399 |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | Proceedings of the AAAI Conference on Artificial Intelligence |
| locations[0].source.host_organization | https://openalex.org/P4310320058 |
| locations[0].source.host_organization_name | Association for the Advancement of Artificial Intelligence |
| locations[0].source.host_organization_lineage | https://openalex.org/P4310320058 |
| locations[0].source.host_organization_lineage_names | Association for the Advancement of Artificial Intelligence |
| locations[0].license | |
| locations[0].pdf_url | https://ojs.aaai.org/index.php/AAAI/article/download/32471/34626 |
| locations[0].version | publishedVersion |
| locations[0].raw_type | journal-article |
| locations[0].license_id | |
| locations[0].is_accepted | True |
| locations[0].is_published | True |
| locations[0].raw_source_name | Proceedings of the AAAI Conference on Artificial Intelligence |
| locations[0].landing_page_url | https://doi.org/10.1609/aaai.v39i4.32471 |
| indexed_in | crossref |
| authorships[0].author.id | https://openalex.org/A5073633711 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-4727-6859 |
| authorships[0].author.display_name | Quang-Hung Le |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Quang-Hung Le |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5008734340 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-2701-2365 |
| authorships[1].author.display_name | Long Hoang Dang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Long Hoang Dang |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5101154137 |
| authorships[2].author.orcid | https://orcid.org/0009-0006-9132-2479 |
| authorships[2].author.display_name | Ngan Le |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Ngan Hoang Le |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5085471517 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-6531-8907 |
| authorships[3].author.display_name | Truyen Tran |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Truyen Tran |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5079045166 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-8089-9962 |
| authorships[4].author.display_name | Thao Minh Le |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Thao Minh Le |
| authorships[4].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://ojs.aaai.org/index.php/AAAI/article/download/32471/34626 |
| open_access.oa_status | diamond |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T03:46:38.306776 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9962999820709229 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| related_works | https://openalex.org/W1546533203, https://openalex.org/W2054080977, https://openalex.org/W1511554945, https://openalex.org/W2015439768, https://openalex.org/W4361008414, https://openalex.org/W4396854307, https://openalex.org/W3040823075, https://openalex.org/W2092147963, https://openalex.org/W2620765995, https://openalex.org/W3204019825 |
| cited_by_count | 0 |
| locations_count | 1 |
| best_oa_location.id | doi:10.1609/aaai.v39i4.32471 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4210191458 |
| best_oa_location.source.issn | 2159-5399, 2374-3468 |
| best_oa_location.source.type | conference |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | 2159-5399 |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | Proceedings of the AAAI Conference on Artificial Intelligence |
| best_oa_location.source.host_organization | https://openalex.org/P4310320058 |
| best_oa_location.source.host_organization_name | Association for the Advancement of Artificial Intelligence |
| best_oa_location.source.host_organization_lineage | https://openalex.org/P4310320058 |
| best_oa_location.source.host_organization_lineage_names | Association for the Advancement of Artificial Intelligence |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://ojs.aaai.org/index.php/AAAI/article/download/32471/34626 |
| best_oa_location.version | publishedVersion |
| best_oa_location.raw_type | journal-article |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | True |
| best_oa_location.is_published | True |
| best_oa_location.raw_source_name | Proceedings of the AAAI Conference on Artificial Intelligence |
| best_oa_location.landing_page_url | https://doi.org/10.1609/aaai.v39i4.32471 |
| primary_location.id | doi:10.1609/aaai.v39i4.32471 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4210191458 |
| primary_location.source.issn | 2159-5399, 2374-3468 |
| primary_location.source.type | conference |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | 2159-5399 |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | Proceedings of the AAAI Conference on Artificial Intelligence |
| primary_location.source.host_organization | https://openalex.org/P4310320058 |
| primary_location.source.host_organization_name | Association for the Advancement of Artificial Intelligence |
| primary_location.source.host_organization_lineage | https://openalex.org/P4310320058 |
| primary_location.source.host_organization_lineage_names | Association for the Advancement of Artificial Intelligence |
| primary_location.license | |
| primary_location.pdf_url | https://ojs.aaai.org/index.php/AAAI/article/download/32471/34626 |
| primary_location.version | publishedVersion |
| primary_location.raw_type | journal-article |
| primary_location.license_id | |
| primary_location.is_accepted | True |
| primary_location.is_published | True |
| primary_location.raw_source_name | Proceedings of the AAAI Conference on Artificial Intelligence |
| primary_location.landing_page_url | https://doi.org/10.1609/aaai.v39i4.32471 |
| publication_date | 2025-04-11 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 30, 47, 89, 95, 103 |
| abstract_inverted_index.By | 59 |
| abstract_inverted_index.To | 82 |
| abstract_inverted_index.at | 6 |
| abstract_inverted_index.in | 37 |
| abstract_inverted_index.of | 50, 106 |
| abstract_inverted_index.on | 121 |
| abstract_inverted_index.to | 33, 56, 71, 78 |
| abstract_inverted_index.we | 87 |
| abstract_inverted_index.Our | 44 |
| abstract_inverted_index.and | 17, 125 |
| abstract_inverted_index.but | 12 |
| abstract_inverted_index.our | 68, 115 |
| abstract_inverted_index.This | 22 |
| abstract_inverted_index.data | 90 |
| abstract_inverted_index.from | 54, 75, 99 |
| abstract_inverted_index.that | 93, 114 |
| abstract_inverted_index.this | 84 |
| abstract_inverted_index.wide | 104 |
| abstract_inverted_index.with | 14, 64 |
| abstract_inverted_index.Large | 1 |
| abstract_inverted_index.excel | 5 |
| abstract_inverted_index.lower | 76 |
| abstract_inverted_index.model | 69 |
| abstract_inverted_index.novel | 31, 96 |
| abstract_inverted_index.paper | 23 |
| abstract_inverted_index.range | 105 |
| abstract_inverted_index.LVLMs' | 35 |
| abstract_inverted_index.Models | 3 |
| abstract_inverted_index.Visual | 100 |
| abstract_inverted_index.across | 9 |
| abstract_inverted_index.inform | 79 |
| abstract_inverted_index.inputs | 11 |
| abstract_inverted_index.learns | 70 |
| abstract_inverted_index.levels | 77 |
| abstract_inverted_index.nested | 107 |
| abstract_inverted_index.pairs. | 110 |
| abstract_inverted_index.simple | 55 |
| abstract_inverted_index.tasks. | 43, 129 |
| abstract_inverted_index.visual | 41, 66, 123 |
| abstract_inverted_index.(LVLMs) | 4 |
| abstract_inverted_index.Genome, | 101 |
| abstract_inverted_index.PromViL | 116 |
| abstract_inverted_index.ability | 36 |
| abstract_inverted_index.between | 20 |
| abstract_inverted_index.complex | 57 |
| abstract_inverted_index.creates | 94 |
| abstract_inverted_index.dataset | 97 |
| abstract_inverted_index.derived | 98 |
| abstract_inverted_index.enhance | 34 |
| abstract_inverted_index.process | 92 |
| abstract_inverted_index.ranging | 53 |
| abstract_inverted_index.results | 112 |
| abstract_inverted_index.textual | 62 |
| abstract_inverted_index.various | 122 |
| abstract_inverted_index.Existing | 0 |
| abstract_inverted_index.aligning | 61 |
| abstract_inverted_index.approach | 45 |
| abstract_inverted_index.concepts | 8, 16 |
| abstract_inverted_index.grounded | 39 |
| abstract_inverted_index.learning | 85 |
| abstract_inverted_index.leverage | 72 |
| abstract_inverted_index.matching | 7 |
| abstract_inverted_index.process, | 86 |
| abstract_inverted_index.question | 127 |
| abstract_inverted_index.regions, | 67 |
| abstract_inverted_index.struggle | 13 |
| abstract_inverted_index.answering | 128 |
| abstract_inverted_index.baselines | 120 |
| abstract_inverted_index.concepts. | 58 |
| abstract_inverted_index.entities. | 21 |
| abstract_inverted_index.framework | 32, 117 |
| abstract_inverted_index.grounding | 124 |
| abstract_inverted_index.introduce | 88 |
| abstract_inverted_index.providing | 102 |
| abstract_inverted_index.reasoning | 42 |
| abstract_inverted_index.structure | 49 |
| abstract_inverted_index.(PromViL), | 29 |
| abstract_inverted_index.alignments | 28 |
| abstract_inverted_index.constructs | 46 |
| abstract_inverted_index.contextual | 73 |
| abstract_inverted_index.facilitate | 83 |
| abstract_inverted_index.generation | 91 |
| abstract_inverted_index.high-level | 18 |
| abstract_inverted_index.introduces | 24 |
| abstract_inverted_index.performing | 38 |
| abstract_inverted_index.reasoning. | 81 |
| abstract_inverted_index.Progressive | 25 |
| abstract_inverted_index.alignments, | 52 |
| abstract_inverted_index.demonstrate | 113 |
| abstract_inverted_index.information | 74 |
| abstract_inverted_index.multi-modal | 10, 51 |
| abstract_inverted_index.outperforms | 119 |
| abstract_inverted_index.Experimental | 111 |
| abstract_inverted_index.descriptions | 63 |
| abstract_inverted_index.hierarchical | 48 |
| abstract_inverted_index.higher-level | 80 |
| abstract_inverted_index.compositional | 15, 40, 108, 126 |
| abstract_inverted_index.corresponding | 65 |
| abstract_inverted_index.progressively | 60 |
| abstract_inverted_index.relationships | 19 |
| abstract_inverted_index.significantly | 118 |
| abstract_inverted_index.multi-granular | 26 |
| abstract_inverted_index.Vision-Language | 2, 27 |
| abstract_inverted_index.vision-language | 109 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| citation_normalized_percentile.value | 0.3817952 |
| citation_normalized_percentile.is_in_top_1_percent | False |
| citation_normalized_percentile.is_in_top_10_percent | False |
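
The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each token to the zero-based positions where it occurs. The plain-text abstract can be recovered by sorting tokens by position. The following is a minimal sketch of that idea, not an official OpenAlex client. The function name `rebuild_abstract` and the `sample` dictionary are illustrative assumptions, and `sample` holds only the first nine positions copied from the table.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild running text from a token -> positions mapping.

    Hypothetical helper for illustration; it assumes every position
    is a non-negative integer and that no two tokens share a position.
    """
    # Expand the mapping into (position, token) pairs.
    positioned = [
        (position, token)
        for token, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order.
    positioned.sort()
    return " ".join(token for _, token in positioned)


# Truncated sample: positions 0-8 of the index listed in the payload.
sample = {
    "Existing": [0],
    "Large": [1],
    "Vision-Language": [2],
    "Models": [3],
    "(LVLMs)": [4],
    "excel": [5],
    "at": [6],
    "matching": [7],
    "concepts": [8],
}

if __name__ == "__main__":
    print(rebuild_abstract(sample))
```

A token that occurs more than once, such as `compositional` at positions 15, 40, 108, and 126, simply contributes one pair per position, so repeated words are restored in every place they appeared.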