NavBench: Probing Multimodal Large Language Models for Embodied Navigation
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2506.01031
Multimodal Large Language Models (MLLMs) have demonstrated strong generalization in vision-language tasks, yet their ability to understand and act within embodied environments remains underexplored. We present NavBench, a benchmark to evaluate the embodied navigation capabilities of MLLMs under zero-shot settings. NavBench consists of two components: (1) navigation comprehension, assessed through three cognitively grounded tasks including global instruction alignment, temporal progress estimation, and local observation-action reasoning, covering 3,200 question-answer pairs; and (2) step-by-step execution in 432 episodes across 72 indoor scenes, stratified by spatial, cognitive, and execution complexity. To support real-world deployment, we introduce a pipeline that converts MLLMs' outputs into robotic actions. We evaluate both proprietary and open-source models, finding that GPT-4o performs well across tasks, while lighter open-source models succeed in simpler cases. Results also show that models with higher comprehension scores tend to achieve better execution performance. Providing map-based context improves decision accuracy, especially in medium-difficulty scenarios. However, most models struggle with temporal understanding, particularly in estimating progress during navigation, which may pose a key challenge.
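The abstract names an output-to-action conversion pipeline but does not specify it here, so the following is a minimal sketch of the general idea under assumed conventions: the MLLM is prompted to reply with a discrete navigation action, and its free-form text is mapped onto that action space. Every name in the sketch (NavAction, parse_action, the keyword patterns) is hypothetical, not the paper's API.

```python
import re
from enum import Enum

class NavAction(Enum):
    FORWARD = "forward"  # move one step ahead
    LEFT = "left"        # rotate left
    RIGHT = "right"      # rotate right
    STOP = "stop"        # declare the episode finished

# Keyword patterns checked in order; first match wins.
_PATTERNS = [
    (NavAction.FORWARD, r"\b(forward|ahead|straight)\b"),
    (NavAction.LEFT,    r"\bleft\b"),
    (NavAction.RIGHT,   r"\bright\b"),
    (NavAction.STOP,    r"\b(stop|done|arrived)\b"),
]

def parse_action(reply: str) -> NavAction:
    """Map a free-form MLLM reply onto a discrete action a robot can execute."""
    lowered = reply.lower()
    for action, pattern in _PATTERNS:
        if re.search(pattern, lowered):
            return action
    return NavAction.STOP  # conservative fallback for ambiguous replies

print(parse_action("I should turn left toward the kitchen."))  # NavAction.LEFT
```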
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4414894481 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2506.01031 (Digital Object Identifier)
- Title: NavBench: Probing Multimodal Large Language Models for Embodied Navigation (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-06-01 (full publication date if available)
- Authors: Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, Qi Wu (list of authors in order)
- Landing page: https://arxiv.org/abs/2506.01031 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2506.01031 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2506.01031 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
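The fields above mirror the raw OpenAlex record, so the same data can be pulled live from the OpenAlex REST API. A minimal sketch using the requests library (the mailto address is a placeholder that identifies you to OpenAlex's polite pool):

```python
import requests

# Fetch this work's record from the OpenAlex API.
resp = requests.get(
    "https://api.openalex.org/works/W4414894481",
    params={"mailto": "you@example.com"},  # placeholder email for the polite pool
    timeout=30,
)
resp.raise_for_status()
work = resp.json()

print(work["display_name"])           # NavBench: Probing Multimodal Large Language Models ...
print(work["open_access"]["oa_url"])  # https://arxiv.org/pdf/2506.01031
print(work["cited_by_count"])         # 0 at the time this page was generated
```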
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4414894481 |
| doi | https://doi.org/10.48550/arxiv.2506.01031 |
| ids.doi | https://doi.org/10.48550/arxiv.2506.01031 |
| ids.openalex | https://openalex.org/W4414894481 |
| fwci | |
| type | preprint |
| title | NavBench: Probing Multimodal Large Language Models for Embodied Navigation |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9926000237464905 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Natural Language Processing Techniques |
| topics[1].id | https://openalex.org/T12031 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9605000019073486 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Speech and dialogue systems |
| topics[2].id | https://openalex.org/T10028 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9575999975204468 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Topic Modeling |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2506.01031 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2506.01031 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2506.01031 |
| locations[1].id | doi:10.48550/arxiv.2506.01031 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2506.01031 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5049868894 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-5606-0702 |
| authorships[0].author.display_name | Yanyuan Qiao |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Qiao, Yanyuan |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5064769616 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-2044-6635 |
| authorships[1].author.display_name | Haodong Hong |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Hong, Haodong |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5101458465 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Wenqi Lyu |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Lyu, Wenqi |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5000446896 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-5133-6562 |
| authorships[3].author.display_name | Dong An |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | An, Dong |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5100432246 |
| authorships[4].author.orcid | https://orcid.org/0000-0001-5135-4886 |
| authorships[4].author.display_name | Siqi Zhang |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zhang, Siqi |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5011835422 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-6644-1250 |
| authorships[5].author.display_name | Yutong Xie |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Xie, Yutong |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5101712695 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-5249-0326 |
| authorships[6].author.display_name | Xinyu Wang |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Wang, Xinyu |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5060958969 |
| authorships[7].author.orcid | https://orcid.org/0000-0003-3631-256X |
| authorships[7].author.display_name | Qi Wu |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Wu, Qi |
| authorships[7].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2506.01031 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | NavBench: Probing Multimodal Large Language Models for Embodied Navigation |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9926000237464905 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Natural Language Processing Techniques |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2506.01031 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2506.01031 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2506.01031 |
| primary_location.id | pmh:oai:arXiv.org:2506.01031 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2506.01031 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2506.01031 |
| publication_date | 2025-06-01 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 27, 93, 165 |
| abstract_inverted_index.72 | 77 |
| abstract_inverted_index.To | 87 |
| abstract_inverted_index.We | 24, 102 |
| abstract_inverted_index.by | 81 |
| abstract_inverted_index.in | 9, 73, 121, 146, 157 |
| abstract_inverted_index.of | 35, 42 |
| abstract_inverted_index.to | 15, 29, 134 |
| abstract_inverted_index.we | 91 |
| abstract_inverted_index.(1) | 45 |
| abstract_inverted_index.(2) | 70 |
| abstract_inverted_index.432 | 74 |
| abstract_inverted_index.act | 18 |
| abstract_inverted_index.and | 17, 61, 69, 84, 106 |
| abstract_inverted_index.key | 166 |
| abstract_inverted_index.may | 163 |
| abstract_inverted_index.the | 31 |
| abstract_inverted_index.two | 43 |
| abstract_inverted_index.yet | 12 |
| abstract_inverted_index.also | 125 |
| abstract_inverted_index.both | 104 |
| abstract_inverted_index.have | 5 |
| abstract_inverted_index.into | 99 |
| abstract_inverted_index.most | 150 |
| abstract_inverted_index.pose | 164 |
| abstract_inverted_index.show | 126 |
| abstract_inverted_index.tend | 133 |
| abstract_inverted_index.that | 95, 110, 127 |
| abstract_inverted_index.well | 113 |
| abstract_inverted_index.with | 129, 153 |
| abstract_inverted_index.3,200 | 66 |
| abstract_inverted_index.Large | 1 |
| abstract_inverted_index.MLLMs | 36 |
| abstract_inverted_index.local | 62 |
| abstract_inverted_index.tasks | 53 |
| abstract_inverted_index.their | 13 |
| abstract_inverted_index.three | 50 |
| abstract_inverted_index.under | 37 |
| abstract_inverted_index.which | 162 |
| abstract_inverted_index.while | 116 |
| abstract_inverted_index.GPT-4o | 111 |
| abstract_inverted_index.MLLMs' | 97 |
| abstract_inverted_index.Models | 3 |
| abstract_inverted_index.across | 76, 114 |
| abstract_inverted_index.better | 136 |
| abstract_inverted_index.cases. | 123 |
| abstract_inverted_index.during | 160 |
| abstract_inverted_index.global | 55 |
| abstract_inverted_index.higher | 130 |
| abstract_inverted_index.indoor | 78 |
| abstract_inverted_index.models | 119, 128, 151 |
| abstract_inverted_index.pairs; | 68 |
| abstract_inverted_index.scores | 132 |
| abstract_inverted_index.strong | 7 |
| abstract_inverted_index.tasks, | 11, 115 |
| abstract_inverted_index.within | 19 |
| abstract_inverted_index.(MLLMs) | 4 |
| abstract_inverted_index.Results | 124 |
| abstract_inverted_index.ability | 14 |
| abstract_inverted_index.achieve | 135 |
| abstract_inverted_index.context | 141 |
| abstract_inverted_index.finding | 109 |
| abstract_inverted_index.lighter | 117 |
| abstract_inverted_index.models, | 108 |
| abstract_inverted_index.outputs | 98 |
| abstract_inverted_index.present | 25 |
| abstract_inverted_index.remains | 22 |
| abstract_inverted_index.robotic | 100 |
| abstract_inverted_index.scenes, | 79 |
| abstract_inverted_index.simpler | 122 |
| abstract_inverted_index.succeed | 120 |
| abstract_inverted_index.support | 88 |
| abstract_inverted_index.through | 49 |
| abstract_inverted_index.However, | 149 |
| abstract_inverted_index.Language | 2 |
| abstract_inverted_index.NavBench | 40 |
| abstract_inverted_index.actions. | 101 |
| abstract_inverted_index.assessed | 48 |
| abstract_inverted_index.consists | 41 |
| abstract_inverted_index.converts | 96 |
| abstract_inverted_index.covering | 65 |
| abstract_inverted_index.decision | 143 |
| abstract_inverted_index.embodied | 20, 32 |
| abstract_inverted_index.episodes | 75 |
| abstract_inverted_index.evaluate | 30, 103 |
| abstract_inverted_index.grounded | 52 |
| abstract_inverted_index.improves | 142 |
| abstract_inverted_index.performs | 112 |
| abstract_inverted_index.pipeline | 94 |
| abstract_inverted_index.progress | 59, 159 |
| abstract_inverted_index.spatial, | 82 |
| abstract_inverted_index.struggle | 152 |
| abstract_inverted_index.temporal | 58, 154 |
| abstract_inverted_index.NavBench, | 26 |
| abstract_inverted_index.Providing | 139 |
| abstract_inverted_index.accuracy, | 144 |
| abstract_inverted_index.benchmark | 28 |
| abstract_inverted_index.execution | 72, 85, 137 |
| abstract_inverted_index.including | 54 |
| abstract_inverted_index.introduce | 92 |
| abstract_inverted_index.map-based | 140 |
| abstract_inverted_index.settings. | 39 |
| abstract_inverted_index.zero-shot | 38 |
| abstract_inverted_index.Multimodal | 0 |
| abstract_inverted_index.alignment, | 57 |
| abstract_inverted_index.challenge. | 167 |
| abstract_inverted_index.cognitive, | 83 |
| abstract_inverted_index.especially | 145 |
| abstract_inverted_index.estimating | 158 |
| abstract_inverted_index.navigation | 33, 46 |
| abstract_inverted_index.real-world | 89 |
| abstract_inverted_index.reasoning, | 64 |
| abstract_inverted_index.scenarios. | 148 |
| abstract_inverted_index.stratified | 80 |
| abstract_inverted_index.understand | 16 |
| abstract_inverted_index.cognitively | 51 |
| abstract_inverted_index.complexity. | 86 |
| abstract_inverted_index.components: | 44 |
| abstract_inverted_index.deployment, | 90 |
| abstract_inverted_index.estimation, | 60 |
| abstract_inverted_index.instruction | 56 |
| abstract_inverted_index.navigation, | 161 |
| abstract_inverted_index.open-source | 107, 118 |
| abstract_inverted_index.proprietary | 105 |
| abstract_inverted_index.capabilities | 34 |
| abstract_inverted_index.demonstrated | 6 |
| abstract_inverted_index.environments | 21 |
| abstract_inverted_index.particularly | 156 |
| abstract_inverted_index.performance. | 138 |
| abstract_inverted_index.step-by-step | 71 |
| abstract_inverted_index.comprehension | 131 |
| abstract_inverted_index.comprehension, | 47 |
| abstract_inverted_index.generalization | 8 |
| abstract_inverted_index.underexplored. | 23 |
| abstract_inverted_index.understanding, | 155 |
| abstract_inverted_index.question-answer | 67 |
| abstract_inverted_index.vision-language | 10 |
| abstract_inverted_index.medium-difficulty | 147 |
| abstract_inverted_index.observation-action | 63 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| citation_normalized_percentile | |
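The long run of abstract_inverted_index rows above is how OpenAlex stores abstracts: each token maps to the list of word positions where it occurs. A minimal sketch that rebuilds plain text from that structure (reconstruct_abstract is our name for the helper, not an OpenAlex API call):

```python
def reconstruct_abstract(inverted_index: dict) -> str:
    """Rebuild an abstract from an OpenAlex abstract_inverted_index."""
    # Invert the mapping: word position -> token.
    positions = {}
    for token, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = token
    # Emit tokens in positional order.
    return " ".join(token for _, token in sorted(positions.items()))

# Tiny excerpt of the index above, just to show the shape.
sample = {"Multimodal": [0], "Large": [1], "Language": [2], "Models": [3], "(MLLMs)": [4]}
print(reconstruct_abstract(sample))  # Multimodal Large Language Models (MLLMs)
```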