From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2409.05413
Robots are increasingly envisioned to interact in real-world scenarios, where they must continuously adapt to new situations. To detect and grasp novel objects, zero-shot pose estimators determine poses without prior knowledge. Recently, vision language models (VLMs) have shown considerable advances in robotics applications by establishing an understanding between language input and image input. In our work, we take advantage of VLMs' zero-shot capabilities and translate this ability to 6D object pose estimation. We propose a novel framework for promptable zero-shot 6D object pose estimation using language embeddings. The idea is to derive a coarse location of an object based on the relevancy map of a language-embedded NeRF reconstruction and to compute the pose estimate with a point cloud registration method. Additionally, we provide an analysis of LERF's suitability for open-set object pose estimation. We examine hyperparameters, such as activation thresholds for relevancy maps, and investigate the zero-shot capabilities at the instance and category level. Furthermore, we plan to conduct robotic grasping experiments in a real-world setting.
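The two-stage idea in the abstract (threshold a relevancy map to get a coarse object location, then register a point cloud to obtain the pose) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the per-point `relevancy` scores are synthetic stand-ins for a language-embedded NeRF query, `coarse_crop` and `kabsch` are hypothetical helper names, and the registration step is replaced by a least-squares rigid alignment (Kabsch) that assumes point correspondences are already known, which a real registration method would have to estimate.

```python
import numpy as np


def coarse_crop(points, relevancy, threshold):
    """Keep points whose relevancy score reaches the activation threshold.

    Returns the kept points and their centroid (the coarse object location).
    """
    mask = relevancy >= threshold
    if not mask.any():
        raise ValueError("no point reaches the relevancy threshold")
    kept = points[mask]
    return kept, kept.mean(axis=0)


def kabsch(source, target):
    """Least-squares rigid transform (R, t) such that target ~ source @ R.T + t.

    Assumption: row i of `source` corresponds to row i of `target`.
    """
    src_c, tgt_c = source.mean(axis=0), target.mean(axis=0)
    H = (source - src_c).T @ (target - tgt_c)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t


# Synthetic scene: a posed copy of the object model plus background clutter.
rng = np.random.default_rng(0)
model = rng.uniform(-0.05, 0.05, size=(40, 3))
angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.4, -0.2, 0.1])
object_pts = model @ R_true.T + t_true
clutter = rng.uniform(-1.0, 1.0, size=(200, 3))
scene = np.vstack([object_pts, clutter])

# Stand-in relevancy map: high on the object, low on the clutter.
relevancy = np.concatenate([np.full(40, 0.9), np.full(200, 0.1)])

crop, centre = coarse_crop(scene, relevancy, threshold=0.5)
R_est, t_est = kabsch(model, crop)
```

In this toy setup the threshold cleanly separates object from clutter, so the alignment recovers the pose used to build the scene. With real relevancy maps the crop would contain outliers and miss parts of the object, which is where the choice of activation threshold examined in the abstract starts to matter.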
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2409.05413, https://arxiv.org/pdf/2409.05413
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4403622429
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4403622429 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2409.05413 (Digital Object Identifier)
- Title: From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-09-09 (full publication date if available)
- Authors: Tomi Pulli, Stefan Thalhammer, Simon Schwaiger, Markus Vincze (list of authors in order)
- Landing page: https://arxiv.org/abs/2409.05413 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2409.05413 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2409.05413 (direct OA link when available)
- Concepts: Pose, Object (grammar), Artificial intelligence, Computer science, Computer vision, Estimation, Natural language processing, Engineering, Systems engineering (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4403622429 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2409.05413 |
| ids.doi | https://doi.org/10.48550/arxiv.2409.05413 |
| ids.openalex | https://openalex.org/W4403622429 |
| fwci | |
| type | preprint |
| title | From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9951000213623047 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T11398 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9907000064849854 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1709 |
| topics[1].subfield.display_name | Human-Computer Interaction |
| topics[1].display_name | Hand Gesture Recognition Systems |
| topics[2].id | https://openalex.org/T10627 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9769999980926514 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Advanced Image and Video Retrieval Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C52102323 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7401180267333984 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1671968 |
| concepts[0].display_name | Pose |
| concepts[1].id | https://openalex.org/C2781238097 |
| concepts[1].level | 2 |
| concepts[1].score | 0.633973240852356 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q175026 |
| concepts[1].display_name | Object (grammar) |
| concepts[2].id | https://openalex.org/C154945302 |
| concepts[2].level | 1 |
| concepts[2].score | 0.6330766677856445 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[2].display_name | Artificial intelligence |
| concepts[3].id | https://openalex.org/C41008148 |
| concepts[3].level | 0 |
| concepts[3].score | 0.5972266793251038 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[3].display_name | Computer science |
| concepts[4].id | https://openalex.org/C31972630 |
| concepts[4].level | 1 |
| concepts[4].score | 0.5390258431434631 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q844240 |
| concepts[4].display_name | Computer vision |
| concepts[5].id | https://openalex.org/C96250715 |
| concepts[5].level | 2 |
| concepts[5].score | 0.4905328154563904 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q965330 |
| concepts[5].display_name | Estimation |
| concepts[6].id | https://openalex.org/C204321447 |
| concepts[6].level | 1 |
| concepts[6].score | 0.4667432904243469 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[6].display_name | Natural language processing |
| concepts[7].id | https://openalex.org/C127413603 |
| concepts[7].level | 0 |
| concepts[7].score | 0.10871249437332153 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q11023 |
| concepts[7].display_name | Engineering |
| concepts[8].id | https://openalex.org/C201995342 |
| concepts[8].level | 1 |
| concepts[8].score | 0.0 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q682496 |
| concepts[8].display_name | Systems engineering |
| keywords[0].id | https://openalex.org/keywords/pose |
| keywords[0].score | 0.7401180267333984 |
| keywords[0].display_name | Pose |
| keywords[1].id | https://openalex.org/keywords/object |
| keywords[1].score | 0.633973240852356 |
| keywords[1].display_name | Object (grammar) |
| keywords[2].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[2].score | 0.6330766677856445 |
| keywords[2].display_name | Artificial intelligence |
| keywords[3].id | https://openalex.org/keywords/computer-science |
| keywords[3].score | 0.5972266793251038 |
| keywords[3].display_name | Computer science |
| keywords[4].id | https://openalex.org/keywords/computer-vision |
| keywords[4].score | 0.5390258431434631 |
| keywords[4].display_name | Computer vision |
| keywords[5].id | https://openalex.org/keywords/estimation |
| keywords[5].score | 0.4905328154563904 |
| keywords[5].display_name | Estimation |
| keywords[6].id | https://openalex.org/keywords/natural-language-processing |
| keywords[6].score | 0.4667432904243469 |
| keywords[6].display_name | Natural language processing |
| keywords[7].id | https://openalex.org/keywords/engineering |
| keywords[7].score | 0.10871249437332153 |
| keywords[7].display_name | Engineering |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2409.05413 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2409.05413 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2409.05413 |
| locations[1].id | doi:10.48550/arxiv.2409.05413 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2409.05413 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5081917294 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-1416-2257 |
| authorships[0].author.display_name | Tomi Pulli |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Pulli, Tessa |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5074166804 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-0008-430X |
| authorships[1].author.display_name | Stefan Thalhammer |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Thalhammer, Stefan |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5057783741 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Simon Schwaiger |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Schwaiger, Simon |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5013565399 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-2799-491X |
| authorships[3].author.display_name | Markus Vincze |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Vincze, Markus |
| authorships[3].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2409.05413 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-10-22T00:00:00 |
| display_name | From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9951000213623047 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| related_works | https://openalex.org/W2123263858, https://openalex.org/W3127959533, https://openalex.org/W4387967917, https://openalex.org/W4387968151, https://openalex.org/W4386925306, https://openalex.org/W3132124459, https://openalex.org/W2946083937, https://openalex.org/W2736638679, https://openalex.org/W4313046826, https://openalex.org/W1968716783 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2409.05413 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2409.05413 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2409.05413 |
| primary_location.id | pmh:oai:arXiv.org:2409.05413 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2409.05413 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2409.05413 |
| publication_date | 2024-09-09 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 74, 92, 104, 115, 162 |
| abstract_inverted_index.6D | 68, 80 |
| abstract_inverted_index.In | 53 |
| abstract_inverted_index.To | 17 |
| abstract_inverted_index.We | 72, 133 |
| abstract_inverted_index.an | 45, 96, 123, 149 |
| abstract_inverted_index.as | 137 |
| abstract_inverted_index.by | 43 |
| abstract_inverted_index.in | 6, 40, 161 |
| abstract_inverted_index.is | 89 |
| abstract_inverted_index.of | 59, 95, 103, 125 |
| abstract_inverted_index.on | 99, 148 |
| abstract_inverted_index.to | 4, 14, 67, 90, 109, 156 |
| abstract_inverted_index.we | 56, 121, 154 |
| abstract_inverted_index.The | 87 |
| abstract_inverted_index.and | 19, 50, 63, 108, 143, 151 |
| abstract_inverted_index.are | 1 |
| abstract_inverted_index.for | 77, 128, 140 |
| abstract_inverted_index.map | 102 |
| abstract_inverted_index.new | 15 |
| abstract_inverted_index.our | 54 |
| abstract_inverted_index.the | 100, 111, 145 |
| abstract_inverted_index.NeRF | 106 |
| abstract_inverted_index.VLMs | 60 |
| abstract_inverted_index.have | 36 |
| abstract_inverted_index.idea | 88 |
| abstract_inverted_index.maps | 142 |
| abstract_inverted_index.must | 11 |
| abstract_inverted_index.plan | 155 |
| abstract_inverted_index.pose | 24, 70, 82, 112, 131 |
| abstract_inverted_index.such | 136 |
| abstract_inverted_index.take | 57 |
| abstract_inverted_index.they | 10 |
| abstract_inverted_index.this | 65 |
| abstract_inverted_index.with | 114 |
| abstract_inverted_index.adapt | 13 |
| abstract_inverted_index.based | 98 |
| abstract_inverted_index.cloud | 117 |
| abstract_inverted_index.grasp | 20 |
| abstract_inverted_index.image | 51 |
| abstract_inverted_index.input | 49 |
| abstract_inverted_index.novel | 21, 75 |
| abstract_inverted_index.point | 116 |
| abstract_inverted_index.poses | 27 |
| abstract_inverted_index.prior | 29 |
| abstract_inverted_index.shown | 37 |
| abstract_inverted_index.using | 84 |
| abstract_inverted_index.where | 9 |
| abstract_inverted_index.work, | 55 |
| abstract_inverted_index.(VLMs) | 35 |
| abstract_inverted_index.LERF's | 126 |
| abstract_inverted_index.Robots | 0 |
| abstract_inverted_index.coarse | 93 |
| abstract_inverted_index.derive | 91 |
| abstract_inverted_index.detect | 18 |
| abstract_inverted_index.input. | 52 |
| abstract_inverted_index.models | 34 |
| abstract_inverted_index.object | 69, 81, 97, 130 |
| abstract_inverted_index.vision | 32 |
| abstract_inverted_index.ability | 66 |
| abstract_inverted_index.between | 47 |
| abstract_inverted_index.compute | 110 |
| abstract_inverted_index.conduct | 157 |
| abstract_inverted_index.examine | 134 |
| abstract_inverted_index.method. | 119 |
| abstract_inverted_index.propose | 73 |
| abstract_inverted_index.provide | 122 |
| abstract_inverted_index.robotic | 158 |
| abstract_inverted_index.without | 28 |
| abstract_inverted_index.advances | 39 |
| abstract_inverted_index.analysis | 124 |
| abstract_inverted_index.estimate | 113 |
| abstract_inverted_index.grasping | 159 |
| abstract_inverted_index.interact | 5 |
| abstract_inverted_index.language | 33, 48, 85 |
| abstract_inverted_index.location | 94 |
| abstract_inverted_index.objects, | 22 |
| abstract_inverted_index.open-set | 129 |
| abstract_inverted_index.robotics | 41 |
| abstract_inverted_index.setting. | 164 |
| abstract_inverted_index.Recently, | 31 |
| abstract_inverted_index.advantage | 58 |
| abstract_inverted_index.determine | 26 |
| abstract_inverted_index.framework | 76 |
| abstract_inverted_index.instance- | 150 |
| abstract_inverted_index.relevancy | 101, 141 |
| abstract_inverted_index.translate | 64 |
| abstract_inverted_index.zero-shot | 23, 61, 79, 146 |
| abstract_inverted_index.activation | 138 |
| abstract_inverted_index.envisioned | 3 |
| abstract_inverted_index.estimation | 83 |
| abstract_inverted_index.estimators | 25 |
| abstract_inverted_index.knowledge. | 30 |
| abstract_inverted_index.promptable | 78 |
| abstract_inverted_index.real-world | 7, 163 |
| abstract_inverted_index.scenarios, | 8 |
| abstract_inverted_index.thresholds | 139 |
| abstract_inverted_index.embeddings. | 86 |
| abstract_inverted_index.estimation. | 71, 132 |
| abstract_inverted_index.experiments | 160 |
| abstract_inverted_index.investigate | 144 |
| abstract_inverted_index.situations. | 16 |
| abstract_inverted_index.suitability | 127 |
| abstract_inverted_index.Furthermore, | 153 |
| abstract_inverted_index.applications | 42 |
| abstract_inverted_index.capabilities | 62, 147 |
| abstract_inverted_index.considerable | 38 |
| abstract_inverted_index.continuously | 12 |
| abstract_inverted_index.establishing | 44 |
| abstract_inverted_index.increasingly | 2 |
| abstract_inverted_index.registration | 118 |
| abstract_inverted_index.Additionally, | 120 |
| abstract_inverted_index.understanding | 46 |
| abstract_inverted_index.reconstruction | 107 |
| abstract_inverted_index.category-level. | 152 |
| abstract_inverted_index.hyperparameters, | 135 |
| abstract_inverted_index.language-embedded | 105 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| citation_normalized_percentile |