Grounding Video Models to Actions through Goal Conditioned Exploration

2024 · Open Access

DOI: https://doi.org/10.48550/arxiv.2411.07223
Large video models, pretrained on massive amounts of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks. However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to reach the visual states depicted in a video. To tackle this problem, current methods use a separate vision-based inverse dynamics model trained on embodiment-specific data to map image states to actions. Gathering data to train such a model is often expensive and challenging, and the resulting model is limited to visual settings similar to the ones in which data are available. In this paper, we investigate how to directly ground video models to continuous actions through self-exploration in the embodied environment -- using generated video states as visual goals for exploration. We propose a framework that combines trajectory-level action generation with video guidance to enable an agent to solve complex tasks without any external supervision, e.g., rewards, action labels, or segmentation masks. We validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation. We show that our approach is on par with, or even surpasses, multiple behavior cloning baselines trained on expert demonstrations, while requiring no action annotations.
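The loop the abstract describes can be sketched in miniature: a video model proposes a goal state, and the agent explores actions to reach it, with no reward or action labels. This is an illustrative toy (not the authors' code); `video_model`, `distance`, and the scalar state are hypothetical stand-ins for generated video frames and a real embodiment.

```python
import random

def video_model(state, task):
    """Stand-in for a pretrained video model: the 'generated frame' here
    is simply the task's goal configuration as a scalar."""
    return task

def distance(a, b):
    """Stand-in for a visual distance between states."""
    return abs(a - b)

def explore_to_goal(state, goal, n_candidates=32, steps=20, rng=None):
    """Goal-conditioned exploration: sample candidate actions and keep the
    one whose resulting state lands closest to the video-generated goal."""
    rng = rng or random.Random(0)
    for _ in range(steps):
        candidates = [rng.uniform(-1.0, 1.0) for _ in range(n_candidates)]
        action = min(candidates, key=lambda a: distance(state + a, goal))
        state = state + action
        if distance(state, goal) < 1e-2:
            break
    return state

goal = video_model(state=0.0, task=3.0)  # the video model proposes the goal
final = explore_to_goal(0.0, goal)       # exploration grounds it to actions
```

The real method replaces the scalar stand-ins with generated video frames as goals and trajectory-level action generation, but the supervision structure is the same: the only target is the visual goal.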
Related Topics

- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2411.07223, https://arxiv.org/pdf/2411.07223
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4404391971
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4404391971 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2411.07223 (Digital Object Identifier)
- Title: Grounding Video Models to Actions through Goal Conditioned Exploration (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-11-11 (full publication date if available)
- Authors: Yuling Luo, Yilun Du (list of authors in order)
- Landing page: https://arxiv.org/abs/2411.07223 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2411.07223 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2411.07223 (direct OA link when available)
- Concepts: Computer science, Cognitive science, Psychology (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
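The fields above come from the OpenAlex REST API, whose single-work endpoint is `https://api.openalex.org/works/{id}`. A minimal sketch of retrieving this record; the live fetch is left commented out to avoid a network call:

```python
import json
from urllib.request import urlopen

OPENALEX_API = "https://api.openalex.org/works/"

def work_id_from_url(openalex_url: str) -> str:
    """Extract the short work ID (e.g. 'W4404391971') from a canonical URL."""
    return openalex_url.rstrip("/").rsplit("/", 1)[-1]

def api_url(openalex_url: str) -> str:
    """Build the OpenAlex API URL for a work's canonical URL."""
    return OPENALEX_API + work_id_from_url(openalex_url)

url = api_url("https://openalex.org/W4404391971")
# work = json.load(urlopen(url))   # uncomment to fetch the live JSON record
# print(work["display_name"])
```

The JSON returned by that endpoint is what the "Full payload" section below flattens into key-value rows.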
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4404391971 |
| doi | https://doi.org/10.48550/arxiv.2411.07223 |
| ids.doi | https://doi.org/10.48550/arxiv.2411.07223 |
| ids.openalex | https://openalex.org/W4404391971 |
| fwci | |
| type | preprint |
| title | Grounding Video Models to Actions through Goal Conditioned Exploration |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10731 |
| topics[0].field.id | https://openalex.org/fields/32 |
| topics[0].field.display_name | Psychology |
| topics[0].score | 0.6140999794006348 |
| topics[0].domain.id | https://openalex.org/domains/2 |
| topics[0].domain.display_name | Social Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/3204 |
| topics[0].subfield.display_name | Developmental and Educational Psychology |
| topics[0].display_name | Educational Games and Gamification |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.46702054142951965 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C188147891 |
| concepts[1].level | 1 |
| concepts[1].score | 0.3584008812904358 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q147638 |
| concepts[1].display_name | Cognitive science |
| concepts[2].id | https://openalex.org/C15744967 |
| concepts[2].level | 0 |
| concepts[2].score | 0.2472183108329773 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[2].display_name | Psychology |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.46702054142951965 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/cognitive-science |
| keywords[1].score | 0.3584008812904358 |
| keywords[1].display_name | Cognitive science |
| keywords[2].id | https://openalex.org/keywords/psychology |
| keywords[2].score | 0.2472183108329773 |
| keywords[2].display_name | Psychology |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2411.07223 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2411.07223 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2411.07223 |
| locations[1].id | doi:10.48550/arxiv.2411.07223 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2411.07223 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5112127033 |
| authorships[0].author.orcid | https://orcid.org/0009-0000-3299-4853 |
| authorships[0].author.display_name | Yuling Luo |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Luo, Yunhao |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5101182304 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Yilun Du |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Du, Yilun |
| authorships[1].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2411.07223 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Grounding Video Models to Actions through Goal Conditioned Exploration |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10731 |
| primary_topic.field.id | https://openalex.org/fields/32 |
| primary_topic.field.display_name | Psychology |
| primary_topic.score | 0.6140999794006348 |
| primary_topic.domain.id | https://openalex.org/domains/2 |
| primary_topic.domain.display_name | Social Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/3204 |
| primary_topic.subfield.display_name | Developmental and Educational Psychology |
| primary_topic.display_name | Educational Games and Gamification |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W4391913857, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2411.07223 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2411.07223 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2411.07223 |
| primary_location.id | pmh:oai:arXiv.org:2411.07223 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2411.07223 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2411.07223 |
| publication_date | 2024-11-11 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index | (token-to-position inverted index of the abstract; the abstract is reproduced in full above) |
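OpenAlex stores abstracts as an inverted index: each token maps to the list of word positions where it occurs. The plain text can be rebuilt by sorting positions and joining tokens, as this short sketch shows on a fragment of this record's abstract:

```python
def decode_inverted_index(inv: dict) -> str:
    """Rebuild a plain-text abstract from an OpenAlex abstract_inverted_index,
    which maps each token to the list of positions where it appears."""
    positions = {}
    for token, idxs in inv.items():
        for i in idxs:
            positions[i] = token
    return " ".join(positions[i] for i in sorted(positions))

sample = {"Large": [0], "video": [1], "models,": [2], "pretrained": [3]}
text = decode_inverted_index(sample)  # -> "Large video models, pretrained"
```

Applied to the full `abstract_inverted_index` rows above, this reproduces the abstract quoted at the top of the page.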
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 2 |
| citation_normalized_percentile |