Grounding Video Models to Actions through Goal Conditioned Exploration

2024 · Open Access

DOI: https://doi.org/10.48550/arxiv.2411.07223
Large video models, pretrained on massive amounts of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks. However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to reach the visual states depicted in a video. To tackle this problem, current methods use a separate vision-based inverse dynamics model trained on embodiment-specific data to map image states to actions. Gathering data to train such a model is often expensive and challenging, and the resulting model is limited to visual settings similar to the ones in which data are available. In this paper, we investigate how to directly ground video models to continuous actions through self-exploration in the embodied environment -- using generated video states as visual goals for exploration. We propose a framework that combines trajectory-level action generation with video guidance to enable an agent to solve complex tasks without any external supervision, e.g., rewards, action labels, or segmentation masks. We validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation. We show that our approach is on par with, or even surpasses, multiple behavior cloning baselines trained on expert demonstrations, while requiring no action annotations.
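The loop the abstract describes can be sketched in miniature: a video model proposes a goal state, and the agent explores actions to reach it, with no reward or action labels. This is an illustrative toy (not the authors' code); `video_model`, `distance`, and the scalar state are hypothetical stand-ins for generated video frames and a real embodiment.

```python
import random

def video_model(state, task):
    """Stand-in for a pretrained video model: the 'generated frame' here
    is simply the task's goal configuration as a scalar."""
    return task

def distance(a, b):
    """Stand-in for a visual distance between states."""
    return abs(a - b)

def explore_to_goal(state, goal, n_candidates=32, steps=20, rng=None):
    """Goal-conditioned exploration: sample candidate actions and keep the
    one whose resulting state lands closest to the video-generated goal."""
    rng = rng or random.Random(0)
    for _ in range(steps):
        candidates = [rng.uniform(-1.0, 1.0) for _ in range(n_candidates)]
        action = min(candidates, key=lambda a: distance(state + a, goal))
        state = state + action
        if distance(state, goal) < 1e-2:
            break
    return state

goal = video_model(state=0.0, task=3.0)  # the video model proposes the goal
final = explore_to_goal(0.0, goal)       # exploration grounds it to actions
```

The real method replaces the scalar stand-ins with generated video frames as goals and trajectory-level action generation, but the supervision structure is the same: the only target is the visual goal.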
Related Topics

- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2411.07223, https://arxiv.org/pdf/2411.07223
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4404391971
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4404391971 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2411.07223 (Digital Object Identifier)
- Title: Grounding Video Models to Actions through Goal Conditioned Exploration (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-11-11 (full publication date if available)
- Authors: Yuling Luo, Yilun Du (list of authors in order)
- Landing page: https://arxiv.org/abs/2411.07223 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2411.07223 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2411.07223 (direct OA link when available)
- Concepts: Computer science, Cognitive science, Psychology (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
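The fields above come from the OpenAlex REST API, whose single-work endpoint is `https://api.openalex.org/works/{id}`. A minimal sketch of retrieving this record; the live fetch is left commented out to avoid a network call:

```python
import json
from urllib.request import urlopen

OPENALEX_API = "https://api.openalex.org/works/"

def work_id_from_url(openalex_url: str) -> str:
    """Extract the short work ID (e.g. 'W4404391971') from a canonical URL."""
    return openalex_url.rstrip("/").rsplit("/", 1)[-1]

def api_url(openalex_url: str) -> str:
    """Build the OpenAlex API URL for a work's canonical URL."""
    return OPENALEX_API + work_id_from_url(openalex_url)

url = api_url("https://openalex.org/W4404391971")
# work = json.load(urlopen(url))   # uncomment to fetch the live JSON record
# print(work["display_name"])
```

The JSON returned by that endpoint is what the "Full payload" section below flattens into key-value rows.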
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4404391971 |
| doi | https://doi.org/10.48550/arxiv.2411.07223 |
| ids.doi | https://doi.org/10.48550/arxiv.2411.07223 |
| ids.openalex | https://openalex.org/W4404391971 |
| fwci | |
| type | preprint |
| title | Grounding Video Models to Actions through Goal Conditioned Exploration |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10731 |
| topics[0].field.id | https://openalex.org/fields/32 |
| topics[0].field.display_name | Psychology |
| topics[0].score | 0.6140999794006348 |
| topics[0].domain.id | https://openalex.org/domains/2 |
| topics[0].domain.display_name | Social Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/3204 |
| topics[0].subfield.display_name | Developmental and Educational Psychology |
| topics[0].display_name | Educational Games and Gamification |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.46702054142951965 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C188147891 |
| concepts[1].level | 1 |
| concepts[1].score | 0.3584008812904358 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q147638 |
| concepts[1].display_name | Cognitive science |
| concepts[2].id | https://openalex.org/C15744967 |
| concepts[2].level | 0 |
| concepts[2].score | 0.2472183108329773 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[2].display_name | Psychology |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.46702054142951965 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/cognitive-science |
| keywords[1].score | 0.3584008812904358 |
| keywords[1].display_name | Cognitive science |
| keywords[2].id | https://openalex.org/keywords/psychology |
| keywords[2].score | 0.2472183108329773 |
| keywords[2].display_name | Psychology |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2411.07223 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2411.07223 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2411.07223 |
| locations[1].id | doi:10.48550/arxiv.2411.07223 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2411.07223 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5112127033 |
| authorships[0].author.orcid | https://orcid.org/0009-0000-3299-4853 |
| authorships[0].author.display_name | Yuling Luo |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Luo, Yunhao |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5101182304 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Yilun Du |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Du, Yilun |
| authorships[1].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2411.07223 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Grounding Video Models to Actions through Goal Conditioned Exploration |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10731 |
| primary_topic.field.id | https://openalex.org/fields/32 |
| primary_topic.field.display_name | Psychology |
| primary_topic.score | 0.6140999794006348 |
| primary_topic.domain.id | https://openalex.org/domains/2 |
| primary_topic.domain.display_name | Social Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/3204 |
| primary_topic.subfield.display_name | Developmental and Educational Psychology |
| primary_topic.display_name | Educational Games and Gamification |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W4391913857, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2411.07223 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2411.07223 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2411.07223 |
| primary_location.id | pmh:oai:arXiv.org:2411.07223 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2411.07223 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2411.07223 |
| publication_date | 2024-11-11 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index | (token-to-position inverted index of the abstract; the abstract is reproduced in full above) |
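OpenAlex stores abstracts as an inverted index: each token maps to the list of word positions where it occurs. The plain text can be rebuilt by sorting positions and joining tokens, as this short sketch shows on a fragment of this record's abstract:

```python
def decode_inverted_index(inv: dict) -> str:
    """Rebuild a plain-text abstract from an OpenAlex abstract_inverted_index,
    which maps each token to the list of positions where it appears."""
    positions = {}
    for token, idxs in inv.items():
        for i in idxs:
            positions[i] = token
    return " ".join(positions[i] for i in sorted(positions))

sample = {"Large": [0], "video": [1], "models,": [2], "pretrained": [3]}
text = decode_inverted_index(sample)  # -> "Large video models, pretrained"
```

Applied to the full `abstract_inverted_index` rows above, this reproduces the abstract quoted at the top of the page.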
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 2 |
| citation_normalized_percentile |