Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
2021 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2108.04927
Language-guided robots performing home and office tasks must navigate in and interact with the world. Grounding language instructions against visual observations and actions to take in an environment is an open challenge. We present Embodied BERT (EmBERT), a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion. Additionally, we bridge the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED, by introducing object navigation targets for EmBERT training. We achieve competitive performance on the ALFRED benchmark, and EmBERT marks the first transformer-based model to successfully handle the long-horizon, dense, multi-modal histories of ALFRED, and the first ALFRED model to utilize object-centric navigation targets.
Overview
- Type: preprint
- Landing page: http://arxiv.org/abs/2108.04927
- PDF: https://arxiv.org/pdf/2108.04927
- OA status: green
- Cited by: 30
- Related works: 10
- OpenAlex ID: https://openalex.org/W4287026640
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4287026640 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2108.04927 (Digital Object Identifier)
- Title: Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion (work title)
- Type: preprint (OpenAlex work type)
- Publication year: 2021
- Publication date: 2021-08-10
- Authors: Alessandro Suglia, Qiaozi Gao, Jesse Thomason, Govind Thattai, Gaurav S. Sukhatme (in order)
- Landing page: https://arxiv.org/abs/2108.04927 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2108.04927 (direct link to full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2108.04927 (direct OA link)
- Concepts: Embodied cognition, Transformer, Computer science, Robot, Task (project management), Language understanding, Benchmark (surveying), Artificial intelligence, Language model, Bridge (graph theory), Human–computer interaction, Engineering, Systems engineering, Electrical engineering, Geodesy, Geography, Internal medicine, Medicine, Voltage (top concepts attached by OpenAlex)
- Cited by: 30 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 2, 2024: 5, 2023: 15, 2022: 8
- Related works (count): 10 (other works algorithmically related by OpenAlex)
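
Since the fields above mirror the OpenAlex work record, the same data can be pulled live from the public OpenAlex API. A minimal sketch, assuming only the documented `https://api.openalex.org/works/{id}` endpoint pattern; the field names match the flattened payload below:

```python
import json
import urllib.request

# Fetch this work's record from the public OpenAlex API.
url = "https://api.openalex.org/works/W4287026640"
with urllib.request.urlopen(url) as resp:
    work = json.load(resp)

# Field names correspond to keys in the "Full payload" table below.
print(work["display_name"])           # paper title
print(work["cited_by_count"])         # 30 at the time of this snapshot
print(work["open_access"]["oa_url"])  # https://arxiv.org/pdf/2108.04927
```

Note that counts such as `cited_by_count` change over time, so a live fetch may differ from the snapshot recorded here.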
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4287026640 |
| doi | https://doi.org/10.48550/arxiv.2108.04927 |
| ids.openalex | https://openalex.org/W4287026640 |
| fwci | 2.86214541 |
| type | preprint |
| title | Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9998999834060669 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T11307 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9894999861717224 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Domain Adaptation and Few-Shot Learning |
| topics[2].id | https://openalex.org/T10812 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9883999824523926 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Human Pose and Action Recognition |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C100609095 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8341007232666016 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1335050 |
| concepts[0].display_name | Embodied cognition |
| concepts[1].id | https://openalex.org/C66322947 |
| concepts[1].level | 3 |
| concepts[1].score | 0.7777326107025146 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q11658 |
| concepts[1].display_name | Transformer |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.7445054650306702 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C90509273 |
| concepts[3].level | 2 |
| concepts[3].score | 0.5331957340240479 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11012 |
| concepts[3].display_name | Robot |
| concepts[4].id | https://openalex.org/C2780451532 |
| concepts[4].level | 2 |
| concepts[4].score | 0.5325065851211548 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q759676 |
| concepts[4].display_name | Task (project management) |
| concepts[5].id | https://openalex.org/C2983448237 |
| concepts[5].level | 2 |
| concepts[5].score | 0.5166759490966797 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q1078276 |
| concepts[5].display_name | Language understanding |
| concepts[6].id | https://openalex.org/C185798385 |
| concepts[6].level | 2 |
| concepts[6].score | 0.5022485256195068 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q1161707 |
| concepts[6].display_name | Benchmark (surveying) |
| concepts[7].id | https://openalex.org/C154945302 |
| concepts[7].level | 1 |
| concepts[7].score | 0.46088361740112305 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[7].display_name | Artificial intelligence |
| concepts[8].id | https://openalex.org/C137293760 |
| concepts[8].level | 2 |
| concepts[8].score | 0.45375606417655945 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q3621696 |
| concepts[8].display_name | Language model |
| concepts[9].id | https://openalex.org/C100776233 |
| concepts[9].level | 2 |
| concepts[9].score | 0.43719029426574707 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q2532492 |
| concepts[9].display_name | Bridge (graph theory) |
| concepts[10].id | https://openalex.org/C107457646 |
| concepts[10].level | 1 |
| concepts[10].score | 0.4154307246208191 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q207434 |
| concepts[10].display_name | Human–computer interaction |
| concepts[11].id | https://openalex.org/C127413603 |
| concepts[11].level | 0 |
| concepts[11].score | 0.14612898230552673 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q11023 |
| concepts[11].display_name | Engineering |
| concepts[12].id | https://openalex.org/C201995342 |
| concepts[12].level | 1 |
| concepts[12].score | 0.0 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q682496 |
| concepts[12].display_name | Systems engineering |
| concepts[13].id | https://openalex.org/C119599485 |
| concepts[13].level | 1 |
| concepts[13].score | 0.0 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q43035 |
| concepts[13].display_name | Electrical engineering |
| concepts[14].id | https://openalex.org/C13280743 |
| concepts[14].level | 1 |
| concepts[14].score | 0.0 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q131089 |
| concepts[14].display_name | Geodesy |
| concepts[15].id | https://openalex.org/C205649164 |
| concepts[15].level | 0 |
| concepts[15].score | 0.0 |
| concepts[15].wikidata | https://www.wikidata.org/wiki/Q1071 |
| concepts[15].display_name | Geography |
| concepts[16].id | https://openalex.org/C126322002 |
| concepts[16].level | 1 |
| concepts[16].score | 0.0 |
| concepts[16].wikidata | https://www.wikidata.org/wiki/Q11180 |
| concepts[16].display_name | Internal medicine |
| concepts[17].id | https://openalex.org/C71924100 |
| concepts[17].level | 0 |
| concepts[17].score | 0.0 |
| concepts[17].wikidata | https://www.wikidata.org/wiki/Q11190 |
| concepts[17].display_name | Medicine |
| concepts[18].id | https://openalex.org/C165801399 |
| concepts[18].level | 2 |
| concepts[18].score | 0.0 |
| concepts[18].wikidata | https://www.wikidata.org/wiki/Q25428 |
| concepts[18].display_name | Voltage |
| keywords[0].id | https://openalex.org/keywords/embodied-cognition |
| keywords[0].score | 0.8341007232666016 |
| keywords[0].display_name | Embodied cognition |
| keywords[1].id | https://openalex.org/keywords/transformer |
| keywords[1].score | 0.7777326107025146 |
| keywords[1].display_name | Transformer |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.7445054650306702 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/robot |
| keywords[3].score | 0.5331957340240479 |
| keywords[3].display_name | Robot |
| keywords[4].id | https://openalex.org/keywords/task |
| keywords[4].score | 0.5325065851211548 |
| keywords[4].display_name | Task (project management) |
| keywords[5].id | https://openalex.org/keywords/language-understanding |
| keywords[5].score | 0.5166759490966797 |
| keywords[5].display_name | Language understanding |
| keywords[6].id | https://openalex.org/keywords/benchmark |
| keywords[6].score | 0.5022485256195068 |
| keywords[6].display_name | Benchmark (surveying) |
| keywords[7].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[7].score | 0.46088361740112305 |
| keywords[7].display_name | Artificial intelligence |
| keywords[8].id | https://openalex.org/keywords/language-model |
| keywords[8].score | 0.45375606417655945 |
| keywords[8].display_name | Language model |
| keywords[9].id | https://openalex.org/keywords/bridge |
| keywords[9].score | 0.43719029426574707 |
| keywords[9].display_name | Bridge (graph theory) |
| keywords[10].id | https://openalex.org/keywords/human–computer-interaction |
| keywords[10].score | 0.4154307246208191 |
| keywords[10].display_name | Human–computer interaction |
| keywords[11].id | https://openalex.org/keywords/engineering |
| keywords[11].score | 0.14612898230552673 |
| keywords[11].display_name | Engineering |
| language | |
| locations[0].id | pmh:oai:arXiv.org:2108.04927 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2108.04927 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2108.04927 |
| indexed_in | arxiv |
| authorships[0].author.id | https://openalex.org/A5010504829 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-3177-5197 |
| authorships[0].author.display_name | Alessandro Suglia |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Suglia, Alessandro |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5077622605 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-5403-0796 |
| authorships[1].author.display_name | Qiaozi Gao |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Gao, Qiaozi |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5108062941 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-9199-0633 |
| authorships[2].author.display_name | Jesse Thomason |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Thomason, Jesse |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5088771920 |
| authorships[3].author.orcid | https://orcid.org/0009-0005-1010-8896 |
| authorships[3].author.display_name | Govind Thattai |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Thattai, Govind |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5077367921 |
| authorships[4].author.orcid | https://orcid.org/0000-0003-2408-474X |
| authorships[4].author.display_name | Gaurav S. Sukhatme |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Sukhatme, Gaurav |
| authorships[4].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2108.04927 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2022-07-25T00:00:00 |
| display_name | Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T03:46:38.306776 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9998999834060669 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| related_works | https://openalex.org/W3013624417, https://openalex.org/W4287826556, https://openalex.org/W3098382480, https://openalex.org/W4287598411, https://openalex.org/W3094871513, https://openalex.org/W3100913109, https://openalex.org/W3198458223, https://openalex.org/W4288365749, https://openalex.org/W2936497627, https://openalex.org/W2964413124 |
| cited_by_count | 30 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 2 |
| counts_by_year[1].year | 2024 |
| counts_by_year[1].cited_by_count | 5 |
| counts_by_year[2].year | 2023 |
| counts_by_year[2].cited_by_count | 15 |
| counts_by_year[3].year | 2022 |
| counts_by_year[3].cited_by_count | 8 |
| locations_count | 1 |
| best_oa_location.id | pmh:oai:arXiv.org:2108.04927 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2108.04927 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2108.04927 |
| primary_location.id | pmh:oai:arXiv.org:2108.04927 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2108.04927 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2108.04927 |
| publication_date | 2021-08-10 |
| publication_year | 2021 |
| referenced_works_count | 0 |
| abstract_inverted_index | (word-to-position map encoding the abstract; the decoded text appears above, and a decoding sketch follows this table) |
| cited_by_percentile_year.max | 99 |
| cited_by_percentile_year.min | 95 |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.5899999737739563 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile.value | 0.92328528 |
| citation_normalized_percentile.is_in_top_1_percent | False |
| citation_normalized_percentile.is_in_top_10_percent | True |
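
The `abstract_inverted_index` collapsed in the table above is how OpenAlex stores abstracts: each token maps to the list of positions where it occurs. A minimal sketch of decoding it back to plain text; the helper name is my own, but the data structure is the standard OpenAlex one:

```python
def decode_inverted_index(inv_index: dict[str, list[int]]) -> str:
    """Rebuild abstract text from an OpenAlex abstract_inverted_index."""
    # Pair every (position, token), sort by position, then join with spaces.
    pairs = [(pos, token)
             for token, positions in inv_index.items()
             for pos in positions]
    return " ".join(token for _, token in sorted(pairs))

# Tiny example using the first few entries of this work's index:
sample = {"Language-guided": [0], "robots": [1], "performing": [2], "home": [3]}
print(decode_inverted_index(sample))  # -> "Language-guided robots performing home"
```

Applied to the full index, this reproduces the abstract shown near the top of this record.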