Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
2021 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2108.04927
Language-guided robots performing home and office tasks must navigate in and interact with the world. Grounding language instructions against visual observations and actions to take in an environment is an open challenge. We present Embodied BERT (EmBERT), a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion. Additionally, we bridge the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED, by introducing object navigation targets for EmBERT training. We achieve competitive performance on the ALFRED benchmark, and EmBERT marks the first transformer-based model to successfully handle the long-horizon, dense, multi-modal histories of ALFRED, and the first ALFRED model to utilize object-centric navigation targets.
Overview
- Type: preprint
- Landing page: http://arxiv.org/abs/2108.04927
- PDF: https://arxiv.org/pdf/2108.04927
- OA status: green
- Cited by: 30
- Related works: 10
- OpenAlex ID: https://openalex.org/W4287026640
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4287026640 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2108.04927 (Digital Object Identifier)
- Title: Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion (work title)
- Type: preprint (OpenAlex work type)
- Publication year: 2021
- Publication date: 2021-08-10
- Authors: Alessandro Suglia, Qiaozi Gao, Jesse Thomason, Govind Thattai, Gaurav S. Sukhatme (in order)
- Landing page: https://arxiv.org/abs/2108.04927 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2108.04927 (direct link to full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2108.04927 (direct OA link)
- Concepts: Embodied cognition, Transformer, Computer science, Robot, Task (project management), Language understanding, Benchmark (surveying), Artificial intelligence, Language model, Bridge (graph theory), Human–computer interaction, Engineering, Systems engineering, Electrical engineering, Geodesy, Geography, Internal medicine, Medicine, Voltage (top concepts attached by OpenAlex)
- Cited by: 30 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 2, 2024: 5, 2023: 15, 2022: 8
- Related works (count): 10 (other works algorithmically related by OpenAlex)
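
Since the fields above mirror the OpenAlex work record, the same data can be pulled live from the public OpenAlex API. A minimal sketch, assuming only the documented `https://api.openalex.org/works/{id}` endpoint pattern; the field names match the flattened payload below:

```python
import json
import urllib.request

# Fetch this work's record from the public OpenAlex API.
url = "https://api.openalex.org/works/W4287026640"
with urllib.request.urlopen(url) as resp:
    work = json.load(resp)

# Field names correspond to keys in the "Full payload" table below.
print(work["display_name"])           # paper title
print(work["cited_by_count"])         # 30 at the time of this snapshot
print(work["open_access"]["oa_url"])  # https://arxiv.org/pdf/2108.04927
```

Note that counts such as `cited_by_count` change over time, so a live fetch may differ from the snapshot recorded here.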
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4287026640 |
| doi | https://doi.org/10.48550/arxiv.2108.04927 |
| ids.openalex | https://openalex.org/W4287026640 |
| fwci | 2.86214541 |
| type | preprint |
| title | Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9998999834060669 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T11307 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9894999861717224 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Domain Adaptation and Few-Shot Learning |
| topics[2].id | https://openalex.org/T10812 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9883999824523926 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Human Pose and Action Recognition |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C100609095 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8341007232666016 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1335050 |
| concepts[0].display_name | Embodied cognition |
| concepts[1].id | https://openalex.org/C66322947 |
| concepts[1].level | 3 |
| concepts[1].score | 0.7777326107025146 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q11658 |
| concepts[1].display_name | Transformer |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.7445054650306702 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C90509273 |
| concepts[3].level | 2 |
| concepts[3].score | 0.5331957340240479 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11012 |
| concepts[3].display_name | Robot |
| concepts[4].id | https://openalex.org/C2780451532 |
| concepts[4].level | 2 |
| concepts[4].score | 0.5325065851211548 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q759676 |
| concepts[4].display_name | Task (project management) |
| concepts[5].id | https://openalex.org/C2983448237 |
| concepts[5].level | 2 |
| concepts[5].score | 0.5166759490966797 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q1078276 |
| concepts[5].display_name | Language understanding |
| concepts[6].id | https://openalex.org/C185798385 |
| concepts[6].level | 2 |
| concepts[6].score | 0.5022485256195068 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q1161707 |
| concepts[6].display_name | Benchmark (surveying) |
| concepts[7].id | https://openalex.org/C154945302 |
| concepts[7].level | 1 |
| concepts[7].score | 0.46088361740112305 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[7].display_name | Artificial intelligence |
| concepts[8].id | https://openalex.org/C137293760 |
| concepts[8].level | 2 |
| concepts[8].score | 0.45375606417655945 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q3621696 |
| concepts[8].display_name | Language model |
| concepts[9].id | https://openalex.org/C100776233 |
| concepts[9].level | 2 |
| concepts[9].score | 0.43719029426574707 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q2532492 |
| concepts[9].display_name | Bridge (graph theory) |
| concepts[10].id | https://openalex.org/C107457646 |
| concepts[10].level | 1 |
| concepts[10].score | 0.4154307246208191 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q207434 |
| concepts[10].display_name | Human–computer interaction |
| concepts[11].id | https://openalex.org/C127413603 |
| concepts[11].level | 0 |
| concepts[11].score | 0.14612898230552673 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q11023 |
| concepts[11].display_name | Engineering |
| concepts[12].id | https://openalex.org/C201995342 |
| concepts[12].level | 1 |
| concepts[12].score | 0.0 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q682496 |
| concepts[12].display_name | Systems engineering |
| concepts[13].id | https://openalex.org/C119599485 |
| concepts[13].level | 1 |
| concepts[13].score | 0.0 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q43035 |
| concepts[13].display_name | Electrical engineering |
| concepts[14].id | https://openalex.org/C13280743 |
| concepts[14].level | 1 |
| concepts[14].score | 0.0 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q131089 |
| concepts[14].display_name | Geodesy |
| concepts[15].id | https://openalex.org/C205649164 |
| concepts[15].level | 0 |
| concepts[15].score | 0.0 |
| concepts[15].wikidata | https://www.wikidata.org/wiki/Q1071 |
| concepts[15].display_name | Geography |
| concepts[16].id | https://openalex.org/C126322002 |
| concepts[16].level | 1 |
| concepts[16].score | 0.0 |
| concepts[16].wikidata | https://www.wikidata.org/wiki/Q11180 |
| concepts[16].display_name | Internal medicine |
| concepts[17].id | https://openalex.org/C71924100 |
| concepts[17].level | 0 |
| concepts[17].score | 0.0 |
| concepts[17].wikidata | https://www.wikidata.org/wiki/Q11190 |
| concepts[17].display_name | Medicine |
| concepts[18].id | https://openalex.org/C165801399 |
| concepts[18].level | 2 |
| concepts[18].score | 0.0 |
| concepts[18].wikidata | https://www.wikidata.org/wiki/Q25428 |
| concepts[18].display_name | Voltage |
| keywords[0].id | https://openalex.org/keywords/embodied-cognition |
| keywords[0].score | 0.8341007232666016 |
| keywords[0].display_name | Embodied cognition |
| keywords[1].id | https://openalex.org/keywords/transformer |
| keywords[1].score | 0.7777326107025146 |
| keywords[1].display_name | Transformer |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.7445054650306702 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/robot |
| keywords[3].score | 0.5331957340240479 |
| keywords[3].display_name | Robot |
| keywords[4].id | https://openalex.org/keywords/task |
| keywords[4].score | 0.5325065851211548 |
| keywords[4].display_name | Task (project management) |
| keywords[5].id | https://openalex.org/keywords/language-understanding |
| keywords[5].score | 0.5166759490966797 |
| keywords[5].display_name | Language understanding |
| keywords[6].id | https://openalex.org/keywords/benchmark |
| keywords[6].score | 0.5022485256195068 |
| keywords[6].display_name | Benchmark (surveying) |
| keywords[7].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[7].score | 0.46088361740112305 |
| keywords[7].display_name | Artificial intelligence |
| keywords[8].id | https://openalex.org/keywords/language-model |
| keywords[8].score | 0.45375606417655945 |
| keywords[8].display_name | Language model |
| keywords[9].id | https://openalex.org/keywords/bridge |
| keywords[9].score | 0.43719029426574707 |
| keywords[9].display_name | Bridge (graph theory) |
| keywords[10].id | https://openalex.org/keywords/human–computer-interaction |
| keywords[10].score | 0.4154307246208191 |
| keywords[10].display_name | Human–computer interaction |
| keywords[11].id | https://openalex.org/keywords/engineering |
| keywords[11].score | 0.14612898230552673 |
| keywords[11].display_name | Engineering |
| language | |
| locations[0].id | pmh:oai:arXiv.org:2108.04927 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2108.04927 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2108.04927 |
| indexed_in | arxiv |
| authorships[0].author.id | https://openalex.org/A5010504829 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-3177-5197 |
| authorships[0].author.display_name | Alessandro Suglia |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Suglia, Alessandro |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5077622605 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-5403-0796 |
| authorships[1].author.display_name | Qiaozi Gao |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Gao, Qiaozi |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5108062941 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-9199-0633 |
| authorships[2].author.display_name | Jesse Thomason |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Thomason, Jesse |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5088771920 |
| authorships[3].author.orcid | https://orcid.org/0009-0005-1010-8896 |
| authorships[3].author.display_name | Govind Thattai |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Thattai, Govind |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5077367921 |
| authorships[4].author.orcid | https://orcid.org/0000-0003-2408-474X |
| authorships[4].author.display_name | Gaurav S. Sukhatme |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Sukhatme, Gaurav |
| authorships[4].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2108.04927 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2022-07-25T00:00:00 |
| display_name | Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T03:46:38.306776 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9998999834060669 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| related_works | https://openalex.org/W3013624417, https://openalex.org/W4287826556, https://openalex.org/W3098382480, https://openalex.org/W4287598411, https://openalex.org/W3094871513, https://openalex.org/W3100913109, https://openalex.org/W3198458223, https://openalex.org/W4288365749, https://openalex.org/W2936497627, https://openalex.org/W2964413124 |
| cited_by_count | 30 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 2 |
| counts_by_year[1].year | 2024 |
| counts_by_year[1].cited_by_count | 5 |
| counts_by_year[2].year | 2023 |
| counts_by_year[2].cited_by_count | 15 |
| counts_by_year[3].year | 2022 |
| counts_by_year[3].cited_by_count | 8 |
| locations_count | 1 |
| best_oa_location.id | pmh:oai:arXiv.org:2108.04927 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2108.04927 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2108.04927 |
| primary_location.id | pmh:oai:arXiv.org:2108.04927 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2108.04927 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2108.04927 |
| publication_date | 2021-08-10 |
| publication_year | 2021 |
| referenced_works_count | 0 |
| abstract_inverted_index | (word-to-position map encoding the abstract; the decoded text appears above, and a decoding sketch follows this table) |
| cited_by_percentile_year.max | 99 |
| cited_by_percentile_year.min | 95 |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.5899999737739563 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile.value | 0.92328528 |
| citation_normalized_percentile.is_in_top_1_percent | False |
| citation_normalized_percentile.is_in_top_10_percent | True |
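
The `abstract_inverted_index` collapsed in the table above is how OpenAlex stores abstracts: each token maps to the list of positions where it occurs. A minimal sketch of decoding it back to plain text; the helper name is my own, but the data structure is the standard OpenAlex one:

```python
def decode_inverted_index(inv_index: dict[str, list[int]]) -> str:
    """Rebuild abstract text from an OpenAlex abstract_inverted_index."""
    # Pair every (position, token), sort by position, then join with spaces.
    pairs = [(pos, token)
             for token, positions in inv_index.items()
             for pos in positions]
    return " ".join(token for _, token in sorted(pairs))

# Tiny example using the first few entries of this work's index:
sample = {"Language-guided": [0], "robots": [1], "performing": [2], "home": [3]}
print(decode_inverted_index(sample))  # -> "Language-guided robots performing home"
```

Applied to the full index, this reproduces the abstract shown near the top of this record.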