Govind Thattai
Alexa, play with robot: Introducing the First Alexa Prize SimBot Challenge on Embodied AI
The Alexa Prize program has empowered numerous university students to explore, experiment, and showcase their talents in building conversational agents through challenges like the SocialBot Grand Challenge and the TaskBot Challenge. As con…
Exploiting Generalization in Offline Reinforcement Learning via Unseen State Augmentations
Offline reinforcement learning (RL) methods strike a balance between exploration and exploitation by conservative value estimation -- penalizing values of unseen states and actions. Model-free methods penalize values at all unseen actions,…
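To make the conservatism idea above concrete, here is a minimal sketch (not this paper's method; the Q-table, dataset mask, and penalty coefficient are invented for illustration) of value estimation that penalizes state-action pairs absent from the offline dataset:

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3
q_values = rng.normal(size=(n_states, n_actions))    # hypothetical learned Q(s, a) table
seen = np.zeros((n_states, n_actions), dtype=bool)   # (s, a) pairs present in the offline dataset
seen[[0, 1, 2, 3, 4], [0, 2, 1, 0, 2]] = True

penalty = 1.0                                         # conservatism coefficient (assumed value)
conservative_q = np.where(seen, q_values, q_values - penalty)

# The greedy policy is derived from the penalized values, so it avoids
# state-action pairs never observed in the data unless they still dominate.
greedy_actions = conservative_q.argmax(axis=1)
print(greedy_actions)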
LEMMA: Learning Language-Conditioned Multi-Robot Manipulation
Complex manipulation tasks often require robots with complementary capabilities to collaborate. We introduce a benchmark for LanguagE-Conditioned Multi-robot MAnipulation (LEMMA) focused on task allocation and long-horizon object manipulat…
Neural Architecture Search for Parameter-Efficient Fine-tuning of Large Pre-trained Language Models
Parameter-efficient tuning (PET) methods fit pre-trained language models (PLMs) to downstream tasks by either computing a small compressed update for a subset of model parameters, or appending and fine-tuning a small number of new model pa…
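As a rough illustration of the two PET families the abstract mentions (the shapes, module names, and ReLU bottleneck below are assumptions, not the architectures searched in this paper):

import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 768, 8

W_frozen = rng.normal(size=(d_model, d_model))   # pre-trained weight, kept fixed

# (1) compressed update: only A and B (2 * d_model * rank parameters) are trained
A = rng.normal(scale=0.01, size=(d_model, rank))
B = np.zeros((rank, d_model))
W_effective = W_frozen + A @ B                   # low-rank delta applied to the frozen weight

# (2) appended parameters: a small bottleneck module added after the frozen layer
W_down = rng.normal(scale=0.01, size=(d_model, rank))
W_up = np.zeros((rank, d_model))

def layer_with_adapter(x):
    h = x @ W_frozen                             # frozen backbone computation
    return h + np.maximum(h @ W_down, 0) @ W_up  # trainable bottleneck on a residual path

print(layer_with_adapter(rng.normal(size=(1, d_model))).shape)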
Alexa Arena: A User-Centric Interactive Platform for Embodied AI
We introduce Alexa Arena, a user-centric simulation platform for Embodied AI (EAI) research. Alexa Arena provides a variety of multi-room layouts and interactable objects, for the creation of human-robot interaction (HRI) missions. With us…
Language-Informed Transfer Learning for Embodied Household Activities
For service robots to become general-purpose in everyday household environments, they need not only a large library of primitive skills, but also the ability to quickly learn novel tasks specified by users. Fine-tuning neural networks on a…
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
A key goal for the advancement of AI is to develop technologies that serve the needs not just of one group but of all communities regardless of their geographical region. In fact, a significant proportion of knowledge is locally shared by …
OpenD: A Benchmark for Language-Driven Door and Drawer Opening
We introduce OPEND, a benchmark for learning how to use a hand to open cabinet doors or drawers in a photo-realistic and physics-reliable simulation environment driven by language instruction. To solve the task, we propose a multi-step pla…
TPA-Net: Generate A Dataset for Text to Physics-based Animation
Recent breakthroughs in Vision-Language (V&L) joint research have achieved remarkable results in various text-driven tasks. High-quality Text-to-video (T2V), a task that has been long considered mission-impossible, was proven feasible with…
Towards Reasoning-Aware Explainable VQA
The domain of joint vision-language understanding, especially in the context of reasoning in Visual Question Answering (VQA) models, has garnered significant attention in the recent past. While most of the existing VQA models focus on impr…
CH-MARL: A Multimodal Benchmark for Cooperative, Heterogeneous Multi-Agent Reinforcement Learning
We propose a multimodal (vision-and-language) benchmark for cooperative and heterogeneous multi-agent learning. We introduce a benchmark multimodal dataset with tasks involving collaboration between multiple simulated heterogeneous robots …
DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following
Language-guided Embodied AI benchmarks requiring an agent to navigate an environment and manipulate objects typically allow one-way communication: the human user gives a natural language command to the agent, and the agent can only follow …
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
To solve video-and-language grounding tasks, the key is for the network to understand the connection between the two modalities. For a pair of video and language description, their semantic relation is reflected by their encodings' similar…
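A minimal numpy sketch of similarity-based alignment in this spirit (the random embeddings, temperature, and symmetric contrastive loss are placeholders rather than the paper's multi-level training scheme):

import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch, dim = 4, 32
video_emb = l2_normalize(rng.normal(size=(batch, dim)))   # placeholder video encodings
text_emb = l2_normalize(rng.normal(size=(batch, dim)))    # placeholder description encodings

logits = video_emb @ text_emb.T / 0.07                    # pairwise similarities (assumed temperature)
labels = np.arange(batch)                                 # the i-th video matches the i-th description

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Matched pairs are pushed toward high similarity, mismatched pairs toward low.
loss = 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
print(float(loss))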
Privacy Preserving Visual Question Answering
We introduce a novel privacy-preserving methodology for performing Visual Question Answering on the edge. Our method constructs a symbolic representation of the visual scene, using a low-complexity computer vision model that jointly predic…
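A toy sketch of the symbolic-scene idea, with invented object entries and question handlers rather than the paper's models: the image is reduced on-device to a symbolic representation (objects, attributes, relations), and only that structure, not the raw pixels, is used to answer the question.

scene = [
    {"object": "cup", "color": "red", "on": "table"},      # hypothetical detector output
    {"object": "book", "color": "blue", "on": "table"},
]

def answer(question, scene):
    # Toy question handlers that operate on the symbolic representation only.
    if question.startswith("how many"):
        target = question.split()[-1].rstrip("s?")
        return sum(1 for entry in scene if entry["object"] == target)
    if question.startswith("what color is the"):
        target = question.split()[-1].rstrip("?")
        matches = [entry["color"] for entry in scene if entry["object"] == target]
        return matches[0] if matches else "unknown"
    return "unknown"

print(answer("how many cups?", scene))           # -> 1
print(answer("what color is the book?", scene))  # -> blue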
Learning to Act with Affordance-Aware Multimodal Neural SLAM
Recent years have witnessed an emerging paradigm shift toward embodied artificial intelligence, in which an agent must learn to solve challenging tasks by interacting with its environment. There are several challenges in solving embodied m…
Learning Two-Step Hybrid Policy for Graph-Based Interpretable Reinforcement Learning
We present a two-step hybrid reinforcement learning (RL) policy that is designed to generate interpretable and robust hierarchical policies on the RL problem with graph-based input. Unlike prior deep reinforcement learning policies paramet…
A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering
Outside-knowledge visual question answering (OK-VQA) requires the agent to comprehend the image, make use of relevant knowledge from the entire web, and digest all the information to answer the question. Most previous works address the pro…
Best of Both Worlds: A Hybrid Approach for Multi-Hop Explanation with Declarative Facts
Language-enabled AI systems can answer complex, multi-hop questions with high accuracy, but supporting answers with evidence is a more challenging task that is important for transparency and trustworthiness to users. Prior work in this …
LUMINOUS: Indoor Scene Generation for Embodied AI Challenges
Learning-based methods for training embodied agents typically require a large number of high-quality scenes that contain realistic layouts and support meaningful interactions. However, current simulators for Embodied AI (EAI) challenges on…
Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
Language-guided robots performing home and office tasks must navigate in and interact with the world. Grounding language instructions against visual observations and actions to take in an environment is an open challenge. We present Embodi…
Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation
GuessWhat?! is a two-player visual dialog guessing game where player A asks a sequence of yes/no questions (Questioner) and makes a final guess (Guesser) about a target object in an image, based on answers from player B (Oracle). Based on …
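A toy sketch of the GuessWhat?! interaction pattern described above, with the three roles stubbed by trivial rules instead of the learned agents studied in the paper:

objects = ["red cup", "blue cup", "red book"]
target = "blue cup"                                   # known only to the Oracle

def oracle(question):
    # Oracle: answers yes/no with respect to the hidden target object.
    return "yes" if question.rstrip("?").split()[-1] in target else "no"

def questioner(candidates):
    # Questioner: asks about one distinguishing word per turn (a stand-in for a learned policy).
    words = {w for obj in candidates for w in obj.split()}
    for word in sorted(words):
        yield f"is it {word}?"

candidates = list(objects)
for question in questioner(candidates):
    reply = oracle(question)
    word = question.rstrip("?").split()[-1]
    candidates = [o for o in candidates if (word in o) == (reply == "yes")]
    if len(candidates) == 1:
        break

print("guess:", candidates[0])                        # Guesser's final choice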
Are We There Yet? Learning to Localize in Embodied Instruction Following
Embodied instruction following is a challenging problem requiring an agent to infer a sequence of primitive actions to achieve a goal environment state from complex language and visual inputs. Action Learning From Realistic Environments an…
Interactive Teaching for Conversational AI
Current conversational AI systems aim to understand a set of pre-designed requests and execute related actions, which limits them to evolve naturally and adapt based on human interactions. Motivated by how children learn their first langua…
LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering
The predominant approach to visual question answering (VQA) relies on encoding the image and question with a "black-box" neural encoder and decoding a single token as the answer like "yes" or "no". Despite this approach's strong quantitati…