Peihao Chen
Enhancing User-Oriented Proactivity in Open-Domain Dialogues with Critic Guidance
Open-domain dialogue systems aim to generate natural and engaging conversations, providing significant practical value in real applications such as social robotics and personal assistants. The advent of large language models (LLMs) has gre…
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
Research on 3D Vision-Language Models (3D-VLMs) is gaining increasing attention, which is crucial for developing embodied AI within 3D scenes, such as visual navigation and embodied question answering. Due to the high density of visual fea…
3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning
Constructing compact and informative 3D scene representations is essential for effective embodied exploration and reasoning, especially in complex environments over extended periods. Existing representations, such as object-centric 3D scen…
FlexAttention for Efficient High-Resolution Vision-Language Models
Current high-resolution vision-language models encode images as high-resolution image tokens and exhaustively take all these tokens to compute attention, which significantly increases the computational cost. To address this problem, we pro…
CoNav: A Benchmark for Human-Centered Collaborative Navigation
Human-robot collaboration, in which the robot intelligently assists the human with the upcoming task, is an appealing objective. To achieve this goal, the agent needs to be equipped with a fundamental collaborative navigation ability, wher…
MAGIC: Map-Guided Few-Shot Audio-Visual Acoustics Modeling
Few-shot audio-visual acoustics modeling seeks to synthesize the room impulse response in arbitrary locations with few-shot observations. To sufficiently exploit the provided few-shot data for accurate acoustic modeling, we present a map-…
3D-VLA: A 3D Vision-Language-Action Generative World Model
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecti…
Total syntheses of Tetrodotoxin and 9-epiTetrodotoxin
Tetrodotoxin and congeners are specific voltage-gated sodium channel blockers that exhibit remarkable anesthetic and analgesic effects. Here, we present scalable asymmetric syntheses of Tetrodotoxin and 9-epiTetrodotoxin from the abund…
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking…
Total Synthesis of Tetrodotoxin and 9-epiTetrodotoxin
The original dataset of "Total Synthesis of Tetrodotoxin and 9-epiTetrodotoxin", manuscript#: NCOMMS-23-01160D
SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector
In this paper, we attempt to specialize the VLM model for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector. Surprisingly, we observe that the combination of a simple knowledge distillation approa…
DCIR: Dynamic Consistency Intrinsic Reward for Multi-Agent Reinforcement Learning
Learning optimal behavior policy for each agent in multi-agent systems is an essential yet difficult problem. Despite fruitful progress in multi-agent reinforcement learning, the challenge of addressing the dynamics of whether two agents s…
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities …
FGPrompt: Fine-grained Goal Prompting for Image-goal Navigation
Learning to navigate to an image-specified goal is an important but challenging task for autonomous systems. The agent is required to reason about the goal location from where the picture was shot. Existing methods try to solve this problem by lear…
$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction ann…
3D-LLM: Injecting the 3D World into Large Language Models
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves rich…
Learning Vision-and-Language Navigation from YouTube Videos
Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions. Existing VLN methods suffer from training on small-scale environments or unreasonable path-instru…
Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition
This paper presents a paradigm that adapts general large-scale pretrained models (PTMs) to the speech emotion recognition task. Although PTMs shed new light on artificial general intelligence, they are constructed with general tasks in mind, a…
Detecting the open-world objects with the help of the Brain
Open World Object Detection (OWOD) is a novel computer vision task with a considerable challenge, bridging the gap between classic object detection (OD) benchmarks and real-world object detection. In addition to detecting and classifying s…
Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions. The instructions often contain descriptions of objects in the environment. To a…
Learning Active Camera for Multi-Object Navigation
Getting robots to navigate to multiple objects autonomously is essential yet difficult in robot applications. One of the key challenges is how to explore environments efficiently with camera sensors only. Existing navigation methods mainly…
Masked Motion Encoding for Self-Supervised Video Representation Learning
How to learn discriminative video representation from unlabeled videos is challenging but crucial for video analysis. The latest attempts seek to learn a representation model by predicting the appearance contents in the masked regions. How…
RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only, which can be reused for downstream tasks such as action recognition. This task, however, is extremely ch…
Discovery of a molecular glue promoting CDK12-DDB1 interaction to trigger cyclin K degradation
Molecular-glue degraders mediate interactions between target proteins and components of the ubiquitin-proteasome system to cause selective protein degradation. Here, we report a new molecular glue HQ461 discovered by high-throughput screen…
Author response: Discovery of a molecular glue promoting CDK12-DDB1 interaction to trigger cyclin K degradation
Molecular-glue degra…
Location-aware Graph Convolutional Networks for Video Question Answering
We addressed the challenging task of video question answering, which requires machines to answer questions about videos in a natural language form. Previous state-of-the-art methods attempt to apply spatio-temporal attention mechanism on v…
Foley Music: Learning to Generate Music from Videos
In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. We first identify two key intermediate representations for a successful video to music …