Quanting Xie
MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
Systems such as video chatbots and navigation robots often depend on streaming image captioning to interpret visual inputs. Existing approaches typically employ large multimodal language models (MLLMs) for this purpose, but their substanti…
Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation
There is no limit to how much a robot might explore and learn, but all of that knowledge needs to be searchable and actionable. Within language research, retrieval augmented generation (RAG) has become the workhorse of large-scale non-para…
Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis
Building general-purpose robots that operate seamlessly in any environment, with any object, and utilizing various skills to complete diverse tasks has been a long-standing goal in Artificial Intelligence. However, as a community, we have …
Reasoning about the Unseen for Efficient Outdoor Object Navigation
Robots should exist anywhere humans do: indoors, outdoors, and even in unmapped environments. In contrast, recent advances in Object Goal Navigation (OGN) have focused on navigating indoor environments by leveraging spatial an…