Bingyi Kang
Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models
Foundation Vision Language Models (VLMs) exhibit strong capabilities in multi-modal representation learning, comprehension, and reasoning. By injecting action components into the VLMs, Vision-Language-Action models (VLAs) can be naturally …
Uncovering Untapped Potential in Sample-Efficient World Model Agents
World model (WM) agents enable sample-efficient reinforcement learning by learning policies entirely from simulated experience. However, existing token-based world models (TBWMs) are limited to visual inputs and discrete actions, restricti…
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been pro…
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive v…
Classification Done Right for Vision-Language Pre-Training
We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts against a text encoder, SuperClass directly utilizes tokenized raw text as…
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human i…
How Far is Video Generation from World Model: A Physical Law Perspective
OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human pr…
Loong: Generating Minute-level Long Videos with Autoregressive Language Models
It is desirable but challenging to generate content-rich long videos in the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natu…
Depth Anything V2
This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, compared with V1, this version produces much …
Improving Token-Based World Models with Parallel Observation Prediction
Motivated by the success of Transformers when applied to sequences of discrete symbols, token-based world models (TBWMs) were recently proposed as sample-efficient methods. In TBWMs, the world model consumes agent experience as a language-…
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circum…
Research on the double-layer clustering method of residential energy use characteristics under the background of energy system energy savings and carbon reduction
Accurately differentiating the energy consumption profiles of residential users is of great significance for load planning, scheduling, operation, and management of the power system, and is a basic prerequisite for realizing intelligent perception…
Harnessing Diffusion Models for Visual Perception with Meta Prompts
The issue of generative pretraining for vision models has persisted as a long-standing conundrum. At present, the text-to-image (T2I) diffusion model demonstrates remarkable proficiency in generating high-definition images matching textual…
FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models
Semantic segmentation has witnessed tremendous progress due to the proposal of various advanced network architectures. However, these architectures are extremely hungry for fine-grained annotations to train, whose acquisition is laborious and unaffordable.…
Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL
The divergence of the Q-value estimation has been a prominent issue in offline RL, where the agent has no access to real dynamics. Traditional beliefs attribute this instability to querying out-of-distribution actions when bootstrapping va…
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
LLMs have demonstrated remarkable abilities at interacting with humans through language, especially with the usage of instruction-following data. Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further enlarge their abili…
Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning
Deep reinforcement learning (RL) algorithms suffer severe performance degradation when the interaction data is scarce, which limits their real-world application. Recently, visual representation learning has been shown to be effective and p…
Decoupled Prioritized Resampling for Offline RL
Offline reinforcement learning (RL) is challenged by the distributional shift problem. To address this problem, existing works mainly focus on designing sophisticated policy constraints between the learned policy and the behavior policy. H…
Improving and Benchmarking Offline Reinforcement Learning Algorithms
Recently, Offline Reinforcement Learning (RL) has achieved remarkable progress with the emergence of various algorithms and datasets. However, these methods usually focus on algorithmic advancements, ignoring that many low-level implementa…
Efficient Diffusion Policies for Offline Reinforcement Learning
Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets, where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL significantly boosts the performance of offline RL by …
MADiff: Offline Multi-agent Learning with Diffusion Models
Offline reinforcement learning (RL) aims to learn policies from pre-existing datasets without further interactions, making it a challenging task. Q-learning algorithms struggle with extrapolation errors in offline settings, while supervise…
Bag of Tricks for Training Data Extraction from Language Models
With the advance of language models, privacy protection is receiving more attention. Training data extraction is therefore of great importance, as it can serve as a potential tool to assess privacy leakage. However, due to the difficulty o…
Phylogenomics, plastome structure and species identification in Mahonia (Berberidaceae)
Background: Elucidating the phylogenetic relationships within species-rich genera is essential but challenging, especially when lineages are assumed to have been going through radiation events. Mahonia Nutt. (Berberidaceae) is a genus with …
Boosting Offline Reinforcement Learning via Data Rebalancing
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets. To address this problem, existing works mainly focus on designing sophisticated algorithms to explicitly or implicitly co…