Xiangyu Yue
Improving the Generalization of Segmentation Foundation Models via Weakly-Supervised and Unsupervised Adaptation
The success of large language models has inspired the computer vision community to explore image segmentation foundation models that can generalize zero-/few-shot through prompt engineering. Segment-Anything (SAM), among others, is th…
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer fr…
VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning
We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulat…
Growing Visual Generative Capacity for Pre-Trained MLLMs
Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models…
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this…
Scaling Up Your Kernels: Large Kernel Design in ConvNets Toward Universal Representations
This paper proposes the paradigm of large convolutional kernels in designing modern Convolutional Neural Networks (ConvNets). We establish that employing a few large kernels, instead of stacking multiple smaller ones, can be a superior des…
MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly…
Native-Resolution Image Synthesis
We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution,…
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensiv…
Learning to Integrate Diffusion ODEs by Averaging the Derivatives
When accelerating diffusion model inference, numerical solvers perform poorly at extremely small step counts, while distillation techniques often introduce complexity and instability. This work presents an intermediate strategy, balancing performanc…
Multimodal Long Video Modeling Based on Temporal Dynamic Context
Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amou…
Training Matting Models Without Alpha Labels
The labeling difficulty has been a longstanding problem in deep image matting. To escape from fine labels, this work explores using rough annotations such as trimaps coarsely indicating the foreground/background as supervision. We present …
Video-R1: Reinforcing Video Reasoning in MLLMs
Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning withi…
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
Traditional spatiotemporal models generally rely on task-specific architectures, which limit their generalizability and scalability across diverse tasks due to domain-specific design requirements. In this paper, we introduce UniSTD…
Breaking the Encoder Barrier for Seamless Video-Language Understanding
Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model. However, this approach incurs high computational costs, introduces re…
Unleashing Vecset Diffusion Model for Fast Shape Generation
3D shape generation has greatly flourished through the development of so-called "native" 3D diffusion, particularly through the Vecset Diffusion Model (VDM). While recent advancements have shown promising results in generating high-resolut…
SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance
Generating reasonable and high-quality human interactive motions in a given dynamic environment is crucial for understanding, modeling, transferring, and applying human behaviors to both virtual and physical robots. In this paper, we intro…
Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model
Room layout estimation from multiple-perspective images is poorly investigated due to the complexities that emerge from multi-view geometry, which requires multi-step solutions such as camera intrinsic and extrinsic estimation, image matchi…
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and the capacity to reactively choose appropriate motor skills. Vision-language models (VLM…
HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States
The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focus…
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models
Fine-tuning large language models (LLMs) based on human preferences, commonly achieved through reinforcement learning from human feedback (RLHF), has been effective in improving their performance. However, maintaining LLM safety throughout…
RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting
While Multimodal Large Language Models (MLLMs) have made remarkable progress in vision-language reasoning, they are also more susceptible to producing harmful content compared to models that focus solely on text. Existing defensive prompti…
FairGen: Enhancing Fairness in Text-to-Image Diffusion Models via Self-Discovering Latent Directions
While Diffusion Models (DMs) exhibit remarkable performance across various image generation tasks, they nonetheless reflect the inherent bias present in the training set. As DMs are now widely used in real-world applications, these biases…
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single prompts, struggling to generate coh…
Why and How: Knowledge-Guided Learning for Cross-Spectral Image Patch Matching
Recently, cross-spectral image patch matching based on feature relation learning has attracted extensive attention. However, performance bottleneck problems have gradually emerged in existing methods. To address this challenge, we make the…
From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
Recently, single-frame infrared small target (SIRST) detection with single point supervision has drawn widespread attention. However, the latest label evolution with single point supervision (LESPS) framework suffers from instability, exc…
Chimera: Improving Generalist Model with Domain-Specific Experts
Recent advancements in Large Multi-modal Models (LMMs) underscore the importance of scaling by increasing image-text paired data, achieving impressive performance on general tasks. Despite their effectiveness in broad applications, general…
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide…
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in everyday life.…