Xiangyu Yue
Improving the Generalization of Segmentation Foundation Models via Weakly-Supervised and Unsupervised Adaptation
The success of large language models has inspired the computer vision community to explore image segmentation foundation models that can generalize zero-/few-shot through prompt engineering. Segment-Anything (SAM), among others, is th…
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer fr…
VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning
We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulat…
Growing Visual Generative Capacity for Pre-Trained MLLMs
Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models…
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this…
Scaling Up Your Kernels: Large Kernel Design in ConvNets Toward Universal Representations
This paper proposes the paradigm of large convolutional kernels in designing modern Convolutional Neural Networks (ConvNets). We establish that employing a few large kernels, instead of stacking multiple smaller ones, can be a superior des…
MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly…
Native-Resolution Image Synthesis
We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution,…
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensiv…
Learning to Integrate Diffusion ODEs by Averaging the Derivatives
When accelerating diffusion model inference, numerical solvers perform poorly at extremely small step counts, while distillation techniques often introduce complexity and instability. This work presents an intermediate strategy, balancing performanc…
Multimodal Long Video Modeling Based on Temporal Dynamic Context
Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amou…
Training Matting Models Without Alpha Labels
The labeling difficulty has been a longstanding problem in deep image matting. To escape from fine labels, this work explores using rough annotations such as trimaps coarsely indicating the foreground/background as supervision. We present …
Video-R1: Reinforcing Video Reasoning in MLLMs
Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning withi…
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
Traditional spatiotemporal models generally rely on task-specific architectures, which limit their generalizability and scalability across diverse tasks due to domain-specific design requirements. In this paper, we introduce UniSTD…
Breaking the Encoder Barrier for Seamless Video-Language Understanding
Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model. However, this approach incurs high computational costs, introduces re…
Unleashing Vecset Diffusion Model for Fast Shape Generation
3D shape generation has greatly flourished through the development of so-called "native" 3D diffusion, particularly through the Vecset Diffusion Model (VDM). While recent advancements have shown promising results in generating high-resolut…
SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance
Generating reasonable and high-quality human interactive motions in a given dynamic environment is crucial for understanding, modeling, transferring, and applying human behaviors to both virtual and physical robots. In this paper, we intro…
Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model
Room layout estimation from multiple-perspective images is poorly investigated due to the complexities that emerge from multi-view geometry, which requires multi-step solutions such as camera intrinsic and extrinsic estimation, image matchi…
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and the capacity to reactively choose appropriate motor skills. Vision-language models (VLM…
HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States
The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focus…
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models
Fine-tuning large language models (LLMs) based on human preferences, commonly achieved through reinforcement learning from human feedback (RLHF), has been effective in improving their performance. However, maintaining LLM safety throughout…
RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting
While Multimodal Large Language Models (MLLMs) have made remarkable progress in vision-language reasoning, they are also more susceptible to producing harmful content compared to models that focus solely on text. Existing defensive prompti…
FairGen: Enhancing Fairness in Text-to-Image Diffusion Models via Self-Discovering Latent Directions
While Diffusion Models (DMs) exhibit remarkable performance across various image generation tasks, they nonetheless reflect the inherent bias present in the training set. As DMs are now widely used in real-world applications, these biases…
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single prompts, struggling to generate coh…
Why and How: Knowledge-Guided Learning for Cross-Spectral Image Patch Matching
Recently, cross-spectral image patch matching based on feature relation learning has attracted extensive attention. However, performance bottleneck problems have gradually emerged in existing methods. To address this challenge, we make the…
From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
Recently, single-frame infrared small target (SIRST) detection with single point supervision has drawn widespread attention. However, the latest label evolution with single point supervision (LESPS) framework suffers from instability, exc…
Chimera: Improving Generalist Model with Domain-Specific Experts
Recent advancements in Large Multi-modal Models (LMMs) underscore the importance of scaling by increasing image-text paired data, achieving impressive performance on general tasks. Despite their effectiveness in broad applications, general…
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide…
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in everyday life.…