Explanipedia

AgeBooth: Controllable Facial Aging and Rejuvenation via Diffusion Models Open

S. H. Zhu, Biao Cao, Zhen Li, Peng-Tao Jiang, Qibin Hou · 2025

Recent diffusion model research focuses on generating identity-consistent images from a reference photo, but they struggle to accurately control age while preserving identity, and fine-tuning such models often requires costly paired images…

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs Open

Yunheng Li, Jing Cheng, S. Jia, Hai-ju Kuang, Shaohui Jiao, et al. · 2025

This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcemen…

Deep Learning Empowered Super-Resolution: A Comprehensive Survey and Future Prospects Open

Le Zhang, Ao Li, Qibin Hou, Ce Zhu, Yonina C. Eldar · 2025

Super-resolution (SR) has garnered significant attention within the computer vision community, driven by advances in deep learning (DL) techniques and the growing demand for high-quality visual applications. With the expansion of this fiel…

OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation Open

Jian Cao, Yumin Chen, Qibin Hou · 2025

Recent research on representation learning has proved the merits of multi-modal clues for robust semantic segmentation. Nevertheless, a flexible pretrain-and-finetune pipeline for multiple visual modalities remains unexplored. In this pape…

Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment Open

Yunheng Li, Yu-Huan Wu, Qibin Hou · 2025

Semantic segmentation is fundamental to vision systems requiring pixel-level scene understanding, yet deploying it on resource-constrained devices demands efficient architectures. Although existing methods achieve real-time inference throu…

A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models Open

Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming‐Ming Cheng, et al. · 2025

Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods that typically adopt fixed compression ratios cannot adapt to scenes of varying complexity, often…

LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs Open

Baozhong Sun, Qibin Hou · 2025

In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively …

Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning Open

Xinbin Yuan, Jian Zhang, Kaixin Li, Zhiqiang Cai, Lujian Yao, et al. · 2025

Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet, grounding these instructions to precise interface elements remains challenging, especially …

DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation Open

Bowen Yin, Qibin Hou · 2025

Recent advances in scene understanding benefit a lot from depth maps because of the 3D geometry information, especially in complex conditions (e.g., low light and overexposed). Existing approaches encode depth maps along with RGB images an…

KAC: Kolmogorov-Arnold Classifier for Continual Learning Open

Yusong Hu, Zichen Liang, Fei Yang, Qibin Hou, Xialei Liu, et al. · 2025

Continual learning requires models to train continuously across consecutive tasks without forgetting. Most existing methods utilize linear classifiers, which struggle to maintain a stable classification space while learning new tasks. Insp…

AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction Open

Xuying Zhang, Yupeng Zhou, Kai Wang, Yi‐Kai Wang, Zhen Li, et al. · 2025

Novel view synthesis (NVS) is a cornerstone for image-to-3d creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose differ…

K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs Open

Zhongtao Ouyang, Zhen Li, Qibin Hou · 2025

Recent studies have explored combining different LoRAs to jointly generate learned style and content. However, existing methods either fail to effectively preserve both the original subject and style simultaneously or require additional tr…

Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT Open

Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, et al. · 2025

Computer science Geography

Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-…

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding Open

Jiaxing Zhao, Boyuan Sun, Lianglong Chen, Xihan Wei, Qibin Hou · 2025

Computer science Physics Philosophy

In this paper, we introduce LLaVA-Octopus, a novel video multimodal large language model. LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, enabling us to leverage the complementary stre…

Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection Open

Xinbin Yuan, Zhaohui Zheng, Yuxuan Li, Xialei Liu, Li Liu, et al. · 2025

Computer science Geography

While witnessed with rapid development, remote sensing object detection remains challenging for detecting high aspect ratio objects. This paper shows that large strip convolutions are good feature representation learners for remote sensing…

SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection Open

Y. Li, Xiangyang Li, Yunheng Li, Yicheng Zhang, Yimian Dai, et al. · 2024

Computer science Geography Chemistry

With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional Object detection models are trained on a single dataset, often restricted to a specific imaging modali…

TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction Open

Xuying Zhang, Yutong Liu, Yangguang Li, Renrui Zhang, Yufei Liu, et al. · 2024

Computer science Business Physics

We present TAR3D, a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT) to generate high-quality 3D assets. The core insight of this work is to migrat…

High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation Open

Quansheng Zeng, Yunheng Li, Daquan Zhou, Guanbin Li, Qibin Hou, et al. · 2024

Computer science Philosophy

Open-vocabulary image segmentation has been advanced through the synergy between mask generators and vision-language models like Contrastive Language-Image Pre-training (CLIP). Previous approaches focus on generating masks while aligning m…

Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction Open

Yunheng Li, Yuxuan Li, Quansheng Zeng, Wenhai Wang, Qibin Hou, et al. · 2024

Computer science Philosophy

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation recently is emerging as a promising approach for fi…

ControlSR: Taming Diffusion Models for Consistent Real-World Image Super Resolution Open

Yuhao Wan, Peng-Tao Jiang, Qibin Hou, Hao Zhang, Jinwei Chen, et al. · 2024

Computer science Geography Physics

We present ControlSR, a new method that can tame Diffusion Models for consistent real-world image super-resolution (Real-ISR). Previous Real-ISR models mostly focus on how to activate more generative priors of text-to-image diffusion model…

OPUS: Occupancy Prediction Using a Sparse Set Open

Jiabao Wang, Zi-Xuan Liu, Qiang Meng, Liujiang Yan, Ke Wang, et al. · 2024

Computer science Engineering

Occupancy prediction, aiming at predicting the occupancy status within voxelized 3D environment, is quickly gaining momentum within the autonomous driving community. Mainstream occupancy prediction works first discretize the 3D environment…

Towards Stable 3D Object Detection Open

Jiabao Wang, Qiang Meng, Guochao Liu, Liujiang Yan, Ke Wang, et al. · 2024

Computer science

In autonomous driving, the temporal stability of 3D object detection greatly impacts the driving safety. However, the detection stability cannot be assessed by existing metrics such as mAP and MOTA, and consequently is less explored by the…

Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation Open

Yunheng Li, Zhongyu Li, Quansheng Zeng, Qibin Hou, Ming‐Ming Cheng · 2024

Computer science Physics Chemistry

Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while th…

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation Open

Yupeng Zhou, Daquan Zhou, Ming‐Ming Cheng, Jiashi Feng, Qibin Hou · 2024

Computer science Psychology Engineering

For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new w…

Multi-Task Dense Prediction via Mixture of Low-Rank Experts Open

Yuqi Yang, Peng-Tao Jiang, Qibin Hou, Hao Zhang, Jinwei Chen, et al. · 2024

Computer science Mathematics Economics

Previous multi-task dense prediction methods based on the Mixture of Experts (MoE) have received great performance but they neglect the importance of explicitly modeling the global relations among all tasks. In this paper, we present a nov…

Polyper: Boundary Sensitive Polyp Segmentation Open

Hao Shao, Yang Zhang, Qibin Hou · 2024

Computer science Mathematics

We present a new boundary sensitive framework for polyp segmentation, termed Polyper. Our method is motivated by a clinical approach that seasoned medical practitioners often leverage the inherent features of interior polyp regions to tackl…

LSKNet: A Foundation Lightweight Backbone for Remote Sensing Open

Yuxuan Li, Xiang Li, Yimian Dai, Qibin Hou, Li Liu, et al. · 2024

Computer science Geology Geography

Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, …

Qibin Hou