Qibin Hou
AgeBooth: Controllable Facial Aging and Rejuvenation via Diffusion Models Open
Recent diffusion model research focuses on generating identity-consistent images from a reference photo, but these models struggle to accurately control age while preserving identity, and fine-tuning such models often requires costly paired images…
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs Open
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcemen…
Deep Learning Empowered Super-Resolution: A Comprehensive Survey and Future Prospects Open
Super-resolution (SR) has garnered significant attention within the computer vision community, driven by advances in deep learning (DL) techniques and the growing demand for high-quality visual applications. With the expansion of this fiel…
OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation Open
Recent research on representation learning has proved the merits of multi-modal clues for robust semantic segmentation. Nevertheless, a flexible pretrain-and-finetune pipeline for multiple visual modalities remains unexplored. In this pape…
Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment Open
Semantic segmentation is fundamental to vision systems requiring pixel-level scene understanding, yet deploying it on resource-constrained devices demands efficient architectures. Although existing methods achieve real-time inference throu…
A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models Open
Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods that typically adopt fixed compression ratios cannot adapt to scenes of varying complexity, often…
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs Open
In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively …
Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning Open
Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet, grounding these instructions to precise interface elements remains challenging, especially …
DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation Open
Recent advances in scene understanding benefit a lot from depth maps because of the 3D geometry information, especially in complex conditions (e.g., low light and overexposed). Existing approaches encode depth maps along with RGB images an…
KAC: Kolmogorov-Arnold Classifier for Continual Learning Open
Continual learning requires models to train continuously across consecutive tasks without forgetting. Most existing methods utilize linear classifiers, which struggle to maintain a stable classification space while learning new tasks. Insp…
AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction Open
Novel view synthesis (NVS) is a cornerstone for image-to-3D creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose differ…
K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs Open
Recent studies have explored combining different LoRAs to jointly generate learned style and content. However, existing methods either fail to effectively preserve both the original subject and style simultaneously or require additional tr…
Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT Open
Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-…
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding Open
In this paper, we introduce LLaVA-Octopus, a novel video multimodal large language model. LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, enabling us to leverage the complementary stre…
Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection Open
Despite its rapid development, remote sensing object detection still struggles to detect high-aspect-ratio objects. This paper shows that large strip convolutions are good feature representation learners for remote sensing…
SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection Open
With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional object detection models are trained on a single dataset, often restricted to a specific imaging modali…
TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction Open
We present TAR3D, a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT) to generate high-quality 3D assets. The core insight of this work is to migrat…
High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation Open
Open-vocabulary image segmentation has been advanced through the synergy between mask generators and vision-language models like Contrastive Language-Image Pre-training (CLIP). Previous approaches focus on generating masks while aligning m…
Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction Open
Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fi…
ControlSR: Taming Diffusion Models for Consistent Real-World Image Super Resolution Open
We present ControlSR, a new method that can tame Diffusion Models for consistent real-world image super-resolution (Real-ISR). Previous Real-ISR models mostly focus on how to activate more generative priors of text-to-image diffusion model…
OPUS: Occupancy Prediction Using a Sparse Set Open
Occupancy prediction, which aims to predict the occupancy status within a voxelized 3D environment, is quickly gaining momentum within the autonomous driving community. Mainstream occupancy prediction works first discretize the 3D environment…
Towards Stable 3D Object Detection Open
In autonomous driving, the temporal stability of 3D object detection greatly impacts driving safety. However, detection stability cannot be assessed by existing metrics such as mAP and MOTA, and consequently is less explored by the…
Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation Open
Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while th…
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation Open
For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new w…
Multi-Task Dense Prediction via Mixture of Low-Rank Experts Open
Previous multi-task dense prediction methods based on the Mixture of Experts (MoE) have achieved strong performance, but they neglect the importance of explicitly modeling the global relations among all tasks. In this paper, we present a nov…
Polyper: Boundary Sensitive Polyp Segmentation Open
We present a new boundary sensitive framework for polyp segmentation, termed Polyper. Our method is motivated by a clinical approach in which seasoned medical practitioners often leverage the inherent features of interior polyp regions to tackl…
LSKNet: A Foundation Lightweight Backbone for Remote Sensing Open
Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, …