Wanggui He
DCoAR: Deep Concept Injection into Unified Autoregressive Models for Personalized Text-to-Image Generation
Unified autoregressive (AR) models excel at multimodal understanding and generation. However, their full potential for customized image generation has yet to be realized. Existing customization approaches for unified A…
MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis
Auto-regressive models have made significant progress in text-to-image synthesis, yet devising an appropriate model architecture and training strategy to achieve satisfactory performance remains an important avenue of exploration.…
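The snippet above cuts off before MARS's actual design, so the following is only a generic sketch of autoregressive text-to-image decoding, not MARS's mixture-of-AR architecture: a transformer continues the text prompt with discrete image tokens, which a separate VQ decoder would then map back to pixels. The `model` interface and token budget are hypothetical.

```python
import torch

@torch.no_grad()
def sample_image_tokens(model, prompt_ids, num_image_tokens=256, temperature=1.0):
    """Generic AR text-to-image decoding sketch (illustrative, not MARS-specific).

    Assumes `model(tokens)` returns next-token logits of shape (batch, seq, vocab).
    The returned image tokens would be passed to a VQ decoder to produce pixels.
    """
    tokens = prompt_ids.clone()                       # (1, prompt_len) text tokens
    for _ in range(num_image_tokens):
        logits = model(tokens)[:, -1, :]              # distribution over the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, prompt_ids.shape[1]:]            # only the generated image tokens
```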
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
The rapidly developing Large Vision Language Models (LVLMs) still face hallucination phenomena, where the generated responses do not align with the given contexts, significantly restricting their use. Most previous work detect…
Towards Enhanced Image Generation Via Multi-modal Chain of Thought in Unified Generative Models
Unified generative models have shown remarkable performance in text and image generation. For image synthesis tasks, they adopt straightforward text-to-image (T2I) generation. However, direct T2I generation limits the models in handling co…
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
We propose ReKV, a novel training-free approach that enables efficient streaming video question-answering (StreamingVQA), by seamlessly integrating with existing Video Large Language Models (Video-LLMs). Traditional VideoQA systems struggl…
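ReKV's actual retrieval mechanism is truncated above; the sketch below is only a hypothetical illustration of the general idea of in-context KV-cache retrieval: keep a per-chunk KV cache plus a pooled summary embedding while the stream is encoded, then at question time load only the chunks most relevant to the question. The class, method names, and cosine-similarity scoring are assumptions, not ReKV's interface.

```python
import torch
import torch.nn.functional as F

class KVCacheStore:
    """Illustrative store of per-chunk key/value caches for a streaming video."""

    def __init__(self, top_k: int = 4):
        self.top_k = top_k
        self.summaries = []   # one pooled embedding per chunk, shape (d,)
        self.kv_caches = []   # opaque per-chunk KV tensors from the backbone

    def add_chunk(self, chunk_embedding: torch.Tensor, kv_cache):
        # Called once per incoming video chunk as the stream is encoded.
        self.summaries.append(chunk_embedding)
        self.kv_caches.append(kv_cache)

    def retrieve(self, question_embedding: torch.Tensor):
        # Score each stored chunk summary against the question embedding.
        sims = F.cosine_similarity(
            torch.stack(self.summaries), question_embedding.unsqueeze(0), dim=-1
        )
        k = min(self.top_k, len(self.kv_caches))
        idx = sims.topk(k).indices.tolist()
        # Return caches in temporal order so the decoder sees a coherent stream.
        return [self.kv_caches[i] for i in sorted(idx)]
```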
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
We present HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressi…
CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers
Customized generation has achieved significant progress in image synthesis, yet personalized video generation remains challenging due to temporal inconsistencies and quality degradation. In this paper, we introduce CustomVideoX, an innovat…
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts
Evaluating the quality of synthesized images remains a significant challenge in the development of text-to-image (T2I) generation. Most existing studies in this area primarily focus on evaluating text-image alignment, image quality, and ob…
PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation
Finetuning-free personalized image generation can synthesize customized images without test-time finetuning, attracting wide research interest owing to its high efficiency. Current finetuning-free methods simply adopt a single training sta…
A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications
With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as a…
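For background on the survey's subject, here is a minimal sketch of the standard DPO objective as it is commonly implemented; the function name and tensor shapes are illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen response
    over the rejected one, relative to a frozen reference model.

    All inputs are summed log-probabilities of whole responses, shape (batch,).
    """
    # Implicit rewards: log-ratio of policy vs. reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin (Bradley-Terry preference likelihood).
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()
```

The reference model stays frozen throughout; only the policy is updated, which is what lets DPO act as an offline, reward-model-free alignment objective.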
TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition
While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multidimensional task scenarios. To address this issue,…
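TeamLoRA's expert-collaboration scheme is truncated above; as background, the sketch below shows only a vanilla LoRA-wrapped linear layer (frozen base weight plus a trainable low-rank update), with illustrative names and hyperparameters.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B(A x)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # down-projection A
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)   # start as a zero update (identity behavior)
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```

Only `lora_a` and `lora_b` receive gradients, which is what keeps optimizer memory low during fine-tuning.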
MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis
Auto-regressive models have made significant progress in the realm of language generation, yet they do not perform on par with diffusion models in the domain of image synthesis. In this work, we introduce MARS, a novel framework for T2I ge…
MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance
Recent advancements in text-to-image generation models have dramatically enhanced the generation of photorealistic images from textual prompts, leading to an increased interest in personalized text-to-image applications, particularly in mu…
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
The rapidly developing Large Vision Language Models (LVLMs) have shown notable capabilities on a range of multi-modal tasks, but still face hallucination phenomena, where the generated texts do not align with the given contexts, signifi…
TrainerAgent: Customizable and Efficient Model Training through LLM-Powered Multi-Agent System
Training AI models has always been challenging, especially when there is a need for custom models to provide personalized services. Algorithm engineers often face a lengthy process to iteratively develop models tailored to specific busines…
Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training
Pre-trained neural models have recently achieved impressive performance in understanding multimodal content. However, it remains very challenging to pre-train neural models for video and language understanding, especially for Chinese…
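The abstract is cut off before the method, so the following is only a generic sketch of symmetric contrastive (InfoNCE-style) video-text pre-training, not necessarily this paper's exact objective; the function name and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video clips and captions.

    video_emb, text_emb: (batch, dim) embeddings; matched pairs share an index.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; treat them as the positive class both ways.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```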