
DialogGen: Multi-modal Interactive Dialogue System with Multi-turn Text-Image Generation

Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, et al. · 2025

Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data

Haonan Wang, Minbin Huang, Runhui Huang, Lanqing Hong, Songcen Xu, et al. · 2025

Efficient Multi-modal Large Language Models via Visual Token Grouping

Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, et al. · 2024

The development of Multi-modal Large Language Models (MLLMs) enhances Large Language Models (LLMs) with the ability to perceive data formats beyond text, significantly advancing a range of downstream applications, such as visual question a…

DAPE V2: Process Attention Score as Feature Map for Length Extrapolation

Chuanyang Zheng, Yihang Gao, Han Shi, Jing Xiong, Jiankai Sun, et al. · 2024

The attention mechanism is a fundamental component of the Transformer model, contributing to interactions among distinct tokens, in contrast to earlier feed-forward neural networks. In general, the attention scores are determined simply by…

DAPE: Data-Adaptive Positional Encoding for Length Extrapolation

Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, et al. · 2024

Positional encoding plays a crucial role in transformers, significantly impacting model performance and length generalization. Prior research has introduced absolute positional encoding (APE) and relative positional encoding (RPE) to disti…

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, et al. · 2024

We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We a…

TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

Junhao Cheng, Baiqiao Yin, Kaixin Cai, Minbin Huang, Hanhui Li, et al. · 2024

Recent advances in diffusion models can generate high-quality and stunning images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still faces challenges in maintaining semantic consistency …

DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, et al. · 2024

Text-to-image (T2I) generation models have significantly advanced in recent years. However, effective interaction with these models is challenging for average users due to the need for specialized prompt engineering knowledge and the inabi…

Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data

Haonan Wang, Minbin Huang, Runhui Huang, Lanqing Hong, Hang Xu, et al. · 2023

Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning. Improving CLIP typically requires additional data and retraining with new loss functions, but these demands raise re…

Arch-Graph: Acyclic Architecture Relation Predictor for Task-Transferable Neural Architecture Search

Minbin Huang, Zhijian Huang, Changlin Li, Xin Chen, Hang Xu, et al. · 2022

Neural Architecture Search (NAS) aims to find efficient models for multiple tasks. Beyond seeking solutions for a single task, there is surging interest in transferring network design knowledge across multiple tasks. In this line of rese…
