From Human Hands to Robot Arms: Manipulation Skills Transfer via Trajectory Alignment

Han Zhou, Jinjin Cao, Liyuan Ma, Xueji Fang, Guo-Jun Qi · 2025

Learning diverse manipulation skills for real-world robots is severely bottlenecked by the reliance on costly and hard-to-scale teleoperated demonstrations. While human videos offer a scalable alternative, effectively transferring manipula…

FADPNet: Frequency-Aware Dual-Path Network for Face Super-Resolution

Siyu Xu, Wenjie Li, Guangwei Gao, Jian Yang, Guo-Jun Qi, et al. · 2025

Face super-resolution (FSR) under limited computational costs remains an open problem. Existing approaches typically treat all facial pixels equally, resulting in suboptimal allocation of computational resources and degraded FSR performanc…

Study on microstructural evolution of 304 stainless steel during cryogenic rolling

Shuo Li, Miaomiao Zhao, Guo-Jun Qi, Xiaolu Li · 2025

The present study investigated the microstructural evolution of 304 stainless steel during cryogenic rolling. The results indicated that during cryogenic rolling of 304 stainless steel, with a reduction ratio of 20%, 50%, and 80%, the corr…

S2AFormer: Strip Self-Attention for Efficient Vision Transformer

Guoan Xu, Wenfeng Huang, Wenjing Jia, Jiamao Li, Guangwei Gao, et al. · 2025

Vision Transformer (ViT) has made significant advancements in computer vision, thanks to its token mixer's sophisticated ability to capture global dependencies between all tokens. However, the quadratic growth in computational demands as t…

Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models

Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, Guo-Jun Qi · 2025

We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent "thinking" action and optimizes the entire r…

Cross Paradigm Representation and Alignment Transformer for Image Deraining

Shasha Zou, Yi Zou, Juncheng Li, Guangwei Gao, Guo-Jun Qi · 2025

Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge singl…

EigenActor: Variant Body-Object Interaction Generation Evolved from Invariant Action Basis Reasoning

Xuehao Gao, Yi Yang, Shaoyi Du, Yang Wu, Yebin Liu, et al. · 2025

This paper explores a cross-modality synthesis task that infers 3D human-object interactions (HOIs) from a given text-based instruction. Existing text-to-HOI synthesis methods mainly deploy a direct mapping from texts to object-specific 3D…

Self-Guidance: Boosting Flow and Diffusion Generation on Their Own

Tiancheng Li, Weijian Luo, Zhiyang Chen, Liyuan Ma, Guo-Jun Qi · 2024

Proper guidance strategies are essential to achieve high-quality generation results without retraining diffusion and flow-based text-to-image models. Existing guidance either requires specific training or strong inductive biases of diffusi…

Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation

Zilyu Ye, Zhiyang Chen, Tiancheng Li, Zheng Huang, Weijian Luo, et al. · 2024

Diffusion and flow models have achieved remarkable successes in various applications such as text-to-image generation. However, these models typically rely on the same predetermined denoising schedules during inference for each prompt, whi…

SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation

Guoan Xu, Jiaming Chen, Wenfeng Huang, Wenjing Jia, Guangwei Gao, et al. · 2024

The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants extensively validated across various downstream tasks, including semantic segmentation. However, designed as general-purpose visual encoders, V…

Flow Generator Matching

Zheng Huang, Zhengyang Geng, Weijian Luo, Guo-Jun Qi · 2024

In the realm of Artificial Intelligence Generated Content (AIGC), flow-matching models have emerged as a powerhouse, achieving success due to their robust theoretical underpinnings and solid ability for large-scale generative modeling. The…

One-Step Diffusion Distillation through Score Implicit Matching

Weijian Luo, Zheng Huang, Zhengyang Geng, J. Zico Kolter, Guo-Jun Qi · 2024

Despite their strong performances on many generative tasks, diffusion models require a large number of sampling steps in order to generate realistic samples. This has motivated the community to develop effective methods to distill pre-trai…

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

Zilyu Ye, Jinxiu Liu, Ruotian Peng, Jinjin Cao, Zhiyang Chen, et al. · 2024

Recent image generation models excel at creating high-quality images from brief captions. However, they fail to maintain consistency of multiple instances across images when encountering lengthy contexts. This inconsistency is largely due …

Multi-Condition Latent Diffusion Network for Scene-Aware Neural Human Motion Prediction

Xuehao Gao, Yang Yang, Yang Wu, Shaoyi Du, Guo-Jun Qi · 2024

Inferring 3D human motion is fundamental in many applications, including understanding human activity and analyzing one's intention. While many fruitful efforts have been made toward human motion prediction, most approaches focus on pose-drive…

Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

Mengyi Shan, Lu Dong, Yutao Han, Yuan Yao, Tao Liu, et al. · 2024

This work aims to generate natural and diverse group motions of multiple humans from textual descriptions. While single-person text-to-motion generation is extensively studied, it remains challenging to synthesize motions for more than one…

Zero-shot High-fidelity and Pose-controllable Character Animation

Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, et al. · 2024

Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity. However, existing approaches suffer from inconsistency of character appearances and poor prese…

BARET: Balanced Attention Based Real Image Editing Driven by Target-Text Inversion

Yuming Qiao, Fanyi Wang, Jingwen Su, Yanhao Zhang, Yunjie Yu, et al. · 2024

Image editing approaches with diffusion models have been rapidly developed, yet their applicability is subject to requirements such as specific editing types (e.g., foreground or background object editing, style transfer), multiple condit…

UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures

Mingyuan Zhou, Rakib Hyder, Ziwei Xuan, Guo-Jun Qi · 2024

Recent advances in 3D avatar generation have gained significant attention. These breakthroughs aim to produce more realistic animatable avatars, narrowing the gap between virtual and real-world experiences. Most existing works employ S…

Lightweight high-resolution Subject Matting in the Real World

Peng Liu, Fanyi Wang, Jingwen Su, Yanhao Zhang, Guo-Jun Qi · 2023

Existing saliency object detection (SOD) methods struggle to satisfy fast inference and accurate results simultaneously in high resolution scenes. They are limited by the quality of public datasets and efficient network modules for high-re…

BARET: Balanced Attention based Real image Editing driven by Target-text Inversion

Yuming Qiao, Fanyi Wang, Jingwen Su, Yanhao Zhang, Yunjie Yu, et al. · 2023

Image editing approaches with diffusion models have been rapidly developed, yet their applicability is subject to requirements such as specific editing types (e.g., foreground or background object editing, style transfer), multiple condit…

OmniMotionGPT: Animal Motion Generation with Limited Data

Zhangsihao Yang, Mingyuan Zhou, Mengyi Shan, Bingbing Wen, Ziwei Xuan, et al. · 2023

Our paper aims to generate diverse and realistic animal motion sequences from textual descriptions, without a large-scale animal text-motion dataset. While the task of text-driven human motion synthesis is already extensively studied and b…

Exploring the Robustness of Human Parsers Towards Common Corruptions

Sanyi Zhang, Xiaochun Cao, Rui Wang, Guo-Jun Qi, Jie Zhou · 2023

Human parsing aims to segment each pixel of the human image with fine-grained semantic categories. However, current human parsers trained with clean data are easily confused by numerous image corruptions such as blur and noise. To improve …

LatentAvatar: Learning Latent Expression Code for Expressive Neural Head Avatar

Yuelang Xu, Hongwen Zhang, Lizhen Wang, Xiaochen Zhao, Han Huang, et al. · 2023

Existing approaches to animatable NeRF-based head avatars are either built upon face templates or use the expression coefficients of templates as the driving signal. Despite the promising progress, their performances are heavily bound by t…

HiEve: A Large-Scale Benchmark for Human-Centric Video Analysis in Complex Events

Weiyao Lin, Huabin Liu, Shizhan Liu, Yuxi Li, Hongkai Xiong, et al. · 2023

Guest Editorial: Special issue on media convergence and intelligent technology in the metaverse

Siwei Ma, Maoguo Gong, Guo-Jun Qi, Yun Tie, Ivan Lee, et al. · 2023

The metaverse is a new type of Internet application and social form that integrates a variety of new technologies, including artificial intelligence, digital twins, blockchain, cloud computing, virtual reality, robots, with brain-computer…

High-Fidelity Clothed Avatar Reconstruction from a Single Image

Tingting Liao, Xiaomei Zhang, Yuliang Xiu, Hongwei Yi, Xudong Liu, et al. · 2023

This paper presents a framework for efficient 3D clothed avatar reconstruction. By combining the advantages of the high accuracy of optimization-based methods and the efficiency of learning-based methods, we propose a coarse-to-fine way to…

Monocular 3D Object Detection with Bounding Box Denoising in 3D by Perceiver

Xianpeng Liu, Ce Zheng, Kelvin Cheng, Nan Xue, Guo-Jun Qi, et al. · 2023

The main challenge of monocular 3D object detection is the accurate localization of the 3D center. Motivated by a new and strong observation that this challenge can be remedied by a 3D-space local-grid search scheme in an ideal case, we propos…

OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering

Zhiyuan Ma, Xiangyu Zhu, Guo-Jun Qi, Zhen Lei, Lei Zhang · 2023

Controllability, generalizability and efficiency are the major objectives of constructing face avatars represented by a neural implicit field. However, existing methods have not managed to accommodate the three requirements simultaneously. T…

DiffMesh: A Motion-aware Diffusion Framework for Human Mesh Recovery from Videos

Ce Zheng, Guo-Jun Qi, Chen Chen · 2023

Human mesh recovery (HMR) provides rich human body information for various real-world applications. While image-based HMR methods have achieved impressive results, they often struggle to recover humans in dynamic scenarios, leading to temp…

POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery

Ce Zheng, Xianpeng Liu, Guo-Jun Qi, Chen Chen · 2023

Transformer architectures have achieved SOTA performance on human mesh recovery (HMR) from monocular images. However, the performance gain has come at the cost of substantial memory and computational overhead. A lightweight and efficie…

Guo-Jun Qi