Guo-Jun Qi
From Human Hands to Robot Arms: Manipulation Skills Transfer via Trajectory Alignment
Learning diverse manipulation skills for real-world robots is severely bottlenecked by the reliance on costly and hard-to-scale teleoperated demonstrations. While human videos offer a scalable alternative, effectively transferring manipula…
FADPNet: Frequency-Aware Dual-Path Network for Face Super-Resolution
Face super-resolution (FSR) under limited computational costs remains an open problem. Existing approaches typically treat all facial pixels equally, resulting in suboptimal allocation of computational resources and degraded FSR performanc…
Study on microstructural evolution of 304 stainless steel during cryogenic rolling
The present study investigated the microstructural evolution of 304 stainless steel during cryogenic rolling. The results indicated that during cryogenic rolling of 304 stainless steel, with a reduction ratio of 20%, 50%, and 80%, the corr…
S2AFormer: Strip Self-Attention for Efficient Vision Transformer
Vision Transformer (ViT) has made significant advancements in computer vision, thanks to its token mixer's sophisticated ability to capture global dependencies between all tokens. However, the quadratic growth in computational demands as t…
Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models
We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent "thinking" action and optimizes the entire r…
Cross Paradigm Representation and Alignment Transformer for Image Deraining
Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge singl…
EigenActor: Variant Body-Object Interaction Generation Evolved from Invariant Action Basis Reasoning
This paper explores a cross-modality synthesis task that infers 3D human-object interactions (HOIs) from a given text-based instruction. Existing text-to-HOI synthesis methods mainly deploy a direct mapping from texts to object-specific 3D…
Self-Guidance: Boosting Flow and Diffusion Generation on Their Own
Proper guidance strategies are essential to achieve high-quality generation results without retraining diffusion and flow-based text-to-image models. Existing guidance either requires specific training or strong inductive biases of diffusi…
Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation
Diffusion and flow models have achieved remarkable successes in various applications such as text-to-image generation. However, these models typically rely on the same predetermined denoising schedules during inference for each prompt, whi…
SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation
The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants extensively validated across various downstream tasks, including semantic segmentation. However, designed as general-purpose visual encoders, V…
Flow Generator Matching
In the realm of Artificial Intelligence Generated Content (AIGC), flow-matching models have emerged as a powerhouse, achieving success due to their robust theoretical underpinnings and solid ability for large-scale generative modeling. The…
One-Step Diffusion Distillation through Score Implicit Matching
Despite their strong performances on many generative tasks, diffusion models require a large number of sampling steps in order to generate realistic samples. This has motivated the community to develop effective methods to distill pre-trai…
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling
Recent image generation models excel at creating high-quality images from brief captions. However, they fail to maintain consistency of multiple instances across images when encountering lengthy contexts. This inconsistency is largely due …
Multi-Condition Latent Diffusion Network for Scene-Aware Neural Human Motion Prediction
Inferring 3D human motion is fundamental in many applications, including understanding human activity and analyzing one's intention. While many fruitful efforts have been made in human motion prediction, most approaches focus on pose-drive…
Towards Open Domain Text-Driven Synthesis of Multi-Person Motions
This work aims to generate natural and diverse group motions of multiple humans from textual descriptions. While single-person text-to-motion generation is extensively studied, it remains challenging to synthesize motions for more than one…
Zero-shot High-fidelity and Pose-controllable Character Animation
Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity. However, existing approaches suffer from inconsistency of character appearances and poor prese…
BARET: Balanced Attention Based Real Image Editing Driven by Target-Text Inversion
Image editing approaches with diffusion models have been rapidly developed, yet their applicability is subject to requirements such as specific editing types (e.g., foreground or background object editing, style transfer), multiple condit…
UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures
Recent advances in 3D avatar generation have gained significant attention. These breakthroughs aim to produce more realistic animatable avatars, narrowing the gap between virtual and real-world experiences. Most existing works employ S…
Lightweight high-resolution Subject Matting in the Real World
Existing saliency object detection (SOD) methods struggle to satisfy fast inference and accurate results simultaneously in high resolution scenes. They are limited by the quality of public datasets and efficient network modules for high-re…
BARET : Balanced Attention based Real image Editing driven by Target-text Inversion
Image editing approaches with diffusion models have been rapidly developed, yet their applicability is subject to requirements such as specific editing types (e.g., foreground or background object editing, style transfer), multiple condit…
OmniMotionGPT: Animal Motion Generation with Limited Data
Our paper aims to generate diverse and realistic animal motion sequences from textual descriptions, without a large-scale animal text-motion dataset. While the task of text-driven human motion synthesis is already extensively studied and b…
Exploring the Robustness of Human Parsers Towards Common Corruptions
Human parsing aims to segment each pixel of the human image with fine-grained semantic categories. However, current human parsers trained with clean data are easily confused by numerous image corruptions such as blur and noise. To improve …
LatentAvatar: Learning Latent Expression Code for Expressive Neural Head Avatar
Existing approaches to animatable NeRF-based head avatars are either built upon face templates or use the expression coefficients of templates as the driving signal. Despite the promising progress, their performances are heavily bound by t…
HiEve: A Large-Scale Benchmark for Human-Centric Video Analysis in Complex Events
Guest Editorial: Special issue on media convergence and intelligent technology in the metaverse
The metaverse is a new type of Internet application and social form that integrates a variety of new technologies, including artificial intelligence, digital twins, blockchain, cloud computing, virtual reality, robots, with brain-computer…
High-Fidelity Clothed Avatar Reconstruction from a Single Image
This paper presents a framework for efficient 3D clothed avatar reconstruction. By combining the advantages of the high accuracy of optimization-based methods and the efficiency of learning-based methods, we propose a coarse-to-fine way to…
Monocular 3D Object Detection with Bounding Box Denoising in 3D by Perceiver
The main challenge of monocular 3D object detection is the accurate localization of the 3D center. Motivated by a new and strong observation that this challenge can be remedied by a 3D-space local-grid search scheme in an ideal case, we propos…
OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering
Controllability, generalizability and efficiency are the major objectives of constructing face avatars represented by neural implicit field. However, existing methods have not managed to accommodate the three requirements simultaneously. T…
DiffMesh: A Motion-aware Diffusion Framework for Human Mesh Recovery from Videos
Human mesh recovery (HMR) provides rich human body information for various real-world applications. While image-based HMR methods have achieved impressive results, they often struggle to recover humans in dynamic scenarios, leading to temp…
POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery
Transformer architectures have achieved SOTA performance on human mesh recovery (HMR) from monocular images. However, the performance gain has come at the cost of substantial memory and computational overhead. A lightweight and efficie…