Boyuan Jiang
Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance Open
Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. In this paper, we identify a long-overlooked yet pervasive issue in this family of models: a misalignment between the pre-defined noise level and t…
VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption Open
Modern video generation frameworks based on Latent Diffusion Models suffer from inefficiencies in tokenization due to the Frame-Proportional Information Assumption. Existing tokenizers provide fixed temporal compression rates, causing the …
CrossVTON: Mimicking the Logic Reasoning on Cross-category Virtual Try-on guided by Tri-zone Priors Open
Despite remarkable progress in image-based virtual try-on systems, generating realistic and robust fitting images for cross-category virtual try-on remains a challenging task. The primary difficulty arises from the absence of human-like re…
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation Open
To enhance the controllability of text-to-image diffusion models, current ControlNet-like models have explored various control signals to dictate image attributes. However, existing methods either handle conditions inefficiently or use a f…
Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing Open
Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model's domain and a flexible invariance control mechanism to preserve non-target conten…
VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing Open
Diffusion-based image editing models have made remarkable progress in recent years. However, achieving high-quality video editing remains a significant challenge. One major hurdle is the absence of open-source, large-scale video editing da…
FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on Open
Although image-based virtual try-on has made considerable progress, emerging approaches still encounter challenges in producing high-fidelity and robust fitting images across diverse scenarios. These methods often struggle with issues such…
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content Open
With the continuous progress of visual generation technologies, the scale of video datasets has grown exponentially. The quality of these datasets plays a pivotal role in the performance of video generation models. We assert that temporal …
VTON-HandFit: Virtual Try-on for Arbitrary Hand Pose Guided by Hand Priors Embedding Open
Although diffusion-based image virtual try-on has made considerable progress, emerging approaches still struggle to effectively address the issue of hand occlusion (i.e., clothing regions occluded by the hand part), leading to a notable de…
Oracle Bone Inscriptions Multi-modal Dataset Open
Oracle bone inscriptions (OBI) is the earliest developed writing system in China, bearing invaluable written exemplifications of early Shang history and paleography. However, the task of deciphering OBI, in the current climate of the schola…
NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models Open
Multimodal large language models (MLLMs) contribute a powerful mechanism to understanding visual information building on large language models. However, MLLMs are notorious for suffering from hallucinations, especially when generating leng…
A Multimodal, Multi-Task Adapting Framework for Video Action Recognition Open
Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction in video action recognition. Nevertheless, prevailing …
M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition Open
Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction in video action recognition. Nevertheless, prevailing …
PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization Open
Recent advancements in personalized image generation using diffusion models have been noteworthy. However, existing methods suffer from inefficiencies due to the requirement for subject-specific fine-tuning. This computationally intensive …
Probabilistic Triangulation for Uncalibrated Multi-View 3D Human Pose Estimation Open
3D human pose estimation has been a long-standing challenge in computer vision and graphics, where multi-view methods have significantly progressed but are limited by the tedious calibration processes. Existing multi-view methods are restr…
Dynamic Frame Interpolation in Wavelet Domain Open
Video frame interpolation is an important low-level vision task, which can increase frame rate for more fluent visual experience. Existing methods have achieved great success by employing advanced motion models and synthesis networks. Howe…
Pose-Aware Attention Network for Flexible Motion Retargeting by Body Part Open
Motion retargeting is a fundamental problem in computer graphics and computer vision. Existing approaches usually have many strict requirements, such as the source-target skeletons needing to have the same number of joints or share the sam…
IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation Open
Prevailing video frame interpolation algorithms, that generate the intermediate frames from consecutive inputs, typically rely on complex model architectures with heavy parameters or large delay, hindering them from diverse real-time appli…
Quantitative susceptibility mapping to evaluate brain iron deposition and its correlation with physiological parameters in hypertensive patients Open
These results indicate a role for brain iron overload in deep gray matter nuclei in hypertensive patients (HP) and suggest that HP is associated with excess brain iron in certain deep gray matter regions.
Learning Comprehensive Motion Representation for Action Recognition Open
For action recognition learning, 2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame. Recent efforts attempt to capture motion information by establishing inter-f…
Multi-Level Adaptive Region of Interest and Graph Learning for Facial Action Unit Recognition Open
In facial action unit (AU) recognition tasks, regional feature learning and AU relation modeling are two effective aspects which are worth exploring. However, the limited representation capacity of regional features makes it difficult for …
Imputation of Missing Traffic Flow Data Using Denoising Autoencoders Open
In transportation engineering, spatio-temporal data including traffic flow, speed, and occupancy are collected from different kinds of sensors and used by transportation engineers for analysis. However, the missing data influence the analy…
Hyperparameter Tuning to Optimize Implementations of Denoising Autoencoders for Imputation of Missing Spatio-temporal Data Open
Spatio-temporal data collected from sensors can sometimes have gaps where data is missing. Transportation planners and engineers use such data to perform various different types of analyses, but the gaps in the data make it difficult to ma…
STM: SpatioTemporal and Motion Encoding for Action Recognition Open
Spatiotemporal and motion features are two complementary and crucial information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion f…
Joint Domain Alignment and Discriminative Feature Learning for Unsupervised Deep Domain Adaptation Open
Recently, considerable effort has been devoted to deep domain adaptation in computer vision and machine learning communities. However, most of existing work only concentrates on learning shared feature representation by minimizing the dist…
Selective Transfer with Reinforced Transfer Network for Partial Domain Adaptation Open
One crucial aspect of partial domain adaptation (PDA) is how to select the relevant source samples in the shared classes for knowledge transfer. Previous PDA methods tackle this problem by re-weighting the source samples based on their hig…