Jianfei Cai
Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning
Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector…
Relightable and Dynamic Gaussian Avatar Reconstruction from Monocular Video
Sharpness-Aware Data Generation for Zero-shot Quantization
Zero-shot quantization aims to learn a quantized model from a pre-trained full-precision model with no access to original real training data. The common idea in zero-shot quantization approaches is to generate synthetic data for quantizing…
An Empirical Study on How Video-LLMs Answer Video Questions
Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with …
MU-Diff: a mutual learning diffusion model for synthetic MRI with Application for brain lesions
Marginalized Generalized IoU (MGIoU): A Unified Objective Function for Optimizing Any Convex Parametric Shapes
Optimizing the similarity between parametric shapes is crucial for numerous computer vision tasks, where Intersection over Union (IoU) stands as the canonical measure. However, existing optimization methods exhibit significant shortcomings…
VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the attention of the community in their potential as world simulators. However, despite their capabil…
PCGS: Progressive Compression of 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) achieves impressive rendering fidelity and speed for novel view synthesis. However, its substantial data size poses a significant challenge for practical applications. While many compression techniques have bee…
HAC++: Towards 100X Compression of 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the substantial Gaussians and their associated attributes necessitate effective compres…
Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles
The recent advent of large-scale 3D data, e.g. Objaverse, has led to impressive progress in training pose-conditioned diffusion models for novel view synthesis. However, due to the synthetic nature of such 3D data, their performance drops …
PanSplat: 4K Panorama Synthesis with Feed-Forward Gaussian Splatting
With the advent of portable 360° cameras, panorama has gained significant attention in applications like virtual reality (VR), virtual tours, robotics, and autonomous driving. As a result, wide-baseline panorama view synthesis has emerged …
Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation
Data-Free Knowledge Distillation (DFKD) is an advanced technique that enables knowledge transfer from a teacher model to a student model without relying on original training data. While DFKD methods have achieved success on smaller dataset…
MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views
We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes, using only sparse observations. This setting is inherently ill-posed due to minimal overlap among input views and insufficie…
Normal-GS: 3D Gaussian Splatting with Normal-Involved Rendering
Rendering and reconstruction are long-standing topics in computer vision and graphics. Achieving both high rendering quality and accurate geometry is a challenge. Recent advancements in 3D Gaussian Splatting (3DGS) have enabled high-fideli…
Point-PRC: A Prompt Learning Based Regulation Framework for Generalizable Point Cloud Analysis
This paper investigates the 3D domain generalization (3DDG) ability of large 3D models based on prevalent prompt learning. Recent works demonstrate the performances of 3D point cloud recognition can be boosted remarkably by parameter-effic…
Fast Feedforward 3D Gaussian Splatting Compression
With 3D Gaussian Splatting (3DGS) advancing real-time and high-fidelity rendering for novel view synthesis, storage requirements pose challenges for their widespread adoption. Although various compression techniques have been proposed, pre…
Learning in Order! A Sequential Strategy to Learn Invariant Features for Multimodal Sentiment Analysis
This work proposes a novel and simple sequential learning strategy to train models on videos and texts for multimodal sentiment analysis. To estimate sentiment polarities on unseen out-of-distribution data, we introduce a multimodal model …
McCaD: Multi-Contrast MRI Conditioned, Adaptive Adversarial Diffusion Model for High-Fidelity MRI Synthesis
Magnetic Resonance Imaging (MRI) is instrumental in clinical diagnosis, offering diverse contrasts that provide comprehensive diagnostic information. However, acquiring multiple MRI contrasts is often constrained by high costs, long scanni…
McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction
Iso-surface extraction from an implicit field is a fundamental process in various applications of computer vision and graphics. When dealing with geometric shapes with complicated geometric details, many existing algorithms suffer from hig…
How Well Can Vision Language Models See Image Details?
Large Language Model-based Vision-Language Models (LLM-based VLMs) have demonstrated impressive results in various vision-language understanding tasks. However, how well these VLMs can see image detail beyond the semantic level remains unc…
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have a high potential to offer substantial …
AFFnet - a deep convolutional neural network for the detection of atypical femur fractures from anteriorposterior radiographs
Despite well-defined criteria for radiographic diagnosis of atypical femur fractures (AFFs), missed and delayed diagnosis is common. An AFF diagnostic software could provide timely AFF detection to prevent progression of incomplete or deve…
Differentiable Convex Polyhedra Optimization from Multi-view Images
This paper presents a novel approach for the differentiable rendering of convex polyhedra, addressing the limitations of recent methods that rely on implicit field supervision. Our technique introduces a strategy that combines non-differen…
SAM-Med3D-MoE: Towards a Non-Forgetting Segment Anything Model via Mixture of Experts for 3D Medical Image Segmentation
Volumetric medical image segmentation is pivotal in enhancing disease diagnosis, treatment planning, and advancing medical research. While existing volumetric foundation models for medical image segmentation, such as SAM-Med3D and SegVol, …
DrVideo: Document Retrieval Based Long Video Understanding
Most of the existing methods for video understanding primarily focus on videos only lasting tens of seconds, with limited exploration of techniques for handling long videos. The increased number of frames in long videos poses two main chal…
PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction
Personalizing a large-scale pretrained Text-to-Image (T2I) diffusion model is challenging as it typically struggles to make an appropriate trade-off between its training data distribution and the target distribution, i.e., learning a novel…
How Far Can We Compress Instant-NGP-Based NeRF?
In recent years, Neural Radiance Field (NeRF) has demonstrated remarkable capabilities in representing 3D scenes. To expedite the rendering process, learnable explicit representations have been introduced for combination with implicit NeRF…
Evaluation of the effectiveness of ankle arthrodesis options
Introduction Treatment methods for late stages of ankle osteoarthritis are varied, but the issue of assessing the long-term results of various fixation methods has not yet been studied, and this issue is of great importance in clinical pra…
Taming Stable Diffusion for Text to 360° Panorama Image Generation
Generative models, e.g., Stable Diffusion, have enabled the creation of photorealistic images from text prompts. Yet, the generation of 360-degree panorama images from text remains a challenge, particularly due to the dearth of paired text…