Alan Yuille
Autoregressive Video Generation beyond Next Frames Prediction
Autoregressive models for video generation typically operate frame-by-frame, extending next-token prediction from language to video's temporal dimension. Whereas the word is universally agreed upon as the token in language, we question whether a frame is a…
Mixture of Contexts for Long Video Generation
Long video generation is fundamentally a long context memory problem: models must retain and retrieve salient events across a long range without collapsing or drifting. However, scaling diffusion transformers to generate long-context video…
Captain Cinema: Towards Short Movie Generation
We present Captain Cinema, a framework for short movie generation. Given a detailed textual description of a movie storyline, our approach first generates a sequence of keyframes that outline the entire narrative, which ensure…
4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos
Existing methods for reconstructing animatable 3D animals from videos typically rely on sparse semantic keypoints to fit parametric models. However, obtaining such keypoints is labor-intensive, and keypoint detectors trained on limited ani…
Learning Segmentation from Radiology Reports
Tumor segmentation in CT scans is key for diagnosis, surgery, and prognosis, yet segmentation masks are scarce because their creation requires time and expertise. Public abdominal CT datasets have from dozens to a couple thousand tumor mas…
OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions
Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem is how to use signals such as …
Fake it till You Make it: Reward Modeling as Discriminative Prediction
An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches of reward modeling suffer from implementation complexity due to their reliance …
Mamba-Reg: Vision Mamba Also Needs Registers
Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much mo…
Are Pixel-Wise Metrics Reliable for Sparse-View Computed Tomography Reconstruction?
Widely adopted evaluation metrics for sparse-view CT reconstruction--such as Structural Similarity Index Measure and Peak Signal-to-Noise Ratio--prioritize pixel-wise fidelity but often fail to capture the completeness of critical anatomic…
Medical World Model: Generative Simulation of Tumor Evolution for Treatment Planning
Providing effective treatment and making informed clinical decisions are essential goals of modern medicine and clinical care. We are interested in simulating disease dynamics for clinical decision-making, leveraging recent advances in lar…
PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation
Fine-grained robot manipulation, such as lifting and rotating a bottle to display the label on the cap, requires robust reasoning about object parts and their relationships with intended tasks. Despite recent advances in training general-p…
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
Vision-Language Models (VLMs) have shown promise in various 2D visual tasks, yet their readiness for 3D clinical diagnosis remains unclear due to stringent demands for recognition precision, reasoning ability, and domain knowledge. To syst…
Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers
Diffusion-based Transformers have demonstrated impressive generative capabilities, but their high computational costs hinder practical deployment; for example, generating an $8192\times 8192$ image can take over an hour on an A100 GPU. In …
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack this capability of 3D spatial reaso…
KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation
Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smoot…
DINeMo: Learning Neural Mesh Models with no 3D Annotations
Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding, which would enable a broad range of applications in robotics and embodied AI. Recent works explored neural mesh models that approach a ran…
Dictionary-based Framework for Interpretable and Consistent Object Parsing
In this work, we present CoCal, an interpretable and consistent object parsing framework based on dictionary-based mask transformer. Designed around Contrastive Components and Logical Constraints, CoCal rethinks existing cluster-based mask…
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks f…
EigenLoRAx: Recycling Adapters to Find Principal Subspaces for Resource-Efficient Adaptation and Inference
The rapid growth of large models has raised concerns about their environmental impact and equity in accessibility due to significant computational costs. Low-Rank Adapters (LoRA) offer a lightweight solution for finetuning large models, re…
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Since the introduction of Vision Transformer (ViT), patchification has long been regarded as a de facto image tokenization approach for plain visual architectures. By compressing the spatial size of images, this approach can effectively sh…
How Well Do Supervised 3D Models Transfer to Medical Imaging Tasks?
The pre-training and fine-tuning paradigm has become prominent in transfer learning. For example, if the model is pre-trained on ImageNet and then fine-tuned to PASCAL, it can significantly outperform that trained on PASCAL from scratch. W…
VideoAuteur: Towards Long Narrative Video Generation
Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limitin…
RadGPT: Constructing 3D Image-Text Tumor Datasets
With over 85 million CT scans performed annually in the United States, creating tumor-related reports is a challenging and time-consuming task for radiologists. To address this need, we present RadGPT, an Anatomy-Aware Vision-Language AI A…
Expectation-Maximization as the Engine of Scalable Medical Intelligence
Large, high-quality, annotated datasets are the foundation of medical AI research, but constructing even a small, moderate-quality, annotated dataset can take years of effort from multidisciplinary teams. Although active learning can prior…
Text-Driven Tumor Synthesis
Tumor synthesis can generate examples that AI often misses or over-detects, improving AI performance by training on these challenging cases. However, existing synthesis methods, which are typically unconditional -- generating images from r…
FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
Autoregressive (AR) modeling has achieved remarkable success in natural language processing by enabling models to generate text with coherence and contextual understanding through next token prediction. Recently, in image generation, VAR p…
Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise t…
GenEx: Generating an Explorable World
Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a system capable of pl…
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their…