Deva Ramanan
Towards Foundational Models for Single-Chip Radar
mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions, such as weather and darkness. However, this comes at the cost of poor angular resolution, especially…
Label Uncertainty for Ultrasound Segmentation
In medical imaging, inter-observer variability among radiologists often introduces label uncertainty, particularly in modalities where visual interpretation is subjective. Lung ultrasound (LUS) is a prime example: it frequently presents a m…
MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion
We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expens…
Reconstruct, Inpaint, Finetune: Dynamic Novel-view Synthesis from Monocular Videos
We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches rely on costly test-time optimization of 4D representations or do not preserve scene geometry when trained in a feed-forward manner. Our approach is…
Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-dist…
DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion
Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning app…
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pai…
Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization
Many 3D generative models rely on variational autoencoders (VAEs) to learn compact shape representations. However, existing methods encode all shapes into a fixed-size token, disregarding the inherent variations in scale and complexity acr…
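The idea of letting token count adapt to shape complexity can be illustrated with a toy octree: occupied cells are recursively subdivided, so simple shapes yield few leaf cells while intricate ones yield many. This is a hypothetical sketch for intuition only, not the paper's actual tokenizer.

```python
import numpy as np

def adaptive_octree_leaves(points, center, half, depth, max_depth=4, max_pts=8):
    """Recursively subdivide occupied cells of an axis-aligned cube.
    Returns one (center, half-size) 'token' per occupied leaf cell, so the
    token count grows with geometric complexity. Toy illustration only."""
    points = np.asarray(points, float)
    center = np.asarray(center, float)
    inside = points[np.all(np.abs(points - center) <= half, axis=1)]
    if len(inside) == 0:
        return []                        # empty cells emit no token
    if depth == max_depth or len(inside) <= max_pts:
        return [(tuple(center), half)]   # one leaf token for this cell
    leaves = []
    for dx in (-0.5, 0.5):               # visit all eight child octants
        for dy in (-0.5, 0.5):
            for dz in (-0.5, 0.5):
                child = center + half * np.array([dx, dy, dz])
                leaves += adaptive_octree_leaves(inside, child, half / 2,
                                                 depth + 1, max_depth, max_pts)
    return leaves
```

A single clustered point produces one token, while a scattered cloud in the same cube produces many, mirroring the variable-length encoding the abstract argues for.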
Using Diffusion Priors for Video Amodal Segmentation
Object permanence in humans is a fundamental cue that helps in understanding persistence of objects, even when they are fully occluded in the scene. Present day methods in object segmentation do not account for this amodal nature of the wo…
LEARNER: Contrastive Pretraining for Learning Fine-Grained Patient Progression from Coarse Inter-Patient Labels
Predicting whether a treatment leads to meaningful improvement is a central challenge in personalized medicine, particularly when disease progression manifests as subtle visual changes over time. While data-driven deep learning (DL) offers…
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs …
Neural Eulerian Scene Flow Fields
We reframe scene flow as the task of estimating a continuous space-time ODE that describes motion for an entire observation sequence, represented with a neural prior. Our method, EulerFlow, optimizes this neural prior estimate against seve…
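Representing motion as a continuous space-time ODE means a point is advected by integrating a velocity field through time. The sketch below uses a simple analytic field as a hypothetical stand-in for the learned neural prior, with forward-Euler integration; it illustrates the formulation, not the paper's implementation.

```python
import numpy as np

def velocity_field(x, t):
    """Hypothetical stand-in for a learned velocity field f(x, t):
    constant drift along +x plus a mild time-dependent lift along +z."""
    return np.array([1.0, 0.0, 0.1 * t])

def euler_integrate(x0, t0, t1, num_steps=100):
    """Advect a 3D point from time t0 to t1 with forward-Euler steps,
    i.e. numerically solve dx/dt = f(x, t)."""
    x = np.asarray(x0, dtype=float)
    dt = (t1 - t0) / num_steps
    t = t0
    for _ in range(num_steps):
        x = x + dt * velocity_field(x, t)
        t += dt
    return x

p = euler_integrate([0.0, 0.0, 0.0], t0=0.0, t1=1.0)
```

Because the field is defined for any (x, t), the same representation yields flow between arbitrary pairs of times in the sequence, not just adjacent frames.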
Lidar Panoptic Segmentation in an Open World
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct…
SMORE: Simultaneous Map and Object REconstruction
We present a method for dynamic surface reconstruction of large-scale urban scenes from LiDAR. Depth-based reconstructions tend to focus on small-scale objects or large-scale SLAM reconstructions that treat moving objects as outliers. We t…
Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection
State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that se…
Reanimating Images using Neural Representations of Dynamic Stimuli
While computer vision models have made incredible strides in static image recognition, they still do not match human performance in tasks that require the understanding of complex, dynamic motion. This is notably true for real-world scenar…
RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection
This repository contains all the collected and aligned data for RU-AI dataset. It is constructed based on three large publicly available datasets: Flickr8K, COCO, and Places205, by adding their corresponding machine-generated pairs from fi…
Predicting Long-horizon Futures by Conditioning on Geometry and Time
Our work explores the task of generating future sensor observations conditioned on the past. We are motivated by "predictive coding" concepts from neuroscience as well as robotic applications such as self-driving vehicles. Predictive video…
Evaluating Text-to-Visual Generation with Image-to-Text Generation
Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (gen…
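The CLIPScore mentioned above reduces to a cosine similarity between unit-normalized image and text embeddings. A minimal sketch of that computation, using toy vectors rather than outputs of an actual CLIP model:

```python
import numpy as np

def cosine_alignment(img_emb, txt_emb):
    """CLIPScore-style alignment: cosine similarity of unit-normalized
    image and text embeddings. Toy vectors stand in for CLIP features."""
    img = np.asarray(img_emb, float)
    txt = np.asarray(txt_emb, float)
    img = img / np.linalg.norm(img)
    txt = txt / np.linalg.norm(txt)
    return float(img @ txt)

# identical directions score 1.0; orthogonal embeddings score 0.0
```

Such a bag-of-features similarity is exactly what makes the metric blind to compositional structure (attribute binding, relations), motivating the image-to-text evaluation the paper proposes instead.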
Better Call SAL: Towards Learning to Segment Anything in Lidar
We propose the SAL (Segment Anything in Lidar) method consisting of a text-promptable zero-shot model for segmenting and classifying any object in Lidar, and a pseudo-labeling engine that facilitates model training without manual supervisi…
I Can't Believe It's Not Scene Flow!
Current scene flow methods broadly fail to describe motion on small objects, and current scene flow evaluation protocols hide this failure by averaging over many points, with most drawn from larger objects. To fix this evaluation failure, we pr…
Cameras as Rays: Pose Estimation via Ray Diffusion
Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrins…
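The "camera as a bundle of rays" view can be made concrete: given intrinsics K and a world-to-camera pose (R, t), every pixel maps to a ray with a shared origin (the camera center) and a world-space direction. The sketch below shows one standard such conversion; the paper's exact ray parametrization may differ.

```python
import numpy as np

def pixel_rays(K, R, t, pixels):
    """Convert a pinhole camera (intrinsics K, world-to-camera rotation R,
    translation t) into per-pixel rays: a shared origin and unit world-space
    directions. One common parametrization, shown for illustration."""
    K = np.asarray(K, float)
    R = np.asarray(R, float)
    t = np.asarray(t, float)
    center = -R.T @ t                                      # camera center in world coords
    uv1 = np.column_stack([pixels, np.ones(len(pixels))])  # homogeneous pixel coords
    dirs_cam = (np.linalg.inv(K) @ uv1.T).T                # back-project into camera frame
    dirs_world = (R.T @ dirs_cam.T).T                      # rotate into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=1, keepdims=True)
    return center, dirs_world
```

Treating the per-pixel rays, rather than the global (R, t, K) parameters, as the quantity to denoise is what makes the representation amenable to diffusion over sparse views.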
FlashTex: Fast Relightable Mesh Texturing with LightControlNet
Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach di…
Improving Model's Interpretability and Reliability using Biomarkers
Accurate and interpretable diagnostic models are crucial in the safety-critical field of medicine. We investigate the interpretability of our proposed biomarker-based lung ultrasound diagnostic pipeline to enhance clinicians' diagnostic ca…
The Neglected Tails in Vision-Language Models
Vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its performance drops below 1…
Fast and Modular Autonomy Software for Autonomous Racing Vehicles
Autonomous motorsports aim to replicate the human racecar driver with software and sensors. As in traditional motorsports, Autonomous Racing Vehicles (ARVs) are pushed to their handling limits in multi-agent scenarios at extremely high …
Revisiting Few-Shot Object Detection with Vision-Language Models
The era of vision-language models (VLMs) trained on web-scale datasets challenges conventional formulations of "open-world" perception. In this work, we revisit the task of few-shot object detection (FSOD) in the context of recent foundati…
TAO-Amodal: A Benchmark for Tracking Any Object Amodally
Amodal perception, the ability to comprehend complete object structures from partial visibility, is a fundamental skill, even for infants. Its significance extends to applications like autonomous driving, where a clear understanding of hea…
Long-Tailed 3D Detection via Multi-Modal Fusion
Contemporary autonomous vehicle (AV) benchmarks have advanced techniques for training 3D detectors. While class labels naturally follow a long-tailed distribution in the real world, existing benchmarks only focus on a few common classes (e…