Piotr Dollár
SAM 3: Segment Anything with Concepts
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars,…
SAM 3D: 3Dfy Anything in Images
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual re…
SAM 2: Segment Anything in Images and Videos
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video s…
RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection
This repository contains all the collected and aligned data for RU-AI dataset. It is constructed based on three large publicly available datasets: Flickr8K, COCO, and Places205, by adding their corresponding machine-generated pairs from fi…
LVIS: A dataset for large vocabulary instance segmentation
Progress on object detection is enabled by datasets that focus the research community's attention on open challenges. This process led us from simple images to complex scenes and from bounding boxes to segmentation masks. In this work, we …
Segment Anything
We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion mas…
The effectiveness of MAE pre-pretraining for billion-scale pretraining
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billion…
Revisiting Weakly Supervised Pre-Training of Visual Perception Models
Model pre-training is a cornerstone of modern visual recognition systems. Although fully supervised pre-training on datasets like ImageNet is still the de-facto standard, recent studies suggest that large-scale weakly supervised pre-traini…
Benchmarking Detection Transfer Learning with Vision Transformers
Object detection is a central downstream task used to test if pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial wh…
Masked Autoencoders Are Scalable Vision Learners
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core de…
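The mechanism named in the MAE abstract, masking random patches of the input image and reconstructing the missing pixels, can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the helper names (`patchify`, `random_masking`, `reconstruction_loss`), the noise-and-argsort shuffling used to pick patches, and the example `mask_ratio` of 0.75 are choices made here, and the encoder and decoder networks are omitted.

```python
import numpy as np

def patchify(images, patch):
    """Split (N, H, W, C) images into (N, L, patch*patch*C) flattened patches."""
    n, h, w, c = images.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = images.reshape(n, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (N, gh, gw, patch, patch, C)
    return x.reshape(n, gh * gw, patch * patch * c)

def random_masking(patches, mask_ratio, rng):
    """Keep a random subset of patches per image (hypothetical helper).

    Returns the visible patches, a binary mask in original patch order
    (0 = visible, 1 = masked), and the indices that undo the shuffle.
    """
    n, length, _ = patches.shape
    len_keep = int(round(length * (1 - mask_ratio)))
    noise = rng.random((n, length))                # one random score per patch
    ids_shuffle = np.argsort(noise, axis=1)        # patches with the smallest noise are kept
    ids_restore = np.argsort(ids_shuffle, axis=1)  # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]
    visible = np.take_along_axis(patches, ids_keep[:, :, None], axis=1)
    mask = np.ones((n, length))
    mask[:, :len_keep] = 0                                # in shuffled order
    mask = np.take_along_axis(mask, ids_restore, axis=1)  # back to original order
    return visible, mask, ids_restore

def reconstruction_loss(pred, target, mask):
    """Mean squared error averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return (per_patch * mask).sum() / mask.sum()

rng = np.random.default_rng(0)
images = rng.random((2, 32, 32, 3))
patches = patchify(images, patch=8)  # (2, 16, 192)
visible, mask, ids_restore = random_masking(patches, mask_ratio=0.75, rng=rng)
print(visible.shape, mask.sum(axis=1))  # (2, 4, 192) [12. 12.]
```

With a mask ratio of 0.75, only a quarter of the patches reach whatever encoder consumes `visible`, and the loss ignores the visible patches, which matches the abstract's framing of reconstructing the missing pixels.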
Early Convolutions Help Transformers See Better
Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional …
Fast and Accurate Model Scaling
In this work we analyze strategies for convolutional neural network scaling; that is, the process of scaling a base convolutional network to endow it with greater computational complexity and consequently representational power. Example sc…
Boundary IoU: Improving Object-Centric Image Segmentation Evaluation
We present Boundary IoU (Intersection-over-Union), a new segmentation evaluation measure focused on boundary quality. We perform an extensive analysis across different error types and object sizes and show that Boundary IoU is significantl…
Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details
By design, average precision (AP) for object detection aims to treat all classes independently: AP is computed independently per category and averaged. On one hand, this is desirable as it treats all classes equally. On the other hand, it …
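The per-category design described in this abstract (AP computed independently per category, then averaged) can be sketched as follows. This is a simplified illustration under assumptions: detections are taken as already matched to ground truth (the `is_tp` flag), AP is the non-interpolated area under the precision-recall curve, and the function names are hypothetical. Benchmark toolkits such as the COCO API additionally perform IoU-based matching and use interpolated precision, which this sketch omits.

```python
from collections import defaultdict

def average_precision(scores, is_tp, num_gt):
    """Non-interpolated AP for one category (hypothetical helper).

    scores: confidence of each detection; is_tp: whether each detection was
    matched to a ground-truth object (matching assumed done upstream);
    num_gt: number of ground-truth objects in this category.
    """
    if num_gt == 0:
        return None  # category has no ground truth; excluded from the mean
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp_so_far, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if is_tp[i]:
            tp_so_far += 1
            ap += tp_so_far / rank  # precision at each new recall level
    return ap / num_gt

def mean_ap(detections, num_gt_per_class):
    """detections: (class_id, score, is_tp) tuples. AP per class, then a plain mean."""
    by_class = defaultdict(lambda: ([], []))
    for cls, score, tp in detections:
        by_class[cls][0].append(score)
        by_class[cls][1].append(tp)
    aps = {}
    for cls, n_gt in num_gt_per_class.items():
        scores, tps = by_class[cls]
        ap = average_precision(scores, tps, n_gt)
        if ap is not None:
            aps[cls] = ap
    mean = sum(aps.values()) / len(aps) if aps else 0.0
    return mean, aps

dets = [("cat", 0.9, True), ("cat", 0.8, False), ("cat", 0.7, True), ("dog", 0.6, True)]
mean, per_class = mean_ap(dets, {"cat": 2, "dog": 1})
print(round(mean, 4), {k: round(v, 4) for k, v in per_class.items()})  # 0.9167 {'cat': 0.8333, 'dog': 1.0}
```

Because the final step is an unweighted mean, a category with a single ground-truth object contributes as much as a frequent one, which is the "treats all classes equally" property the abstract refers to.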
Designing Network Design Spaces
In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network ins…
Are Labels Necessary for Neural Architecture Search?
Existing neural network architectures in computer vision -- whether designed by humans or by machines -- were typically found using both images and their associated labels. In this paper, we ask the question: can we find high-quality neura…
On Network Design Spaces for Visual Recognition
Over the past several years progress in designing better neural network architectures for visual recognition has been substantial. To help sustain this rate of progress, in this work we propose to reexamine the methodology for comparing ne…
Panoptic Segmentation
We propose and study a task we name panoptic segmentation (PS). Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each obj…
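The abstract describes panoptic segmentation as unifying a per-pixel class label (semantic segmentation) with detected and segmented objects (instance segmentation). One way to picture the unified output is a (class id, instance id) pair for every pixel. The sketch below builds that representation from separate semantic and instance predictions. The function name, the void label of -1, and the merge rules are assumptions for illustration, not the paper's method: instances are painted in the given order so earlier masks win overlaps, thing instances take precedence over stuff labels, and thing-class pixels covered by no instance become void.

```python
import numpy as np

def merge_to_panoptic(semantic, instance_masks, instance_classes, thing_classes, void=-1):
    """Hypothetical merge into an (H, W, 2) array of (class_id, instance_id) per pixel.

    semantic: (H, W) int class map covering every pixel.
    instance_masks: list of (H, W) bool masks, assumed sorted by decreasing confidence.
    instance_classes: class id of each instance mask.
    thing_classes: class ids that are countable objects ("things").
    """
    h, w = semantic.shape
    cls = np.full((h, w), void, dtype=np.int64)
    inst = np.zeros((h, w), dtype=np.int64)  # 0 means "no instance"
    # Stuff pixels keep their semantic label.
    is_stuff = ~np.isin(semantic, list(thing_classes))
    cls[is_stuff] = semantic[is_stuff]
    # Thing instances are painted in order; already-claimed pixels are skipped,
    # so segments never overlap.
    claimed = np.zeros((h, w), dtype=bool)
    for k, (mask, c) in enumerate(zip(instance_masks, instance_classes), start=1):
        free = mask & ~claimed
        cls[free] = c
        inst[free] = k
        claimed |= free
    return np.stack([cls, inst], axis=-1)

# Toy example: class 0 = "sky" (stuff), class 1 = "person" (thing).
semantic = np.array([[0, 0, 1], [0, 1, 1]])
m1 = np.array([[False, False, True], [False, False, True]])
m2 = np.array([[False, False, True], [False, True, False]])  # overlaps m1 at (0, 2)
pan = merge_to_panoptic(semantic, [m1, m2], [1, 1], thing_classes={1})
print(pan[..., 0].tolist(), pan[..., 1].tolist())  # [[0, 0, 1], [0, 1, 1]] [[0, 0, 1], [0, 2, 1]]
```

Each pixel ends up with exactly one label, and two pixels of the same class can still be told apart by their instance id, which a semantic map alone cannot express.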
TensorMask: A Foundation for Dense Object Segmentation
Sliding-window object detectors that generate bounding-box object predictions over a dense, regular grid have advanced rapidly and proven popular. In contrast, modern instance segmentation approaches are dominated by methods that first det…
Panoptic Feature Pyramid Networks
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art met…
Rethinking ImageNet Pre-training
We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization. The results are no worse than their ImageNet pre-training counterparts even when usin…
Data Distillation: Towards Omni-Supervised Learning
We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Omni-supervised learning is lower-bounded by perf…
Learning to Segment Every Thing
Most methods for object instance segmentation require all training examples to be labeled with segmentation masks. This requirement makes it expensive to annotate new categories and has restricted instance segmentation models to ~100 well-…
Focal Loss for Dense Object Detection
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a reg…
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential…
Detecting and Recognizing Human-Object Interactions
To understand the visual world, a machine must not only recognize individual object instances but also how they interact. Humans are often at the center of such interactions and detecting human-object interactions is an important practical…
Mask R-CNN
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. Th…