Josh Susskind
STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows
Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal com…
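As background for the likelihood-based training the abstract refers to, normalizing flows maximize an exact log-likelihood through the change-of-variables formula; this is standard NF background, not the paper's specific objective:

```latex
% Change-of-variables log-likelihood maximized by an invertible flow f_\theta
% that maps data x to a simple base distribution p_Z (standard NF background).
\log p_X(x) = \log p_Z\big(f_\theta(x)\big)
            + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|
```

The second term, the log-determinant of the Jacobian, is what makes the likelihood exact and end-to-end trainable without a separate decoder or noise schedule.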
Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fi…
Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers
Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable, off-the-shelf video representations by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher. While EMA prevents repres…
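For context on the EMA-updated teacher the abstract mentions (the baseline that this paper replaces with a frozen teacher), here is a minimal PyTorch sketch of the update, assuming a generic `student` module; names and the momentum value are illustrative:

```python
import copy
import torch

def make_teacher(student: torch.nn.Module) -> torch.nn.Module:
    # Teacher starts as a copy of the student and is never updated by gradients.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999):
    # teacher <- m * teacher + (1 - m) * student, applied parameter-wise every step.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps.detach(), alpha=1.0 - momentum)
```

The compute argument in the abstract hinges on this step: an EMA teacher must be maintained and run throughout training, whereas a frozen teacher's targets can be reused.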
SimpleFold: Folding Proteins is Simpler than You Think
Protein folding models have achieved groundbreaking results, typically by integrating domain knowledge into both their architectural blocks and their training pipelines. Nonetheless, given the success of generative models across differ…
Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
Autoregressive models have driven remarkable progress in language modeling. Their foundational reliance on discrete tokens, unidirectional context, and single-pass decoding, while central to their success, also inspires the exploration of …
How PARTs assemble into wholes: Learning the relative composition of images
The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-superv…
Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting
Vision foundation models pre-trained on massive data encode rich representations of real-world concepts, which can be adapted to downstream tasks by fine-tuning. However, fine-tuning foundation models on one task often leads to the issue o…
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the…
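To make the two capacity dimensions concrete, the arithmetic below shows how a sparse mixture-of-experts layer decouples total parameters from per-token FLOPs; the sizes and expert counts are hypothetical, not values from the paper:

```python
# Illustrative-only arithmetic for one MoE feed-forward layer.
d_model, d_ff = 4096, 16384
n_experts, top_k = 64, 2

ffn_params_per_expert = 2 * d_model * d_ff         # two projection matrices per expert
total_params = n_experts * ffn_params_per_expert   # grows with the number of experts
active_params = top_k * ffn_params_per_expert      # fixed by the router's top-k choice

print(f"total: {total_params / 1e9:.2f}B params, active per token: {active_params / 1e9:.2f}B")
# Per-token FLOPs track the active parameters (~2 * active_params multiply-adds),
# so adding experts raises parameter count without raising per-token compute.
```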
3D Shape Tokenization via Latent Flow Matching
We introduce a latent 3D representation that models 3D surfaces as probability density functions in 3D, i.e., p(x,y,z), with flow-matching. Our representation is specifically designed for consumption by machine learning models, offering co…
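As a reference point for the flow-matching objective the abstract builds on, here is a generic conditional flow-matching training step on 3D points; `velocity_net` is a hypothetical network, and this is the standard linear-path formulation rather than the paper's exact parameterization:

```python
import torch

def flow_matching_loss(velocity_net, surface_points, cond=None):
    x1 = surface_points                      # points sampled from the shape surface, shape (B, N, 3)
    x0 = torch.randn_like(x1)                # samples from the Gaussian base distribution
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
    x_t = (1 - t) * x0 + t * x1              # point on the linear interpolation path
    target_v = x1 - x0                       # velocity of that path
    pred_v = velocity_net(x_t, t, cond)      # predicted 3D velocity field
    return ((pred_v - target_v) ** 2).mean()
```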
Normalizing Flows are Capable Generative Models
Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In…
INRFlow: Flow Matching for INRs in Ambient Space
Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on irregular or unstructured data like 3D point clouds or protein structures. These models are commonly trained …
TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models
Evaluating text-to-image generative models remains a challenge, despite the remarkable progress in their overall performance. While existing metrics like CLIPScore work for coarse evaluations, they lack the sensitivity to disti…
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models) where the visual concepts…
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process that gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully u…
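The "denoising a Markovian process" the abstract argues against is the standard DDPM-style objective; a generic sketch of that baseline (not DART itself) is shown below, where `eps_net` is a hypothetical noise-prediction network and `alphas_cumprod` is the cumulative product of the noise schedule:

```python
import torch

def ddpm_loss(eps_net, x0, alphas_cumprod):
    # x0: clean inputs, shape (B, ...); alphas_cumprod: 1-D tensor of length T on x0's device.
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)    # random diffusion step
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))           # broadcastable schedule value
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise                 # Markovian forward noising
    return ((eps_net(x_t, t) - noise) ** 2).mean()                       # predict the added noise
```

Because each training step only ever sees a single noised state x_t, the model conditions on x_t alone, which is the Markovian restriction the abstract refers to.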
On the benefits of pixel-based hierarchical policies for task generalization
Reinforcement learning practitioners often avoid hierarchical policies, especially in image-based observation spaces. Typically, the single-task performance improvement over flat-policy counterparts does not justify the additional complexi…
Improving GFlowNets for Text-to-Image Diffusion Alignment
Diffusion models, which are trained to match the distribution of their training data, have become the de facto approach for generating visual data. In addition, we want to control generation to fulfill desired properties such as align…
How Far Are We from Intelligent Visual Deductive Reasoning?
Vision-Language Models (VLMs) have recently made incredible strides on diverse vision-language tasks. We dig into vision-based deductive reasoning, a more sophisticated but less explored realm, and find previously unexposed blindsp…
Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization
Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle to handle open-domain visual concep…
What Algorithms can Transformers Learn? A Study in Length Generalization
Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of whether and when Transformer models can learn the true algo…
Matryoshka Diffusion Models
Diffusion models are the de facto approach for generating high-quality images and videos, but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to tra…
Adaptivity and Modularity for Efficient Generalization Over Task Complexity
Can transformers generalize efficiently on problems that require dealing with examples with different levels of difficulty? We introduce a new task tailored to assess generalization over different complexities and present results that indi…
Generative Modeling with Phase Stochastic Bridges
Diffusion models (DMs) are state-of-the-art generative models for continuous inputs. DMs work by constructing a Stochastic Differential Equation (SDE) in the input space (i.e., position space) and using a neural network to reverse it.…
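For reference, the position-space SDE construction the abstract describes is the standard score-based formulation (not the paper's phase-space bridge); in that setup the forward noising process and its learned reverse are:

```latex
% Forward SDE adds noise; the reverse-time SDE requires the score \nabla_x \log p_t(x),
% which a neural network is trained to approximate.
dx = f(x,t)\,dt + g(t)\,dw \qquad \text{(forward, position space)}
dx = \big[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,dt + g(t)\,d\bar{w} \qquad \text{(reverse)}
```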
Boolformer: Symbolic Regression of Logic Functions with Transformers
We introduce Boolformer, a Transformer-based model trained to perform end-to-end symbolic regression of Boolean functions. First, we show that it can predict compact formulas for complex functions not seen during training, given their full…
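To illustrate the task format (full truth table in, compact symbolic formula out), here is a toy example; the function and formula are hand-written for illustration, not model output:

```python
from itertools import product

def f(a, b, c):
    # A compact formula consistent with the full 8-row truth table printed below.
    return (a and not b) or (b and c)

# Enumerate the complete truth table over three Boolean inputs.
for a, b, c in product([False, True], repeat=3):
    print((a, b, c), "->", f(a, b, c))
```

The symbolic-regression setting asks the model to recover an expression like `(a AND NOT b) OR (b AND c)` given only the input-output rows.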
Construction of Paired Knowledge Graph-Text Datasets Informed by Cyclic Evaluation
Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be used to train forward and reverse neural models that generate text from KG and vice versa. However, models trained on datasets where KG and text pairs are not equivale…
Value function estimation using conditional diffusion models for control
A fairly reliable trend in deep reinforcement learning is that performance scales with the number of parameters, provided a complementary scaling in the amount of training data. As the appetite for large models increases, it is imperative …
BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping
Diffusion models have demonstrated excellent potential for generating diverse images. However, their performance often suffers from slow generation due to iterative denoising. Knowledge distillation has been recently proposed as a remedy t…
PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model
Autoregressive models for text sometimes generate repetitive and low-quality output because errors accumulate during the steps of generation. This issue is often attributed to exposure bias - the difference between how a model is trained, …
Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images
Diffusion models have recently become the de-facto approach for generative modeling in the 2D domain. However, extending diffusion models to 3D is challenging due to the difficulties in acquiring 3D ground truth data for training. On the o…
Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attent…
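The quantity tracked here is the Shannon entropy of each attention head's distribution over keys; a minimal sketch of that measurement, assuming softmax attention probabilities of shape (batch, heads, queries, keys), is:

```python
import torch

def attention_entropy(attn_probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    # attn_probs: (batch, heads, queries, keys), each row summing to 1.
    ent = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # entropy of each query's distribution
    return ent.mean(dim=(0, 2))                                 # one averaged entropy value per head
```

Entropy near zero means a head attends almost one-hot to a single key, which is the "entropy collapse" pattern the abstract associates with unstable training.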
MAST: Masked Augmentation Subspace Training for Generalizable Self-Supervised Priors
Recent Self-Supervised Learning (SSL) methods are able to learn feature representations that are invariant to different data augmentations, which can then be transferred to downstream tasks of interest. However, different downstream tasks …