Josh Susskind
STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows
Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal com…
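As background for the likelihood-based training the abstract refers to, normalizing flows maximize an exact log-likelihood through the change-of-variables formula; this is standard NF background, not the paper's specific objective:

```latex
% Change-of-variables log-likelihood maximized by an invertible flow f_\theta
% that maps data x to a simple base distribution p_Z (standard NF background).
\log p_X(x) = \log p_Z\big(f_\theta(x)\big)
            + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|
```

The second term, the log-determinant of the Jacobian, is what makes the likelihood exact and end-to-end trainable without a separate decoder or noise schedule.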
Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fi…
Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers
Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable, off-the-shelf video representations by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher. While EMA prevents repres…
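For context on the EMA-updated teacher the abstract mentions (the baseline that this paper replaces with a frozen teacher), here is a minimal PyTorch sketch of the update, assuming a generic `student` module; names and the momentum value are illustrative:

```python
import copy
import torch

def make_teacher(student: torch.nn.Module) -> torch.nn.Module:
    # Teacher starts as a copy of the student and is never updated by gradients.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999):
    # teacher <- m * teacher + (1 - m) * student, applied parameter-wise every step.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps.detach(), alpha=1.0 - momentum)
```

The compute argument in the abstract hinges on this step: an EMA teacher must be maintained and run throughout training, whereas a frozen teacher's targets can be reused.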
SimpleFold: Folding Proteins is Simpler than You Think
Protein folding models have achieved groundbreaking results, typically by integrating domain knowledge into both their architectural blocks and their training pipelines. Nonetheless, given the success of generative models across differ…
Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
Autoregressive models have driven remarkable progress in language modeling. Their foundational reliance on discrete tokens, unidirectional context, and single-pass decoding, while central to their success, also inspires the exploration of …
How PARTs assemble into wholes: Learning the relative composition of images
The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-superv…
Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting
Vision foundation models pre-trained on massive data encode rich representations of real-world concepts, which can be adapted to downstream tasks by fine-tuning. However, fine-tuning foundation models on one task often leads to the issue o…
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the…
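To make the two capacity dimensions concrete, the arithmetic below shows how a sparse mixture-of-experts layer decouples total parameters from per-token FLOPs; the sizes and expert counts are hypothetical, not values from the paper:

```python
# Illustrative-only arithmetic for one MoE feed-forward layer.
d_model, d_ff = 4096, 16384
n_experts, top_k = 64, 2

ffn_params_per_expert = 2 * d_model * d_ff         # two projection matrices per expert
total_params = n_experts * ffn_params_per_expert   # grows with the number of experts
active_params = top_k * ffn_params_per_expert      # fixed by the router's top-k choice

print(f"total: {total_params / 1e9:.2f}B params, active per token: {active_params / 1e9:.2f}B")
# Per-token FLOPs track the active parameters (~2 * active_params multiply-adds),
# so adding experts raises parameter count without raising per-token compute.
```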
3D Shape Tokenization via Latent Flow Matching
We introduce a latent 3D representation that models 3D surfaces as probability density functions in 3D, i.e., p(x,y,z), with flow-matching. Our representation is specifically designed for consumption by machine learning models, offering co…
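As a reference point for the flow-matching objective the abstract builds on, here is a generic conditional flow-matching training step on 3D points; `velocity_net` is a hypothetical network, and this is the standard linear-path formulation rather than the paper's exact parameterization:

```python
import torch

def flow_matching_loss(velocity_net, surface_points, cond=None):
    x1 = surface_points                      # points sampled from the shape surface, shape (B, N, 3)
    x0 = torch.randn_like(x1)                # samples from the Gaussian base distribution
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
    x_t = (1 - t) * x0 + t * x1              # point on the linear interpolation path
    target_v = x1 - x0                       # velocity of that path
    pred_v = velocity_net(x_t, t, cond)      # predicted 3D velocity field
    return ((pred_v - target_v) ** 2).mean()
```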
Normalizing Flows are Capable Generative Models
Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In…
INRFlow: Flow Matching for INRs in Ambient Space
Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on irregular or unstructured data like 3D point clouds or protein structures. These models are commonly trained …
TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models
Evaluating text-to-image generative models remains a challenge, despite the remarkable progress in their overall performance. While existing metrics like CLIPScore work for coarse evaluations, they lack the sensitivity to disti…
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models) where the visual concepts…
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process that gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully u…
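The "denoising a Markovian process" the abstract argues against is the standard DDPM-style objective; a generic sketch of that baseline (not DART itself) is shown below, where `eps_net` is a hypothetical noise-prediction network and `alphas_cumprod` is the cumulative product of the noise schedule:

```python
import torch

def ddpm_loss(eps_net, x0, alphas_cumprod):
    # x0: clean inputs, shape (B, ...); alphas_cumprod: 1-D tensor of length T on x0's device.
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)    # random diffusion step
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))           # broadcastable schedule value
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise                 # Markovian forward noising
    return ((eps_net(x_t, t) - noise) ** 2).mean()                       # predict the added noise
```

Because each training step only ever sees a single noised state x_t, the model conditions on x_t alone, which is the Markovian restriction the abstract refers to.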
On the benefits of pixel-based hierarchical policies for task generalization
Reinforcement learning practitioners often avoid hierarchical policies, especially in image-based observation spaces. Typically, the single-task performance improvement over flat-policy counterparts does not justify the additional complexi…
Improving GFlowNets for Text-to-Image Diffusion Alignment
Diffusion models, which are trained to match the distribution of their training data, have become the de facto approach for generating visual data. In addition, we want to control generation to fulfill desired properties such as align…
How Far Are We from Intelligent Visual Deductive Reasoning?
Vision-Language Models (VLMs) have recently made incredible strides on diverse vision-language tasks. We dig into vision-based deductive reasoning, a more sophisticated but less explored realm, and find previously unexposed blindsp…
Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization
Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle to handle open-domain visual concep…
What Algorithms can Transformers Learn? A Study in Length Generalization
Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of whether and when Transformer models can learn the true algo…
Matryoshka Diffusion Models
Diffusion models are the de facto approach for generating high-quality images and videos, but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to tra…
Adaptivity and Modularity for Efficient Generalization Over Task Complexity
Can transformers generalize efficiently on problems that require dealing with examples with different levels of difficulty? We introduce a new task tailored to assess generalization over different complexities and present results that indi…
Generative Modeling with Phase Stochastic Bridges
Diffusion models (DMs) are state-of-the-art generative models for continuous inputs. DMs work by constructing a Stochastic Differential Equation (SDE) in the input space (i.e., position space) and using a neural network to reverse it.…
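For reference, the position-space SDE construction the abstract describes is the standard score-based formulation (not the paper's phase-space bridge); in that setup the forward noising process and its learned reverse are:

```latex
% Forward SDE adds noise; the reverse-time SDE requires the score \nabla_x \log p_t(x),
% which a neural network is trained to approximate.
dx = f(x,t)\,dt + g(t)\,dw \qquad \text{(forward, position space)}
dx = \big[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,dt + g(t)\,d\bar{w} \qquad \text{(reverse)}
```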
Boolformer: Symbolic Regression of Logic Functions with Transformers
We introduce Boolformer, a Transformer-based model trained to perform end-to-end symbolic regression of Boolean functions. First, we show that it can predict compact formulas for complex functions not seen during training, given their full…
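To illustrate the task format (full truth table in, compact symbolic formula out), here is a toy example; the function and formula are hand-written for illustration, not model output:

```python
from itertools import product

def f(a, b, c):
    # A compact formula consistent with the full 8-row truth table printed below.
    return (a and not b) or (b and c)

# Enumerate the complete truth table over three Boolean inputs.
for a, b, c in product([False, True], repeat=3):
    print((a, b, c), "->", f(a, b, c))
```

The symbolic-regression setting asks the model to recover an expression like `(a AND NOT b) OR (b AND c)` given only the input-output rows.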
Construction of Paired Knowledge Graph-Text Datasets Informed by Cyclic Evaluation
Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be used to train forward and reverse neural models that generate text from KG and vice versa. However, models trained on datasets where KG and text pairs are not equivale…
Value function estimation using conditional diffusion models for control
A fairly reliable trend in deep reinforcement learning is that performance scales with the number of parameters, provided a complementary scaling in the amount of training data. As the appetite for large models increases, it is imperative …
BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping
Diffusion models have demonstrated excellent potential for generating diverse images. However, their performance often suffers from slow generation due to iterative denoising. Knowledge distillation has been recently proposed as a remedy t…
PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model
Autoregressive models for text sometimes generate repetitive and low-quality output because errors accumulate during the steps of generation. This issue is often attributed to exposure bias - the difference between how a model is trained, …
Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images
Diffusion models have recently become the de-facto approach for generative modeling in the 2D domain. However, extending diffusion models to 3D is challenging due to the difficulties in acquiring 3D ground truth data for training. On the o…
Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attent…
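The quantity tracked here is the Shannon entropy of each attention head's distribution over keys; a minimal sketch of that measurement, assuming softmax attention probabilities of shape (batch, heads, queries, keys), is:

```python
import torch

def attention_entropy(attn_probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    # attn_probs: (batch, heads, queries, keys), each row summing to 1.
    ent = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # entropy of each query's distribution
    return ent.mean(dim=(0, 2))                                 # one averaged entropy value per head
```

Entropy near zero means a head attends almost one-hot to a single key, which is the "entropy collapse" pattern the abstract associates with unstable training.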
MAST: Masked Augmentation Subspace Training for Generalizable Self-Supervised Priors
Recent Self-Supervised Learning (SSL) methods are able to learn feature representations that are invariant to different data augmentations, which can then be transferred to downstream tasks of interest. However, different downstream tasks …