Explanipedia

DiSA: Diffusion Step Annealing in Autoregressive Image Generation Open

Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Jay Gould, et al. · 2025

An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 …

Line-Search Filter Differential Dynamic Programming for Optimal Control with Nonlinear Equality Constraints Open

Ming Xu, Stephen Jay Gould, Iman Shames · 2025

We present FilterDDP, a differential dynamic programming algorithm for solving discrete-time, optimal control problems (OCPs) with nonlinear equality constraints. Unlike prior methods based on merit functions or the augmented Lagrangian cl…

Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data Open

Yiqun Duan, Sameera Ramasinghe, Stephen Jay Gould, Thalaiyasingam Ajanthan · 2025

Composed Image Retrieval (CIR) is the task of retrieving images matching a reference image augmented with a text, where the text describes changes to the reference image in natural language. Traditionally, models designed for CIR have reli…

ARINAR: Bi-Level Autoregressive Feature-by-Feature Generative Models Open

Qinyu Zhao, Stephen Jay Gould, Liang Zheng · 2025

Existing autoregressive (AR) image generative models use a token-by-token generation schema. That is, they predict a per-token probability distribution and sample the next token from that distribution. The main challenge is how to model th…

Negative Token Merging: Image-based Adversarial Feature Guidance Open

Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, et al. · 2024

Computer science Philosophy

Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to steer diffusion models away from producing undesired concepts. While useful, performing adversarial guidance using text alone can be insuff…

Manual-PA: Learning 3D Part Assembly from Instruction Diagrams Open

Jiahao Zhang, Anoop Cherian, Cristian Rodríguez Rivero, Weijian Deng, Stephen Jay Gould · 2024

Computer science Psychology Engineering

Assembling furniture amounts to solving the discrete-continuous optimization task of selecting the furniture parts to assemble and estimating their connecting poses in a physically realistic manner. The problem is hampered by its combinato…

Guiding Neural Collapse: Optimising Towards the Nearest Simplex Equiangular Tight Frame Open

Evan Markou, Thalaiyasingam Ajanthan, Stephen Jay Gould · 2024

Computer science Mathematics

Neural Collapse (NC) is a recently observed phenomenon in neural networks that characterises the solution space of the final classifier layer when trained until zero training loss. Specifically, NC suggests that the final classifier layer …

Neural Experts: Mixture of Experts for Implicit Neural Representations Open

Yizhak Ben-Shabat, Chamin Hewa Koneputugodage, Sameera Ramasinghe, Stephen Jay Gould · 2024

Computer science Psychology

Implicit neural representations (INRs) have proven effective in various tasks including image, shape, audio, and video reconstruction. These INRs typically learn the implicit field from sampled input points. This is often done using a sing…

Can We Predict Performance of Large Models across Vision-Language Tasks? Open

Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, et al. · 2024

Computer science

Evaluating large vision-language models (LVLMs) is very expensive, due to high computational cost and the wide variety of tasks. The good news is that if we already have some observed performance scores, we may be able to infer unknown one…

Temporally Grounding Instructional Diagrams in Unconstrained Videos Open

Jiahao Zhang, Frederic Z. Zhang, Cristian Rodríguez Rivero, Yizhak Ben-Shabat, Anoop Cherian, et al. · 2024

Computer science Engineering

We study the challenging problem of simultaneously localizing a sequence of queries in the form of instructional diagrams in a video. This requires understanding not only the individual queries but also their interrelationships. However, m…

Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation Open

Ming Xu, Stephen Jay Gould · 2024

Computer science Physics

We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temp…

The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models? Open

Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, et al. · 2024

Computer science Psychology

Large vision-language models (LVLMs), designed to interpret and respond to human instructions, occasionally generate hallucinated or harmful content due to inappropriate instructions. This study uses linear probing to shed light on the hid…

An Empirical Study Into What Matters for Calibrating Vision-Language Models Open

Weijie Tu, Weijian Deng, Dylan Campbell, Stephen Jay Gould, Tom Gedeon · 2024

Computer science Psychology Economics

Vision-Language Models (VLMs) have emerged as the dominant approach for zero-shot recognition, adept at handling diverse scenarios and significant distribution changes. However, their deployment in risk-sensitive areas requires a deeper un…

Towards Optimal Feature-Shaping Methods for Out-of-Distribution Detection Open

Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, et al. · 2024

Computer science Mathematics Philosophy

Feature shaping refers to a family of methods that exhibit state-of-the-art performance for out-of-distribution (OOD) detection. These approaches manipulate the feature representation, typically from the penultimate layer of a pre-trained …

Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines Open

Hamed Damirchi, Cristian Rodríguez-Opazo, Ehsan Abbasnejad, Damien Teney, Qinfeng Shi, et al. · 2023

Computer science Philosophy Chemistry

Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box. The Web likely contains the information necessary to exce…

View-coherent correlation consistency for semi-supervised semantic segmentation Open

Yunzhong Hou, Stephen Jay Gould, Liang Zheng · 2023

Computer science Mathematics Philosophy

Semi-supervised semantic segmentation needs rich and robust supervision for unlabeled data. However, promoting or punishing feature similarities with vanilla contrastive learning can be unreliable for semi-supervised semantic segmentation:…

Revisiting Implicit Differentiation for Learning Problems in Optimal Control Open

Ming Xu, Timothy L. Molloy, Stephen Jay Gould · 2023

Computer science Mathematics Geography

This paper proposes a new method for differentiating through optimal trajectories arising from non-convex, constrained discrete-time optimal control (COC) problems using the implicit function theorem (IFT). Previous works solve a different…

3D-GPT: Procedural 3D Modeling with Large Language Models Open

Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, et al. · 2023

Computer science Engineering

In the pursuit of efficient automated content creation, procedural generation, leveraging modifiable parameters and rule-based systems, emerges as a promising approach. Nonetheless, it could be a demanding endeavor, given its intricate nat…

Exploring Predicate Visual Context in Detecting Human-Object Interactions Open

Frederic Z. Zhang, Yuhui Yuan, Dylan Campbell, Zhuoyao Zhong, Stephen Jay Gould · 2023

Computer science Physics

Recently, the DETR framework has emerged as the dominant approach for human-object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. How…

Scaling Data Generation in Vision-and-Language Navigation Open

Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, et al. · 2023

Computer science Physics Psychology

Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity …

Learning Navigational Visual Representations with Semantic Map Supervision Open

Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui, et al. · 2023

Computer science Psychology

Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot. However, most existing works only employ visual backbones pre-trained either with independent images…

PMaF: Deep Declarative Layers for Principal Matrix Features Open

Zhiwei Xu, Hao Wang, Yanbin Liu, Stephen Jay Gould · 2023

Computer science Mathematics Engineering

We explore two differentiable deep declarative layers, namely least squares on sphere (LESS) and implicit eigen decomposition (IED), for learning the principal matrix features (PMaF). It can be used to represent data features with a low-di…

Towards Understanding Gradient Approximation in Equality Constrained Deep Declarative Networks Open

Stephen Jay Gould, Ming Xu, Zhiwei Xu, Yanbin Liu · 2023

Computer science Mathematics Sociology

We explore conditions for when the gradient of a deep declarative node can be approximated by ignoring constraint terms and still result in a descent direction for the global loss function. This has important practical application when tra…

Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder Open

Zheyuan Liu, Weixuan Sun, Damien Teney, Stephen Jay Gould · 2023

Computer science Biology Economics

Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these …

GoferBot: A Visual Guided Human-Robot Collaborative Assembly System Open

Zheyu Zhuang, Yizhak Ben-Shabat, Jiahao Zhang, Stephen Jay Gould, Robert Mahony · 2023

Computer science Engineering Chemistry

The current transformation towards smart manufacturing has led to a growing demand for human-robot collaboration (HRC) in the manufacturing process. Perceiving and understanding the human co-worker's behaviour introduces challenges for col…

Adaptive Cross Batch Normalization for Metric Learning Open

Thalaiyasingam Ajanthan, Matt Ma, Anton van den Hengel, Stephen Jay Gould · 2023

Computer science Mathematics Physics

Metric learning is a fundamental problem in computer vision whereby a model is trained to learn a semantically useful embedding space via ranking losses. Traditionally, the effectiveness of a ranking loss depends on the minibatch size, and…

Bi-directional Training for Composed Image Retrieval via Text Prompt Learning Open

Zheyuan Liu, Weixuan Sun, Yicong Hong, Damien Teney, Stephen Jay Gould · 2023

Computer science Mathematics Economics

Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text describing the desired changes. Existing approaches to solving this challenging task learn a mappin…

Aligning Step-by-Step Instructional Diagrams to Video Demonstrations Open

Jiahao Zhang, Anoop Cherian, Yanbin Liu, Yizhak Ben-Shabat, Cristian Rodríguez Rivero, et al. · 2023

Computer science

Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps that are depicted as assembly dia…

Deep Declarative Dynamic Time Warping for End-to-End Learning of Alignment Paths Open

Ming Xu, Sourav Garg, Michael Milford, Stephen Jay Gould · 2023

Computer science Mathematics Economics

This paper addresses learning end-to-end models for time series data that include a temporal alignment step via dynamic time warping (DTW). Existing approaches to differentiable DTW either differentiate through a fixed warping path or appl…

Stephen Jay Gould