Stephen Jay Gould
DiSA: Diffusion Step Annealing in Autoregressive Image Generation
An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 …
Line-Search Filter Differential Dynamic Programming for Optimal Control with Nonlinear Equality Constraints
We present FilterDDP, a differential dynamic programming algorithm for solving discrete-time, optimal control problems (OCPs) with nonlinear equality constraints. Unlike prior methods based on merit functions or the augmented Lagrangian cl…
Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data
Composed Image Retrieval (CIR) is the task of retrieving images matching a reference image augmented with a text, where the text describes changes to the reference image in natural language. Traditionally, models designed for CIR have reli…
ARINAR: Bi-Level Autoregressive Feature-by-Feature Generative Models
Existing autoregressive (AR) image generative models use a token-by-token generation schema. That is, they predict a per-token probability distribution and sample the next token from that distribution. The main challenge is how to model th…
Negative Token Merging: Image-based Adversarial Feature Guidance
Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to steer diffusion models away from producing undesired concepts. While useful, performing adversarial guidance using text alone can be insuff…
Manual-PA: Learning 3D Part Assembly from Instruction Diagrams
Assembling furniture amounts to solving the discrete-continuous optimization task of selecting the furniture parts to assemble and estimating their connecting poses in a physically realistic manner. The problem is hampered by its combinato…
Guiding Neural Collapse: Optimising Towards the Nearest Simplex Equiangular Tight Frame
Neural Collapse (NC) is a recently observed phenomenon in neural networks that characterises the solution space of the final classifier layer when trained until zero training loss. Specifically, NC suggests that the final classifier layer …
Neural Experts: Mixture of Experts for Implicit Neural Representations
Implicit neural representations (INRs) have proven effective in various tasks including image, shape, audio, and video reconstruction. These INRs typically learn the implicit field from sampled input points. This is often done using a sing…
Can We Predict Performance of Large Models across Vision-Language Tasks?
Evaluating large vision-language models (LVLMs) is very expensive, due to high computational cost and the wide variety of tasks. The good news is that if we already have some observed performance scores, we may be able to infer unknown one…
Temporally Grounding Instructional Diagrams in Unconstrained Videos
We study the challenging problem of simultaneously localizing a sequence of queries in the form of instructional diagrams in a video. This requires understanding not only the individual queries but also their interrelationships. However, m…
Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation
We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temp…
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
Large vision-language models (LVLMs), designed to interpret and respond to human instructions, occasionally generate hallucinated or harmful content due to inappropriate instructions. This study uses linear probing to shed light on the hid…
An Empirical Study Into What Matters for Calibrating Vision-Language Models
Vision-Language Models (VLMs) have emerged as the dominant approach for zero-shot recognition, adept at handling diverse scenarios and significant distribution changes. However, their deployment in risk-sensitive areas requires a deeper un…
Towards Optimal Feature-Shaping Methods for Out-of-Distribution Detection
Feature shaping refers to a family of methods that exhibit state-of-the-art performance for out-of-distribution (OOD) detection. These approaches manipulate the feature representation, typically from the penultimate layer of a pre-trained …
Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box. The Web likely contains the information necessary to exce…
View-coherent correlation consistency for semi-supervised semantic segmentation
Semi-supervised semantic segmentation needs rich and robust supervision for unlabeled data. However, promoting or punishing feature similarities with vanilla contrastive learning can be unreliable for semi-supervised semantic segmentation:…
Revisiting Implicit Differentiation for Learning Problems in Optimal Control
This paper proposes a new method for differentiating through optimal trajectories arising from non-convex, constrained discrete-time optimal control (COC) problems using the implicit function theorem (IFT). Previous works solve a different…
3D-GPT: Procedural 3D Modeling with Large Language Models
In the pursuit of efficient automated content creation, procedural generation, leveraging modifiable parameters and rule-based systems, emerges as a promising approach. Nonetheless, it could be a demanding endeavor, given its intricate nat…
Exploring Predicate Visual Context in Detecting Human-Object Interactions
Recently, the DETR framework has emerged as the dominant approach for human-object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. How…
Scaling Data Generation in Vision-and-Language Navigation
Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity …
Learning Navigational Visual Representations with Semantic Map Supervision
Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot. However, most existing works only employ visual backbones pre-trained either with independent images…
PMaF: Deep Declarative Layers for Principal Matrix Features
We explore two differentiable deep declarative layers, namely least squares on sphere (LESS) and implicit eigen decomposition (IED), for learning the principal matrix features (PMaF). It can be used to represent data features with a low-di…
Towards Understanding Gradient Approximation in Equality Constrained Deep Declarative Networks
We explore conditions for when the gradient of a deep declarative node can be approximated by ignoring constraint terms and still result in a descent direction for the global loss function. This has important practical application when tra…
Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder
Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these …
GoferBot: A Visual Guided Human-Robot Collaborative Assembly System
The current transformation towards smart manufacturing has led to a growing demand for human-robot collaboration (HRC) in the manufacturing process. Perceiving and understanding the human co-worker's behaviour introduces challenges for col…
Adaptive Cross Batch Normalization for Metric Learning
Metric learning is a fundamental problem in computer vision whereby a model is trained to learn a semantically useful embedding space via ranking losses. Traditionally, the effectiveness of a ranking loss depends on the minibatch size, and…
Bi-directional Training for Composed Image Retrieval via Text Prompt Learning
Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text describing the desired changes. Existing approaches to solving this challenging task learn a mappin…
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps that are depicted as assembly dia…
Deep Declarative Dynamic Time Warping for End-to-End Learning of Alignment Paths
This paper addresses learning end-to-end models for time series data that include a temporal alignment step via dynamic time warping (DTW). Existing approaches to differentiable DTW either differentiate through a fixed warping path or appl…