Yongqin Xian
MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computationa…
UIP2P: Unsupervised Instruction-based Image Editing via Edit Reversibility Constraint
We propose an unsupervised instruction-based image editing approach that removes the need for ground-truth edited images during training. Existing methods rely on supervised learning with triplets of input images, ground-truth edited image…
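The reversibility constraint can be read as a cycle-consistency objective: applying an edit and then its reverse instruction should reconstruct the input image, which is what removes the need for ground-truth edited images. A minimal sketch of that reading follows; `edit_model` is a hypothetical callable, and the paper defines its actual constraint inside its own diffusion training pipeline.

```python
import torch.nn.functional as F

def edit_reversibility_loss(edit_model, image, instruction, reverse_instruction):
    """Cycle-consistency sketch: edit forward, edit back, compare to the
    original. `edit_model(image, instruction) -> edited image` is assumed."""
    edited = edit_model(image, instruction)
    recovered = edit_model(edited, reverse_instruction)
    return F.mse_loss(recovered, image)
```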
Active Data Curation Effectively Distills Large-Scale Multimodal Models
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher-ensembles, and weight inh…
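For context, the de facto standard the abstract refers to is Hinton-style soft-target distillation. The sketch below shows that baseline objective only; it is not the paper's active data curation strategy.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-target distillation: KL divergence between temperature-softened
    teacher and student distributions, scaled by T^2 so gradient magnitudes
    stay comparable across temperatures."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```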
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises …
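The core idea is to treat model parameters as tokens that inputs attend to, so capacity can grow by appending parameter tokens instead of resizing weight matrices. Below is a minimal sketch of one such token-parameter attention layer; the initialization, scaling, and normalization here are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenParamAttention(nn.Module):
    """Input tokens attend over learnable key/value parameter tokens;
    adding rows to the parameter tables scales capacity without
    changing the layer's interface."""
    def __init__(self, dim, num_param_tokens):
        super().__init__()
        self.param_keys = nn.Parameter(torch.randn(num_param_tokens, dim) / dim ** 0.5)
        self.param_values = nn.Parameter(torch.randn(num_param_tokens, dim) / dim ** 0.5)

    def forward(self, x):                       # x: (batch, seq, dim)
        scores = x @ self.param_keys.T          # (batch, seq, num_param_tokens)
        weights = F.softmax(scores / x.shape[-1] ** 0.5, dim=-1)
        return weights @ self.param_values      # (batch, seq, dim)
```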
Toward a Diffusion-Based Generalist for Dense Vision Tasks
Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown that the image itself can be used as a natural interface for general-purpose visual perception and demonstrated…
LocCa: Visual Pretraining with Location-aware Captioners
Image captioning has been shown as an effective pretraining method similar to contrastive pretraining. However, the incorporation of location-aware information into visual pretraining remains an area with limited research. In this paper, w…
Text-Conditioned Resampler For Long Form Video Understanding
In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features fr…
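A common way to realize such a module is Perceiver-style cross-attention: a small set of learnable queries, conditioned on the text, attends over the long sequence of frozen visual features and emits a fixed-length summary for the LLM. The sketch below illustrates that general pattern under assumed dimensions; it is not the paper's exact TCR architecture.

```python
import torch
import torch.nn as nn

class TextConditionedResampler(nn.Module):
    """Learnable queries, shifted by a pooled text embedding, cross-attend
    over long video features and return a fixed number of tokens."""
    def __init__(self, dim=512, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_feats, text_feats):
        # video_feats: (b, t, dim) frozen encoder output; text_feats: (b, k, dim)
        b = video_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q = q + text_feats.mean(dim=1, keepdim=True)   # simple text conditioning
        out, _ = self.attn(q, video_feats, video_feats)
        return out                                     # (b, num_queries, dim)
```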
LIME: Localized Image Editing via Attention Regularization in Diffusion Models
Diffusion models (DMs) have gained prominence due to their ability to generate high-quality varied images with recent advancements in text-to-image generation. The research focus is now shifting towards the controllability of DMs. A signif…
PALM: Predicting Actions through Language Models
Understanding human activity is a crucial yet intricate task in egocentric vision, a field that focuses on capturing visual perspectives from the camera wearer's viewpoint. Traditional methods heavily rely on representation learning that i…
SILC: Improving Vision Language Pretraining with Self-Distillation
Image-Text pretraining on web-scale image caption datasets has become the default recipe for open vocabulary classification and retrieval models thanks to the success of CLIP and its variants. Several works have also used CLIP features for…
Learning Prototype Classifiers for Long-Tailed Recognition
The problem of long-tailed recognition (LTR) has received attention in recent years due to the fundamental power-law distribution of objects in the real world. Most recent works in LTR use softmax classifiers that are biased in that they c…
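In its simplest form, a prototype classifier scores a sample by distance to learnable per-class prototypes rather than through a linear softmax head. The sketch below shows that generic formulation with hypothetical dimensions; the paper's exact learning procedure may differ.

```python
import torch
import torch.nn.functional as F

def prototype_logits(features, prototypes):
    """Negative squared Euclidean distance to each class prototype;
    the nearest prototype yields the highest logit."""
    return -torch.cdist(features, prototypes) ** 2    # (batch, num_classes)

# Hypothetical usage: prototypes are trained jointly with the backbone.
features = torch.randn(8, 128)                        # (batch, feat_dim)
prototypes = torch.randn(10, 128, requires_grad=True) # (num_classes, feat_dim)
labels = torch.randint(0, 10, (8,))
loss = F.cross_entropy(prototype_logits(features, prototypes), labels)
```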
Detecting Adversarial Faces Using Only Real Face Self-Perturbations
Adversarial attacks aim to disturb the functionality of a target system by adding specific noise to the input samples, bringing potential threats to security and robustness when applied to facial recognition systems. Although existing defe…
Urban Scene Semantic Segmentation with Low-Cost Coarse Annotation
For best performance, today's semantic segmentation methods use large and carefully labeled datasets, requiring expensive annotation budgets. In this work, we show that coarse annotation is a low-cost but highly effective alternative for t…
CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution
Learning continuous image representations is recently gaining popularity for image super-resolution (SR) because of its ability to reconstruct high-resolution images with arbitrary scales from low-resolution inputs. Existing methods mostly…
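The continuous-representation idea rests on decoding color at arbitrary real-valued coordinates: an MLP conditioned on a local encoder feature and the query coordinate can be evaluated at any output resolution. The sketch below shows that generic implicit-function query with assumed layer sizes; the paper's attention-in-attention design is not reproduced here.

```python
import torch
import torch.nn as nn

class ImplicitImageFunction(nn.Module):
    """Map (local feature, continuous coordinate) -> RGB, so the same
    model can render any target scale by sampling denser coordinates."""
    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, feats, coords):
        # feats: (n, feat_dim) local features; coords: (n, 2) in [-1, 1]
        return self.mlp(torch.cat([feats, coords], dim=-1))
```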
I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification
Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and …
I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification
Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using …
Attribute Prototype Network for Any-Shot Learning
Any-shot image classification allows recognizing novel classes with only a few or even zero samples. For the task of zero-shot learning, visual attributes have been shown to play an important role, while in the few-shot regime, the effect…
Learning Graph Embeddings for Open World Compositional Zero-Shot Learning
Compositional Zero-Shot learning (CZSL) aims to recognize unseen compositions of state and object visual primitives seen during training. A problem with standard CZSL is the assumption of knowing which unseen compositions will be available…
VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning
Human-annotated attributes serve as powerful semantic embeddings in zero-shot learning. However, their annotation process is labor-intensive and needs expert supervision. Current unsupervised semantic embeddings, i.e., word embeddings, ena…
3D Compositional Zero-Shot Learning with DeCompositional Consensus
Parts represent a basic unit of geometric and semantic similarity across different objects. We argue that part knowledge should be composable beyond the observed object classes. Towards this, we present 3D Compositional Zero-shot Learning …
Open World Compositional Zero-Shot Learning
Compositional Zero-Shot learning (CZSL) requires recognizing state-object compositions unseen during training. In this work, instead of assuming prior knowledge about the unseen compositions, we operate in the open world setting, where th…
Learning Graph Embeddings for Compositional Zero-shot Learning
In compositional zero-shot learning, the goal is to recognize unseen compositions (e.g. old dog) of visual primitives observed in the training set: states (e.g. old, cute) and objects (e.g. car, dog). This is challenging because the same st…
A Closer Look at Self-training for Zero-Label Semantic Segmentation
Being able to segment unseen classes not observed during training is an important technical challenge in deep learning, because of its potential to reduce the expensive annotation required for semantic segmentation. Prior zero-label semant…
Distilling Audio-Visual Knowledge by Compositional Contrastive Learning
Having access to multi-modal cues (e.g. vision and audio) empowers some cognitive tasks to be done faster compared to learning from a single modality. In this work, we propose to transfer knowledge across heterogeneous modalities, even tho…
Prototype-based Incremental Few-Shot Semantic Segmentation
Semantic segmentation models have two fundamental weaknesses: i) they require large training sets with costly pixel-level annotations, and ii) they have a static output space, constrained to the classes of the training set. Toward addressi…
A Few Guidelines for Incremental Few-Shot Segmentation
Reducing the amount of supervision required by neural networks is especially important in the context of semantic segmentation, where collecting dense pixel-level annotations is particularly expensive. In this paper, we address this proble…