Yongqin Xian
MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computationa…
UIP2P: Unsupervised Instruction-based Image Editing via Edit Reversibility Constraint
We propose an unsupervised instruction-based image editing approach that removes the need for ground-truth edited images during training. Existing methods rely on supervised learning with triplets of input images, ground-truth edited image…
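The reversibility constraint can be read as a cycle-consistency objective: applying an edit and then its reverse instruction should reconstruct the input image, which is what removes the need for ground-truth edited images. A minimal sketch of that reading follows; `edit_model` is a hypothetical callable, and the paper defines its actual constraint inside its own diffusion training pipeline.

```python
import torch.nn.functional as F

def edit_reversibility_loss(edit_model, image, instruction, reverse_instruction):
    """Cycle-consistency sketch: edit forward, edit back, compare to the
    original. `edit_model(image, instruction) -> edited image` is assumed."""
    edited = edit_model(image, instruction)
    recovered = edit_model(edited, reverse_instruction)
    return F.mse_loss(recovered, image)
```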
Active Data Curation Effectively Distills Large-Scale Multimodal Models
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher-ensembles, and weight inh…
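For context, the de facto standard the abstract refers to is Hinton-style soft-target distillation. The sketch below shows that baseline objective only; it is not the paper's active data curation strategy.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-target distillation: KL divergence between temperature-softened
    teacher and student distributions, scaled by T^2 so gradient magnitudes
    stay comparable across temperatures."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```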
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises …
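The core idea is to treat model parameters as tokens that inputs attend to, so capacity can grow by appending parameter tokens instead of resizing weight matrices. Below is a minimal sketch of one such token-parameter attention layer; the initialization, scaling, and normalization here are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenParamAttention(nn.Module):
    """Input tokens attend over learnable key/value parameter tokens;
    adding rows to the parameter tables scales capacity without
    changing the layer's interface."""
    def __init__(self, dim, num_param_tokens):
        super().__init__()
        self.param_keys = nn.Parameter(torch.randn(num_param_tokens, dim) / dim ** 0.5)
        self.param_values = nn.Parameter(torch.randn(num_param_tokens, dim) / dim ** 0.5)

    def forward(self, x):                       # x: (batch, seq, dim)
        scores = x @ self.param_keys.T          # (batch, seq, num_param_tokens)
        weights = F.softmax(scores / x.shape[-1] ** 0.5, dim=-1)
        return weights @ self.param_values      # (batch, seq, dim)
```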
Toward a Diffusion-Based Generalist for Dense Vision Tasks
Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown that the image itself can be used as a natural interface for general-purpose visual perception and demonstrated…
LocCa: Visual Pretraining with Location-aware Captioners
Image captioning has been shown as an effective pretraining method similar to contrastive pretraining. However, the incorporation of location-aware information into visual pretraining remains an area with limited research. In this paper, w…
Text-Conditioned Resampler For Long Form Video Understanding
In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features fr…
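A common way to realize such a module is Perceiver-style cross-attention: a small set of learnable queries, conditioned on the text, attends over the long sequence of frozen visual features and emits a fixed-length summary for the LLM. The sketch below illustrates that general pattern under assumed dimensions; it is not the paper's exact TCR architecture.

```python
import torch
import torch.nn as nn

class TextConditionedResampler(nn.Module):
    """Learnable queries, shifted by a pooled text embedding, cross-attend
    over long video features and return a fixed number of tokens."""
    def __init__(self, dim=512, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_feats, text_feats):
        # video_feats: (b, t, dim) frozen encoder output; text_feats: (b, k, dim)
        b = video_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q = q + text_feats.mean(dim=1, keepdim=True)   # simple text conditioning
        out, _ = self.attn(q, video_feats, video_feats)
        return out                                     # (b, num_queries, dim)
```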
LIME: Localized Image Editing via Attention Regularization in Diffusion Models
Diffusion models (DMs) have gained prominence due to their ability to generate high-quality varied images with recent advancements in text-to-image generation. The research focus is now shifting towards the controllability of DMs. A signif…
PALM: Predicting Actions through Language Models
Understanding human activity is a crucial yet intricate task in egocentric vision, a field that focuses on capturing visual perspectives from the camera wearer's viewpoint. Traditional methods heavily rely on representation learning that i…
SILC: Improving Vision Language Pretraining with Self-Distillation
Image-Text pretraining on web-scale image caption datasets has become the default recipe for open vocabulary classification and retrieval models thanks to the success of CLIP and its variants. Several works have also used CLIP features for…
Learning Prototype Classifiers for Long-Tailed Recognition
The problem of long-tailed recognition (LTR) has received attention in recent years due to the fundamental power-law distribution of objects in the real world. Most recent works in LTR use softmax classifiers that are biased in that they c…
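In its simplest form, a prototype classifier scores a sample by distance to learnable per-class prototypes rather than through a linear softmax head. The sketch below shows that generic formulation with hypothetical dimensions; the paper's exact learning procedure may differ.

```python
import torch
import torch.nn.functional as F

def prototype_logits(features, prototypes):
    """Negative squared Euclidean distance to each class prototype;
    the nearest prototype yields the highest logit."""
    return -torch.cdist(features, prototypes) ** 2    # (batch, num_classes)

# Hypothetical usage: prototypes are trained jointly with the backbone.
features = torch.randn(8, 128)                        # (batch, feat_dim)
prototypes = torch.randn(10, 128, requires_grad=True) # (num_classes, feat_dim)
labels = torch.randint(0, 10, (8,))
loss = F.cross_entropy(prototype_logits(features, prototypes), labels)
```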
Detecting Adversarial Faces Using Only Real Face Self-Perturbations
Adversarial attacks aim to disturb the functionality of a target system by adding specific noise to the input samples, bringing potential threats to security and robustness when applied to facial recognition systems. Although existing defe…
Urban Scene Semantic Segmentation with Low-Cost Coarse Annotation
For best performance, today's semantic segmentation methods use large and carefully labeled datasets, requiring expensive annotation budgets. In this work, we show that coarse annotation is a low-cost but highly effective alternative for t…
CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution
Learning continuous image representations is recently gaining popularity for image super-resolution (SR) because of its ability to reconstruct high-resolution images with arbitrary scales from low-resolution inputs. Existing methods mostly…
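The continuous-representation idea rests on decoding color at arbitrary real-valued coordinates: an MLP conditioned on a local encoder feature and the query coordinate can be evaluated at any output resolution. The sketch below shows that generic implicit-function query with assumed layer sizes; the paper's attention-in-attention design is not reproduced here.

```python
import torch
import torch.nn as nn

class ImplicitImageFunction(nn.Module):
    """Map (local feature, continuous coordinate) -> RGB, so the same
    model can render any target scale by sampling denser coordinates."""
    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, feats, coords):
        # feats: (n, feat_dim) local features; coords: (n, 2) in [-1, 1]
        return self.mlp(torch.cat([feats, coords], dim=-1))
```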
I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification
Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and …
I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification
Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using …
Attribute Prototype Network for Any-Shot Learning
Any-shot image classification allows recognizing novel classes with only a few or even zero samples. For the task of zero-shot learning, visual attributes have been shown to play an important role, while in the few-shot regime, the effect…
Learning Graph Embeddings for Open World Compositional Zero-Shot Learning
Compositional Zero-Shot learning (CZSL) aims to recognize unseen compositions of state and object visual primitives seen during training. A problem with standard CZSL is the assumption of knowing which unseen compositions will be available…
VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning
Human-annotated attributes serve as powerful semantic embeddings in zero-shot learning. However, their annotation process is labor-intensive and needs expert supervision. Current unsupervised semantic embeddings, i.e., word embeddings, ena…
3D Compositional Zero-Shot Learning with DeCompositional Consensus
Parts represent a basic unit of geometric and semantic similarity across different objects. We argue that part knowledge should be composable beyond the observed object classes. Towards this, we present 3D Compositional Zero-shot Learning …
Open World Compositional Zero-Shot Learning
Compositional Zero-Shot learning (CZSL) requires recognizing state-object compositions unseen during training. In this work, instead of assuming prior knowledge about the unseen compositions, we operate in the open world setting, where th…
Learning Graph Embeddings for Compositional Zero-shot Learning
In compositional zero-shot learning, the goal is to recognize unseen compositions (e.g. old dog) of visual primitives observed in the training set: states (e.g. old, cute) and objects (e.g. car, dog). This is challenging because the same st…
A Closer Look at Self-training for Zero-Label Semantic Segmentation
Being able to segment unseen classes not observed during training is an important technical challenge in deep learning, because of its potential to reduce the expensive annotation required for semantic segmentation. Prior zero-label semant…
Distilling Audio-Visual Knowledge by Compositional Contrastive Learning
Having access to multi-modal cues (e.g. vision and audio) empowers some cognitive tasks to be done faster compared to learning from a single modality. In this work, we propose to transfer knowledge across heterogeneous modalities, even tho…
Prototype-based Incremental Few-Shot Semantic Segmentation
Semantic segmentation models have two fundamental weaknesses: i) they require large training sets with costly pixel-level annotations, and ii) they have a static output space, constrained to the classes of the training set. Toward addressi…
A Few Guidelines for Incremental Few-Shot Segmentation
Reducing the amount of supervision required by neural networks is especially important in the context of semantic segmentation, where collecting dense pixel-level annotations is particularly expensive. In this paper, we address this proble…