Mustafa Shukor
Learning to Steer: Input-dependent Steering for Multimodal LLMs
Steering has emerged as a practical approach to enable post-hoc guidance of LLMs toward a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such …
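For context, here is a minimal sketch of the generic steering technique the abstract refers to (not the input-dependent variant this paper proposes): a fixed steering vector is added to one layer's hidden states at inference time via a forward hook. The model, layer index, and vector in the usage comment are placeholders.

```python
import torch

def make_steering_hook(steering_vec, alpha=4.0):
    """Forward hook that shifts a layer's hidden states along a steering direction."""
    def hook(module, inputs, output):
        # Decoder layers often return a tuple; the hidden states come first.
        if isinstance(output, tuple):
            return (output[0] + alpha * steering_vec,) + output[1:]
        return output + alpha * steering_vec
    return hook

# Hypothetical usage: `model` is a HuggingFace-style decoder and `vec` a
# steering direction (e.g. the mean activation difference between prompts
# that do and do not exhibit the target behavior); the layer index is arbitrary.
# handle = model.model.layers[15].register_forward_hook(make_steering_hook(vec))
# ... model.generate(...) ...
# handle.remove()
```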
Scaling Laws for Optimal Data Mixtures
Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on…
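As a rough illustration of the scaling-law machinery involved, the sketch below fits a standard saturating power law to toy loss-versus-compute points for one candidate mixture; the paper's actual functional form and data are not reproduced here.

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy losses for one candidate mixture at increasing compute; values are made up.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])   # in units of 1e18 FLOPs
loss = np.array([3.10, 2.85, 2.62, 2.44, 2.30])

def power_law(c, E, A, alpha):
    # Saturating power law L(C) = E + A * C^(-alpha), a standard scaling-law form.
    return E + A * np.power(c, -alpha)

(E, A, alpha), _ = curve_fit(power_law, compute, loss, p0=(2.0, 1.0, 0.3))
print(f"irreducible loss E={E:.2f}, exponent alpha={alpha:.2f}")

# Fitting one such curve per candidate mixture and extrapolating to the target
# budget turns mixture selection into picking the lowest predicted loss.
```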
FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models
Foundation models have exhibited unprecedented capabilities in tackling many domains and tasks. Models such as CLIP are currently widely used to bridge cross-modal representations, and text-to-image diffusion models are arguably the leadin…
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches ad…
Scaling Laws for Native Multimodal Models
Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders …
Analyzing Finetuning Representation Shift for Multimodal LLMs Steering
Multimodal LLMs (MLLMs) have reached remarkable levels of proficiency in understanding multimodal inputs. However, understanding and interpreting the behavior of such complex models is a challenging task, not to mention the dynamic shifts …
Multimodal Autoregressive Pre-training of Large Vision Encoders
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this …
A Concept-Based Explainability Framework for Large Multimodal Models
Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs…
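One common route to this kind of interpretability is dictionary learning on activations. The sketch below uses scikit-learn's NMF on a random stand-in activation matrix purely to illustrate the idea; it is not the paper's exact factorization, and non-negative (e.g. post-ReLU) features are assumed.

```python
import numpy as np
from sklearn.decomposition import NMF

# Stand-in for a (tokens x hidden_dim) matrix of LMM activations. NMF needs
# non-negative input, so post-ReLU features are assumed; the data here is random.
rng = np.random.default_rng(0)
activations = np.abs(rng.normal(size=(500, 256)))

# Factor activations ~ U @ V: rows of V act as candidate concept directions,
# rows of U give each token's loading on each concept.
nmf = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
U = nmf.fit_transform(activations)   # (tokens x concepts)
V = nmf.components_                  # (concepts x hidden_dim)

# Inspecting the tokens that load most on a concept hints at what it encodes.
print(np.argsort(U[:, 3])[::-1][:5])
```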
DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut
Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In…
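To make the "normalized cut on features" idea concrete, here is a toy sketch: build a patch-affinity graph from stand-in diffusion features and bipartition it with spectral clustering, which approximates a normalized cut. The recursive refinement and the real diffusion features are omitted.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Stand-in for (num_patches x dim) diffusion features of one image.
rng = np.random.default_rng(0)
feats = rng.normal(size=(196, 64))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

# Cosine-similarity affinity between patches, clipped to be non-negative.
affinity = np.clip(feats @ feats.T, 0.0, None)

# Spectral clustering approximates a normalized cut of the patch graph;
# a single bipartition (n_clusters=2) is the basic cut a recursion builds on.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)

# Reshaping labels to the 14x14 patch grid yields a coarse segmentation mask.
print(labels.reshape(14, 14))
```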
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks, without any multimodal finetuning. They are the building block for Large Multimodal Models, yet we still lack a proper understanding of their succe…
What Makes Multimodal In-Context Learning Work?
Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we pre…
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
The abilities of large language models (LLMs) have recently progressed to unprecedented levels, paving the way to novel applications in a wide variety of areas. In computer vision, LLMs can be used to prime vision-language tasks such as image…
Efficient adaptation of Foundation Models for Visual Grounding Remote Sensing task
Foundation models have demonstrated impressive proficiency across multiple domains, including language, vision, and multi-modal applications, establishing new standards for efficiency and adaptability. In the context of localization-based …
Empirical Study of PEFT Techniques for Winter-Wheat Segmentation
Parameter Efficient Fine-Tuning (PEFT) techniques have recently experienced significant growth and have been extensively employed to adapt large vision and language models to various domains, enabling satisfactory model performance with mi…
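As an illustration of the kind of PEFT recipe such studies compare, the sketch below attaches LoRA adapters to a ViT classifier with the Hugging Face peft library; the backbone, target modules, and rank are assumptions rather than the paper's exact setup.

```python
from transformers import ViTForImageClassification
from peft import LoraConfig, get_peft_model

# Generic LoRA recipe; backbone choice and label count are placeholders.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", num_labels=2, ignore_mismatched_sizes=True
)

config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],   # attention projections in ViT
    modules_to_save=["classifier"],      # keep the task head trainable
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```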
Extending CAM-based XAI methods for Remote Sensing Imagery Segmentation
Current AI-based methods do not provide comprehensible physical interpretations of the utilized data, extracted features, and predictions/inference operations. As a result, deep learning models trained using high-resolution satellite image…
Zero-Shot Refinement of Buildings' Segmentation Models using SAM
Foundation models have excelled in various tasks but are often evaluated on general benchmarks. The adaptation of these models for specific domains, such as remote sensing imagery, remains an underexplored area. In remote sensing, precise …
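A hedged sketch of one plausible refinement loop: sample positive point prompts from a coarse building mask produced by an existing model and feed them to SAM's predictor. The checkpoint path, image, and mask below are placeholders, not the paper's pipeline.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Checkpoint path, tile, and coarse mask are all placeholders for illustration.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)   # stand-in aerial RGB tile
coarse_mask = np.zeros((512, 512), dtype=bool)    # base model's building mask
coarse_mask[100:200, 150:250] = True

predictor.set_image(image)

# Sample positive point prompts from the coarse mask; SAM expects (x, y) order.
ys, xs = np.nonzero(coarse_mask)
idx = np.random.default_rng(0).choice(len(xs), size=5, replace=False)
points = np.stack([xs[idx], ys[idx]], axis=1).astype(np.float32)

refined, scores, _ = predictor.predict(
    point_coords=points,
    point_labels=np.ones(len(points), dtype=np.int32),
    multimask_output=False,          # one refined mask per tile
)
```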
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with …
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Large Language Models (LLMs) have brought the ambitious quest for generalist agents significantly closer to reality. A key hurdle for building such general models is the diversity and heterogeneity of tasks and modalities. A promising …
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further align the network with the intended usage. Yet the imperfect…
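The core operation behind rewarded soups is linear interpolation of checkpoints fine-tuned on different rewards. A minimal sketch, with hypothetical checkpoint names in the usage comment:

```python
import torch

def weight_soup(state_dicts, coeffs):
    """Linearly interpolate N fine-tuned checkpoints with mixing coefficients
    that sum to 1, yielding one model per preference over the rewards."""
    assert abs(sum(coeffs) - 1.0) < 1e-6
    soup = {}
    for key in state_dicts[0]:
        soup[key] = sum(c * sd[key].float() for c, sd in zip(coeffs, state_dicts))
    return soup

# Hypothetical usage with two policies fine-tuned on different reward models:
# sd_a = torch.load("policy_reward_a.pt"); sd_b = torch.load("policy_reward_b.pt")
# model.load_state_dict(weight_soup([sd_a, sd_b], coeffs=[0.3, 0.7]))
```

Sweeping the coefficients traces out a family of models, one per trade-off between the rewards, without any additional training.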
eP-ALM: Efficient Perceptual Augmentation of Language Models
Large Language Models (LLMs) have so far impressed the world, with unprecedented capabilities that emerge in models at large scales. On the vision side, transformer models (e.g., ViT) are following the same trend, achieving the best perfor…
Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
Vision-Language Pretraining (VLP) and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks, such as cooking …
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks. The current trend is to move towards ever larger models and pretraining datasets. This computational headlong rush does not seem re…
Video Coding Using Learned Latent GAN Compression
We propose in this paper a new paradigm for facial video compression. We leverage the generative capacity of GANs such as StyleGAN to represent and compress a video, including intra and inter compression. Each frame is inverted in the late…
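A toy sketch of the latent-space intra/inter idea: store the first frame's latent plus quantized inter-frame latent deltas. Random vectors stand in for real StyleGAN inversions, and the scalar quantizer is far cruder than an actual codec.

```python
import numpy as np

def compress_latents(latents, step=0.02):
    """Given per-frame latents (frames x dim), keep the first latent (intra)
    plus quantized inter-frame deltas -- the intra/inter split in latent space."""
    intra = latents[0]
    q_deltas = np.round(np.diff(latents, axis=0) / step).astype(np.int16)
    return intra, q_deltas

def decompress_latents(intra, q_deltas, step=0.02):
    deltas = q_deltas.astype(np.float32) * step
    return np.concatenate([intra[None], intra + np.cumsum(deltas, axis=0)], axis=0)

# Toy check with random latents standing in for real GAN inversions:
lat = (np.random.randn(30, 512) * 0.1).astype(np.float32)
intra, q = compress_latents(lat)
rec = decompress_latents(intra, q)
print(np.abs(rec - lat).max())  # accumulated quantization error, small for this step
```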
Semantic Unfolding of StyleGAN Latent Space
Generative adversarial networks (GANs) have proven to be surprisingly efficient for image editing by inverting and manipulating the latent code corresponding to an input real image. This editing property emerges from the disentangled natur…
Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval
Cross-modal image-recipe retrieval has gained significant attention in recent years. Most work focuses on improving cross-modal embeddings using unimodal encoders, which allow for efficient retrieval in large-scale databases, leaving aside …
Buildings Classification using Very High Resolution Satellite Imagery
Buildings classification using satellite images is becoming more important for several applications such as damage assessment, resource allocation, and population estimation. In this work, we focus on buildings damage assessment (BDA) and…
Sci-Net: Scale Invariant Model for Buildings Segmentation from Aerial Imagery
Buildings' segmentation is a fundamental task in the field of earth observation and aerial imagery analysis. Most existing deep learning-based methods in the literature apply only to imagery within a fixed or narrow range of spatial resolutions.
Synthetic training data generation for deep learning based quality inspection
Deep learning is now the gold standard in computer vision-based quality inspection systems. In order to detect defects, supervised learning is often utilized, but necessitates a large amount of annotated images, which can be costly: collec…