Mustafa Shukor
Learning to Steer: Input-dependent Steering for Multimodal LLMs
Steering has emerged as a practical approach to enable post-hoc guidance of LLMs toward a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such …
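For context, here is a minimal sketch of the generic steering technique the abstract refers to (not the input-dependent variant this paper proposes): a fixed steering vector is added to one layer's hidden states at inference time via a forward hook. The model, layer index, and vector in the usage comment are placeholders.

```python
import torch

def make_steering_hook(steering_vec, alpha=4.0):
    """Forward hook that shifts a layer's hidden states along a steering direction."""
    def hook(module, inputs, output):
        # Decoder layers often return a tuple; the hidden states come first.
        if isinstance(output, tuple):
            return (output[0] + alpha * steering_vec,) + output[1:]
        return output + alpha * steering_vec
    return hook

# Hypothetical usage: `model` is a HuggingFace-style decoder and `vec` a
# steering direction (e.g. the mean activation difference between prompts
# that do and do not exhibit the target behavior); the layer index is arbitrary.
# handle = model.model.layers[15].register_forward_hook(make_steering_hook(vec))
# ... model.generate(...) ...
# handle.remove()
```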
Scaling Laws for Optimal Data Mixtures
Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on…
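As a rough illustration of the scaling-law machinery involved, the sketch below fits a standard saturating power law to toy loss-versus-compute points for one candidate mixture; the paper's actual functional form and data are not reproduced here.

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy losses for one candidate mixture at increasing compute; values are made up.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])   # in units of 1e18 FLOPs
loss = np.array([3.10, 2.85, 2.62, 2.44, 2.30])

def power_law(c, E, A, alpha):
    # Saturating power law L(C) = E + A * C^(-alpha), a standard scaling-law form.
    return E + A * np.power(c, -alpha)

(E, A, alpha), _ = curve_fit(power_law, compute, loss, p0=(2.0, 1.0, 0.3))
print(f"irreducible loss E={E:.2f}, exponent alpha={alpha:.2f}")

# Fitting one such curve per candidate mixture and extrapolating to the target
# budget turns mixture selection into picking the lowest predicted loss.
```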
FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models
Foundation models have exhibited unprecedented capabilities in tackling many domains and tasks. Models such as CLIP are currently widely used to bridge cross-modal representations, and text-to-image diffusion models are arguably the leadin…
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches ad…
Scaling Laws for Native Multimodal Models
Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders …
Analyzing Finetuning Representation Shift for Multimodal LLMs Steering
Multimodal LLMs (MLLMs) have reached remarkable levels of proficiency in understanding multimodal inputs. However, understanding and interpreting the behavior of such complex models is a challenging task, not to mention the dynamic shifts …
Multimodal Autoregressive Pre-training of Large Vision Encoders
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this …
A Concept-Based Explainability Framework for Large Multimodal Models
Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs…
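One common route to this kind of interpretability is dictionary learning on activations. The sketch below uses scikit-learn's NMF on a random stand-in activation matrix purely to illustrate the idea; it is not the paper's exact factorization, and non-negative (e.g. post-ReLU) features are assumed.

```python
import numpy as np
from sklearn.decomposition import NMF

# Stand-in for a (tokens x hidden_dim) matrix of LMM activations. NMF needs
# non-negative input, so post-ReLU features are assumed; the data here is random.
rng = np.random.default_rng(0)
activations = np.abs(rng.normal(size=(500, 256)))

# Factor activations ~ U @ V: rows of V act as candidate concept directions,
# rows of U give each token's loading on each concept.
nmf = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
U = nmf.fit_transform(activations)   # (tokens x concepts)
V = nmf.components_                  # (concepts x hidden_dim)

# Inspecting the tokens that load most on a concept hints at what it encodes.
print(np.argsort(U[:, 3])[::-1][:5])
```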
DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut
Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In…
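To make the "normalized cut on features" idea concrete, here is a toy sketch: build a patch-affinity graph from stand-in diffusion features and bipartition it with spectral clustering, which approximates a normalized cut. The recursive refinement and the real diffusion features are omitted.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Stand-in for (num_patches x dim) diffusion features of one image.
rng = np.random.default_rng(0)
feats = rng.normal(size=(196, 64))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

# Cosine-similarity affinity between patches, clipped to be non-negative.
affinity = np.clip(feats @ feats.T, 0.0, None)

# Spectral clustering approximates a normalized cut of the patch graph;
# a single bipartition (n_clusters=2) is the basic cut a recursion builds on.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)

# Reshaping labels to the 14x14 patch grid yields a coarse segmentation mask.
print(labels.reshape(14, 14))
```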
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks, without any multimodal finetuning. They are the building block for Large Multimodal Models, yet we still lack a proper understanding of their succe…
What Makes Multimodal In-Context Learning Work?
Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we pre…
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
The abilities of large language models (LLMs) have recently progressed to unprecedented levels, paving the way to novel applications in a wide variety of areas. In computer vision, LLMs can be used to prime vision-language tasks such as image…
Efficient adaptation of Foundation Models for Visual Grounding Remote Sensing task
Foundation models have demonstrated impressive proficiency across multiple domains, including language, vision, and multi-modal applications, establishing new standards for efficiency and adaptability. In the context of localization-based …
Empirical Study of PEFT Techniques for Winter-Wheat Segmentation
Parameter Efficient Fine-Tuning (PEFT) techniques have recently experienced significant growth and have been extensively employed to adapt large vision and language models to various domains, enabling satisfactory model performance with mi…
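As an illustration of the kind of PEFT recipe such studies compare, the sketch below attaches LoRA adapters to a ViT classifier with the Hugging Face peft library; the backbone, target modules, and rank are assumptions rather than the paper's exact setup.

```python
from transformers import ViTForImageClassification
from peft import LoraConfig, get_peft_model

# Generic LoRA recipe; backbone choice and label count are placeholders.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", num_labels=2, ignore_mismatched_sizes=True
)

config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],   # attention projections in ViT
    modules_to_save=["classifier"],      # keep the task head trainable
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```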
Extending CAM-based XAI methods for Remote Sensing Imagery Segmentation
Current AI-based methods do not provide comprehensible physical interpretations of the utilized data, extracted features, and predictions/inference operations. As a result, deep learning models trained using high-resolution satellite image…
Zero-Shot Refinement of Buildings' Segmentation Models using SAM
Foundation models have excelled in various tasks but are often evaluated on general benchmarks. The adaptation of these models for specific domains, such as remote sensing imagery, remains an underexplored area. In remote sensing, precise …
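A hedged sketch of one plausible refinement loop: sample positive point prompts from a coarse building mask produced by an existing model and feed them to SAM's predictor. The checkpoint path, image, and mask below are placeholders, not the paper's pipeline.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Checkpoint path, tile, and coarse mask are all placeholders for illustration.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)   # stand-in aerial RGB tile
coarse_mask = np.zeros((512, 512), dtype=bool)    # base model's building mask
coarse_mask[100:200, 150:250] = True

predictor.set_image(image)

# Sample positive point prompts from the coarse mask; SAM expects (x, y) order.
ys, xs = np.nonzero(coarse_mask)
idx = np.random.default_rng(0).choice(len(xs), size=5, replace=False)
points = np.stack([xs[idx], ys[idx]], axis=1).astype(np.float32)

refined, scores, _ = predictor.predict(
    point_coords=points,
    point_labels=np.ones(len(points), dtype=np.int32),
    multimask_output=False,          # one refined mask per tile
)
```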
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with …
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Large Language Models (LLMs) have brought the ambitious quest for generalist agents significantly closer to reality. A key hurdle for building such general models is the diversity and heterogeneity of tasks and modalities. A promising …
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further align the network with the intended usage. Yet the imperfect…
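The core operation behind rewarded soups is linear interpolation of checkpoints fine-tuned on different rewards. A minimal sketch, with hypothetical checkpoint names in the usage comment:

```python
import torch

def weight_soup(state_dicts, coeffs):
    """Linearly interpolate N fine-tuned checkpoints with mixing coefficients
    that sum to 1, yielding one model per preference over the rewards."""
    assert abs(sum(coeffs) - 1.0) < 1e-6
    soup = {}
    for key in state_dicts[0]:
        soup[key] = sum(c * sd[key].float() for c, sd in zip(coeffs, state_dicts))
    return soup

# Hypothetical usage with two policies fine-tuned on different reward models:
# sd_a = torch.load("policy_reward_a.pt"); sd_b = torch.load("policy_reward_b.pt")
# model.load_state_dict(weight_soup([sd_a, sd_b], coeffs=[0.3, 0.7]))
```

Sweeping the coefficients traces out a family of models, one per trade-off between the rewards, without any additional training.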
eP-ALM: Efficient Perceptual Augmentation of Language Models
Large Language Models (LLMs) have so far impressed the world, with unprecedented capabilities that emerge in models at large scales. On the vision side, transformer models (e.g., ViT) are following the same trend, achieving the best perfor…
Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
Vision-Language Pretraining (VLP) and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks, such as cooking …
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks. The current trend is to move towards ever larger models and pretraining datasets. This computational headlong rush does not seem re…
Video Coding Using Learned Latent GAN Compression
We propose in this paper a new paradigm for facial video compression. We leverage the generative capacity of GANs such as StyleGAN to represent and compress a video, including intra and inter compression. Each frame is inverted in the late…
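A toy sketch of the latent-space intra/inter idea: store the first frame's latent plus quantized inter-frame latent deltas. Random vectors stand in for real StyleGAN inversions, and the scalar quantizer is far cruder than an actual codec.

```python
import numpy as np

def compress_latents(latents, step=0.02):
    """Given per-frame latents (frames x dim), keep the first latent (intra)
    plus quantized inter-frame deltas -- the intra/inter split in latent space."""
    intra = latents[0]
    q_deltas = np.round(np.diff(latents, axis=0) / step).astype(np.int16)
    return intra, q_deltas

def decompress_latents(intra, q_deltas, step=0.02):
    deltas = q_deltas.astype(np.float32) * step
    return np.concatenate([intra[None], intra + np.cumsum(deltas, axis=0)], axis=0)

# Toy check with random latents standing in for real GAN inversions:
lat = (np.random.randn(30, 512) * 0.1).astype(np.float32)
intra, q = compress_latents(lat)
rec = decompress_latents(intra, q)
print(np.abs(rec - lat).max())  # accumulated quantization error, small for this step
```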
Semantic Unfolding of StyleGAN Latent Space
Generative adversarial networks (GANs) have proven to be surprisingly efficient for image editing by inverting and manipulating the latent code corresponding to an input real image. This editing property emerges from the disentangled natur…
Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval
Cross-modal image-recipe retrieval has gained significant attention in recent years. Most work focuses on improving cross-modal embeddings using unimodal encoders, which allow for efficient retrieval in large-scale databases, leaving aside …
Buildings Classification using Very High Resolution Satellite Imagery
Buildings classification using satellite images is becoming more important for several applications such as damage assessment, resource allocation, and population estimation. In this work, we focus on buildings damage assessment (BDA) and…
Sci-Net: Scale Invariant Model for Buildings Segmentation from Aerial Imagery
Buildings' segmentation is a fundamental task in the field of earth observation and aerial imagery analysis. Most existing deep learning-based methods in the literature apply only to imagery within a fixed or narrow range of spatial resolutions.
Synthetic training data generation for deep learning based quality inspection
Deep learning is now the gold standard in computer vision-based quality inspection systems. In order to detect defects, supervised learning is often utilized, but necessitates a large amount of annotated images, which can be costly: collec…