Vasudev Lal
DPO Learning with LLMs-Judge Signal for Computer Use Agents
Computer use agents (CUAs) are systems that automatically interact with graphical user interfaces (GUIs) to complete tasks. CUAs have made significant progress with the advent of large vision-language models (VLMs). However, these agents typ…
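The abstract cuts off above, but DPO itself is a standard objective. For orientation, a minimal PyTorch sketch of vanilla DPO (not necessarily the paper's exact recipe), where an LLM judge would supply the chosen/rejected preference labels:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: prefer the chosen trajectory over the
    rejected one, measured as log-prob margins against a frozen
    reference model. Inputs are per-sequence summed log-probs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin): small when the chosen margin exceeds the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```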
Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning
Multimodal models typically combine a powerful large language model (LLM) with a vision encoder and are then trained on multimodal data via instruction tuning. While this process adapts LLMs to multimodal settings, it remains unclear wheth…
Cultural Awareness in Vision-Language Models: A Cross-Country Exploration
Vision-Language Models (VLMs) are increasingly deployed in diverse cultural contexts, yet their internal biases remain poorly understood. In this work, we propose a novel framework to systematically evaluate how VLMs encode cultural differ…
Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders
The ImageNet hierarchy provides a structured taxonomy of object categories, offering a valuable lens through which to analyze the representations learned by deep vision models. In this work, we conduct a comprehensive analysis of how visio…
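As a rough sketch of the kind of ReLU sparse autoencoder commonly trained on model activations for this sort of analysis; the architecture and L1 penalty below are the standard baseline, not necessarily the paper's configuration:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder: an overcomplete ReLU encoder with an
    L1 sparsity penalty, used to decompose activations into features."""
    def __init__(self, d_model, d_hidden, l1_coeff=1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, x):
        f = torch.relu(self.enc(x))      # sparse feature activations
        x_hat = self.dec(f)              # reconstruction of the input
        loss = ((x_hat - x) ** 2).mean() + self.l1_coeff * f.abs().mean()
        return x_hat, f, loss
```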
Pruning the Paradox: How CLIP's Most Informative Heads Enhance Performance While Amplifying Bias
CLIP is one of the most popular foundation models and is heavily used for many vision-language tasks, yet little is known about its inner workings. As CLIP is increasingly deployed in real-world applications, it is becoming even more criti…
Probing Semantic Routing in Large Mixture-of-Expert Models
In the past year, large (>100B parameter) mixture-of-expert (MoE) models have become increasingly common in the open domain. While their advantages are often framed in terms of efficiency, prior work has also explored functional differenti…
FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability
Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as…
A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment
Do generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learn a world model from which a sequence is generated one token at a time? We address this question by deriving a causal interpretation…
Steering Large Language Models to Evaluate and Amplify Creativity
Although capable of generating creative text, Large Language Models (LLMs) are poor judges of what constitutes "creativity". In this work, we show that we can leverage their ability to write creatively in order to better judge what…
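Steering of this kind is typically done by shifting hidden states along a learned direction at inference time. A generic activation-steering sketch, assuming a HuggingFace-style decoder exposing `model.model.layers`; the steering direction and scale `alpha` are illustrative, not from the paper:

```python
def add_steering_hook(model, layer_idx, steering_vector, alpha=4.0):
    """Register a forward hook that shifts one decoder layer's hidden
    states along a steering direction. Generic activation steering,
    not the paper's exact procedure."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vector.to(hidden)
        # Returning a value from a forward hook replaces the layer output.
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

The returned handle can later be removed with `handle.remove()` to restore the unsteered model.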
FastRM: An efficient and automatic explainability framework for multimodal generative models
Large Vision Language Models (LVLMs) have demonstrated remarkable reasoning capabilities over textual and visual inputs. However, these models remain prone to generating misinformation. Identifying and mitigating ungrounded responses is cr…
Debias your Large Multi-Modal Model at Test-Time via Non-Contrastive Visual Attribute Steering
Large Multi-Modal Models (LMMs) have demonstrated impressive capabilities as general-purpose chatbots able to engage in conversations about visual inputs. However, their responses are influenced by societal biases present in their training…
Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency
Knowledge graphs (KGs) generated by large language models (LLMs) are becoming increasingly valuable for Retrieval-Augmented Generation (RAG) applications that require knowledge-intensive reasoning. However, existing KG extraction methods p…
Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations
Large Vision Language Models (LVLMs) such as LLaVA have demonstrated impressive capabilities as general-purpose chatbots that can engage in conversations about a provided input image. However, their responses are influenced by societal bia…
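Ablating an attribute representation usually means projecting it out of the hidden states. A minimal sketch of that orthogonal projection, assuming the protected-attribute direction has already been estimated; this illustrates the general technique, not the paper's exact procedure:

```python
import torch

def ablate_direction(hidden, attribute_dir):
    """Remove the component of hidden states lying along a
    protected-attribute direction (directional ablation).
    hidden: (..., d_model); attribute_dir: (d_model,)."""
    d = attribute_dir / attribute_dir.norm()
    # Subtract the projection of each hidden vector onto d.
    return hidden - (hidden @ d).unsqueeze(-1) * d
```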
Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review
Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manus…
Quantifying and Enabling the Interpretability of CLIP-like Models
CLIP is one of the most popular foundational models and is heavily used for many vision-language tasks. However, little is known about the inner workings of CLIP. To bridge this gap we propose a study to quantify the interpretability in CL…
ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution
Detecting and attributing temperature increases driven by climate change is crucial for understanding global warming and informing adaptation strategies. However, distinguishing human-induced climate signals from natural variability remain…
Why do LLaVA Vision-Language Models Reply to Images in English?
We uncover a surprising multilingual bias occurring in a popular class of multimodal vision-language models (VLMs). Including an image in the query to a LLaVA-style VLM significantly increases the likelihood of the model returning an Engli…
SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Multimodal retrieval augmented generation (RAG) plays a crucial role in domains such as knowledge-based visual question answering (KB-VQA), where external knowledge is needed to answer a question. However, existing multimodal LLMs (MLLMs) …
L-MAGIC: Language Model Assisted Generation of Images with Coherence
In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack …
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models
In the rapidly evolving landscape of artificial intelligence, multi-modal large language models, which combine various forms of data input, are emerging as an increasingly significant area of interest. How…
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive in…
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides oppor…
SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples
While vision-language models (VLMs) have achieved remarkable performance improvements recently, there is growing evidence that these models also possess harmful biases with respect to social attributes such as gender and race. Prior studies…
NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation
Despite impressive recent advances in text-to-image diffusion models, obtaining high-quality images often requires prompt engineering by humans who have developed expertise in using them. In this work, we present NeuroPrompts, an adaptive …
LDM3D-VR: Latent Diffusion Model for 3D VR
Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of di…