Vasudev Lal
DPO Learning with LLMs-Judge Signal for Computer Use Agents
Computer use agents (CUAs) are systems that automatically interact with graphical user interfaces (GUIs) to complete tasks. CUAs have made significant progress with the advent of large vision-language models (VLMs). However, these agents typ…
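The abstract cuts off above, but DPO itself is a standard objective. For orientation, a minimal PyTorch sketch of vanilla DPO (not necessarily the paper's exact recipe), where an LLM judge would supply the chosen/rejected preference labels:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: prefer the chosen trajectory over the
    rejected one, measured as log-prob margins against a frozen
    reference model. Inputs are per-sequence summed log-probs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin): small when the chosen margin exceeds the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```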
Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning
Multimodal models typically combine a powerful large language model (LLM) with a vision encoder and are then trained on multimodal data via instruction tuning. While this process adapts LLMs to multimodal settings, it remains unclear wheth…
Cultural Awareness in Vision-Language Models: A Cross-Country Exploration
Vision-Language Models (VLMs) are increasingly deployed in diverse cultural contexts, yet their internal biases remain poorly understood. In this work, we propose a novel framework to systematically evaluate how VLMs encode cultural differ…
Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders
The ImageNet hierarchy provides a structured taxonomy of object categories, offering a valuable lens through which to analyze the representations learned by deep vision models. In this work, we conduct a comprehensive analysis of how visio…
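As a rough sketch of the kind of ReLU sparse autoencoder commonly trained on model activations for this sort of analysis; the architecture and L1 penalty below are the standard baseline, not necessarily the paper's configuration:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder: an overcomplete ReLU encoder with an
    L1 sparsity penalty, used to decompose activations into features."""
    def __init__(self, d_model, d_hidden, l1_coeff=1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, x):
        f = torch.relu(self.enc(x))      # sparse feature activations
        x_hat = self.dec(f)              # reconstruction of the input
        loss = ((x_hat - x) ** 2).mean() + self.l1_coeff * f.abs().mean()
        return x_hat, f, loss
```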
Pruning the Paradox: How CLIP's Most Informative Heads Enhance Performance While Amplifying Bias
CLIP is one of the most popular foundation models and is heavily used for many vision-language tasks, yet little is known about its inner workings. As CLIP is increasingly deployed in real-world applications, it is becoming even more criti…
Probing Semantic Routing in Large Mixture-of-Expert Models
In the past year, large (>100B parameter) mixture-of-expert (MoE) models have become increasingly common in the open domain. While their advantages are often framed in terms of efficiency, prior work has also explored functional differenti…
FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability
Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as…
A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment
Do generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learn a world model from which a sequence is generated one token at a time? We address this question by deriving a causal interpretation…
Steering Large Language Models to Evaluate and Amplify Creativity
Although capable of generating creative text, Large Language Models (LLMs) are poor judges of what constitutes "creativity". In this work, we show that we can leverage their ability to write creatively in order to better judge what…
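Steering of this kind is typically done by shifting hidden states along a learned direction at inference time. A generic activation-steering sketch, assuming a HuggingFace-style decoder exposing `model.model.layers`; the steering direction and scale `alpha` are illustrative, not from the paper:

```python
def add_steering_hook(model, layer_idx, steering_vector, alpha=4.0):
    """Register a forward hook that shifts one decoder layer's hidden
    states along a steering direction. Generic activation steering,
    not the paper's exact procedure."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vector.to(hidden)
        # Returning a value from a forward hook replaces the layer output.
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

The returned handle can later be removed with `handle.remove()` to restore the unsteered model.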
FastRM: An efficient and automatic explainability framework for multimodal generative models
Large Vision Language Models (LVLMs) have demonstrated remarkable reasoning capabilities over textual and visual inputs. However, these models remain prone to generating misinformation. Identifying and mitigating ungrounded responses is cr…
Debias your Large Multi-Modal Model at Test-Time via Non-Contrastive Visual Attribute Steering
Large Multi-Modal Models (LMMs) have demonstrated impressive capabilities as general-purpose chatbots able to engage in conversations about visual inputs. However, their responses are influenced by societal biases present in their training…
Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency
Knowledge graphs (KGs) generated by large language models (LLMs) are becoming increasingly valuable for Retrieval-Augmented Generation (RAG) applications that require knowledge-intensive reasoning. However, existing KG extraction methods p…
Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations
Large Vision Language Models (LVLMs) such as LLaVA have demonstrated impressive capabilities as general-purpose chatbots that can engage in conversations about a provided input image. However, their responses are influenced by societal bia…
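Ablating an attribute representation usually means projecting it out of the hidden states. A minimal sketch of that orthogonal projection, assuming the protected-attribute direction has already been estimated; this illustrates the general technique, not the paper's exact procedure:

```python
import torch

def ablate_direction(hidden, attribute_dir):
    """Remove the component of hidden states lying along a
    protected-attribute direction (directional ablation).
    hidden: (..., d_model); attribute_dir: (d_model,)."""
    d = attribute_dir / attribute_dir.norm()
    # Subtract the projection of each hidden vector onto d.
    return hidden - (hidden @ d).unsqueeze(-1) * d
```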
Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review
Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manus…
Quantifying and Enabling the Interpretability of CLIP-like Models
CLIP is one of the most popular foundational models and is heavily used for many vision-language tasks. However, little is known about the inner workings of CLIP. To bridge this gap we propose a study to quantify the interpretability in CL…
ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution
Detecting and attributing temperature increases driven by climate change is crucial for understanding global warming and informing adaptation strategies. However, distinguishing human-induced climate signals from natural variability remain…
Why do LLaVA Vision-Language Models Reply to Images in English?
We uncover a surprising multilingual bias occurring in a popular class of multimodal vision-language models (VLMs). Including an image in the query to a LLaVA-style VLM significantly increases the likelihood of the model returning an Engli…
SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Multimodal retrieval augmented generation (RAG) plays a crucial role in domains such as knowledge-based visual question answering (KB-VQA), where external knowledge is needed to answer a question. However, existing multimodal LLMs (MLLMs) …
L-MAGIC: Language Model Assisted Generation of Images with Coherence
In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack …
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models
In the rapidly evolving landscape of artificial intelligence, multi-modal large language models, which combine various forms of data input, are emerging as an increasingly significant area of interest. How…
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive in…
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides oppor…
SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples
While vision-language models (VLMs) have achieved remarkable performance improvements recently, there is growing evidence that these models also possess harmful biases with respect to social attributes such as gender and race. Prior studies…
NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation
Despite impressive recent advances in text-to-image diffusion models, obtaining high-quality images often requires prompt engineering by humans who have developed expertise in using them. In this work, we present NeuroPrompts, an adaptive …
LDM3D-VR: Latent Diffusion Model for 3D VR
Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of di…