Oncel Tuzel
Learning from Self Critique and Refinement for Faithful LLM Summarization
Large Language Models (LLMs) often suffer from hallucinations, i.e., output content that is not grounded in the input context, when performing long-form text generation tasks such as summarization. Prior works have shown that hallucinations can …
Learning to Reason for Hallucination Span Detection
Large language models (LLMs) often generate hallucinations -- unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucin…
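To make the contrast concrete, here is a minimal sketch of the two framings: a binary label collapses everything into one bit, while span detection scores character-level overlap between predicted and gold spans. The data format and metric below are illustrative, not the paper's.

```python
# Illustrative only: binary vs. span-level hallucination detection.
from typing import List, Tuple

def binary_hallucination_label(spans: List[Tuple[int, int]]) -> bool:
    """Binary framing: does the output contain any hallucination at all?"""
    return len(spans) > 0

def span_f1(pred: List[Tuple[int, int]], gold: List[Tuple[int, int]]) -> float:
    """Span framing: character-level F1 between predicted and gold spans."""
    def to_chars(spans):
        return {i for s, e in spans for i in range(s, e)}
    p, g = to_chars(pred), to_chars(gold)
    if not p or not g:
        return float(p == g)
    prec, rec = len(p & g) / len(p), len(p & g) / len(g)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Example: the model flags characters 10-25 as unsupported.
print(binary_hallucination_label([(10, 25)]))  # True
print(span_f1([(10, 25)], [(12, 30)]))         # partial overlap -> F1 < 1
```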
Pretraining with hierarchical memories: separating long-tail and common knowledge
The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a frac…
MobileCLIP2: Improving Multi-Modal Reinforced Training
Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy…
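As a reminder of what zero-shot capability means for CLIP-style image-text models, here is a generic classification sketch using cosine similarity between an image embedding and text-prompt embeddings. The random vectors stand in for real encoders; this is not MobileCLIP's code.

```python
# Generic CLIP-style zero-shot classification sketch (stand-in embeddings).
import numpy as np

def zero_shot_classify(image_emb: np.ndarray, text_embs: np.ndarray, labels):
    """Pick the label whose text embedding is closest to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb  # cosine similarities, one per label
    return labels[int(np.argmax(sims))], sims

labels = ["a photo of a dog", "a photo of a cat"]
rng = np.random.default_rng(0)
img, txt = rng.normal(size=512), rng.normal(size=(2, 512))
print(zero_shot_classify(img, txt, labels))
```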
Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting
Vision foundation models pre-trained on massive data encode rich representations of real-world concepts, which can be adapted to downstream tasks by fine-tuning. However, fine-tuning foundation models on one task often leads to the issue o…
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Visual understanding is inherently contextual -- what we focus on in an image depends on the task at hand. For instance, given an image of a person holding a bouquet of flowers, we may focus on either the person such as their clothing, or …
TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pre…
Mutual Reinforcement of LLM Dialogue Synthesis and Summarization Capabilities for Few-Shot Dialogue Summarization
In this work, we propose Mutual Reinforcing Data Synthesis (MRDS) within LLMs to improve the few-shot dialogue summarization task. Unlike prior methods that require external knowledge, we mutually reinforce the LLM's dialogue synthesis and summ…
3D Shape Tokenization via Latent Flow Matching
We introduce a latent 3D representation that models 3D surfaces as probability density functions in 3D, i.e., p(x,y,z), with flow-matching. Our representation is specifically designed for consumption by machine learning models, offering co…
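A minimal conditional flow-matching sketch on 3D points follows, assuming the common linear interpolation path; the paper's latent tokenizer and architecture are omitted, and `velocity_net` is a stand-in network.

```python
# Minimal conditional flow matching on 3D points (illustrative sketch).
import torch
import torch.nn as nn

velocity_net = nn.Sequential(nn.Linear(4, 128), nn.SiLU(), nn.Linear(128, 3))
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def fm_loss(surface_pts: torch.Tensor) -> torch.Tensor:
    """Regress the velocity that transports noise onto points of the shape."""
    x1 = surface_pts                 # samples from the shape density p(x,y,z)
    x0 = torch.randn_like(x1)        # noise samples
    t = torch.rand(x1.shape[0], 1)   # random interpolation times in [0, 1]
    xt = (1 - t) * x0 + t * x1       # linear probability path
    target_v = x1 - x0               # velocity of the linear path
    pred_v = velocity_net(torch.cat([xt, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

pts = torch.randn(256, 3)            # stand-in surface samples
opt.zero_grad()
loss = fm_loss(pts)
loss.backward()
opt.step()
```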
FastVLM: Efficient Vision Encoding for Vision Language Models
Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high r…
Synth4Seg -- Learning Defect Data Synthesis for Defect Segmentation using Bi-level Optimization
Defect segmentation is crucial for quality control in advanced manufacturing, yet data scarcity poses challenges for state-of-the-art supervised deep learning. Synthetic defect data generation is a popular approach for mitigating data chal…
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-schoo…
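The benchmark's symbolic-template idea can be sketched as follows: names and numbers in a grade-school problem become variables, so many generated instances share one underlying solution. The template text below is invented for illustration, not taken from GSM-Symbolic.

```python
# Sketch of symbolic templating for math-benchmark variants (illustrative).
import random

TEMPLATE = ("{name} has {a} apples. {name} buys {b} more bags with "
            "{c} apples each. How many apples does {name} have now?")

def make_instance(seed: int):
    rng = random.Random(seed)
    name = rng.choice(["Sophia", "Liam", "Ava"])
    a, b, c = rng.randint(2, 20), rng.randint(2, 9), rng.randint(2, 12)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    answer = a + b * c  # every instance shares the symbolic solution a + b*c
    return question, answer

for s in range(3):
    print(*make_instance(s), sep="\n  -> ")
```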
MUSCLE: A Model Update Strategy for Compatible LLM Evolution
Large Language Models (LLMs) are regularly updated to enhance performance, typically through changes in data or architecture. Within the update process, developers often prioritize improving overall performance metrics, paying less attenti…
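One common way to quantify update compatibility is the negative flip rate: the fraction of examples the old model answered correctly that the update breaks. The sketch below illustrates that measure; it is not claimed to be MUSCLE's exact metric.

```python
# Negative flip rate: regressions introduced by a model update (illustrative).
def negative_flip_rate(old_preds, new_preds, labels) -> float:
    flips = sum(o == y and n != y
                for o, n, y in zip(old_preds, new_preds, labels))
    return flips / len(labels)

labels    = [1, 0, 1, 1, 0]
old_preds = [1, 0, 1, 0, 0]  # old model: 4/5 correct
new_preds = [1, 1, 1, 1, 0]  # new model: also 4/5 correct, but flips example 2
print(negative_flip_rate(old_preds, new_preds, labels))  # 0.2
```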
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is n…
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predeter…
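The fixed-length pipeline described above can be sketched in a few lines: documents are shuffled, concatenated into one token stream, and cut at a fixed boundary, so sequences routinely straddle unrelated documents.

```python
# Sketch of the standard concat-and-chunk pretraining pipeline.
import random

def concat_and_chunk(docs, seq_len: int, seed: int = 0):
    docs = docs[:]
    random.Random(seed).shuffle(docs)
    stream = [tok for doc in docs for tok in doc]  # one long token stream
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]

docs = [list(range(100, 105)), list(range(200, 203)), list(range(300, 308))]
for seq in concat_and_chunk(docs, seq_len=4):
    print(seq)  # chunks routinely cross document boundaries
```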
CLIP with Quality Captions: A Strong Pretraining for Vision Tasks
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentat…
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and te…
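For reference, the pairwise similarity computation in question is the standard CLIP contrastive loss, sketched below: a B x B logit matrix over all image-text pairs in a batch, with matched pairs on the diagonal. CatLIP itself replaces this with a classification objective.

```python
# Standard CLIP contrastive (InfoNCE) loss over pairwise similarities.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature  # B x B pairwise similarities
    targets = torch.arange(len(img_emb))        # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```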
Weight subcloning: direct initialization of transformers using larger pretrained ones
Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretraine…
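The shape mechanics of subcloning can be sketched by slicing a larger layer's weights into a smaller one; the paper's actual neuron-selection procedure is more careful than this naive prefix slice.

```python
# Naive weight-subcloning sketch: initialize a small layer from a big one.
import torch.nn as nn

def subclone_linear(big: nn.Linear, small: nn.Linear) -> None:
    out, inp = small.out_features, small.in_features
    small.weight.data.copy_(big.weight.data[:out, :inp])
    if small.bias is not None and big.bias is not None:
        small.bias.data.copy_(big.bias.data[:out])

big, small = nn.Linear(1024, 1024), nn.Linear(256, 256)
subclone_linear(big, small)  # small starts from the big model's weights
```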
Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications
We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. Wh…
Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models
Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high inference compute cost, these models cannot …
HUGS: Human Gaussian Splats
Recent advances in neural rendering have improved both training and rendering times by orders of magnitude. While these methods demonstrate state-of-the-art quality and speed, they are designed for photogrammetry of static scenes and do no…
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders…
TiC-CLIP: Continual Training of CLIP Models
Keeping large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any …
Novel-View Acoustic Synthesis from 3D Reconstructed Rooms
We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis. Given audio recordings from 2-4 microphones and the 3D geometry and material of a scene containing multiple unknown…
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance…
CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capab…
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications. However, their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. …
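The sparsity observation is easy to reproduce in miniature: after a ReLU, a large fraction of hidden activations are exactly zero, and their downstream compute can in principle be skipped. The layer and inputs below are random stand-ins, not an LLM.

```python
# Measuring activation sparsity after a ReLU (illustrative).
import torch
import torch.nn as nn

mlp_in = nn.Linear(512, 2048)
acts = torch.relu(mlp_in(torch.randn(32, 512)))
sparsity = (acts == 0).float().mean().item()
print(f"fraction of zero activations: {sparsity:.2f}")  # ~0.5 at random init
```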
Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models
While Automatic Speech Recognition (ASR) systems are widely used in many real-world applications, they often do not generalize well to new domains and need to be finetuned on data from these domains. However, target-domain data usually are…