Oncel Tuzel
Learning from Self Critique and Refinement for Faithful LLM Summarization
Large Language Models (LLMs) often suffer from hallucinations, i.e., output content that is not grounded in the input context, when performing long-form text generation tasks such as summarization. Prior works have shown that hallucinations can …
Learning to Reason for Hallucination Span Detection
Large language models (LLMs) often generate hallucinations -- unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucin…
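To make the contrast concrete, here is a minimal sketch of the two framings: a binary label collapses everything into one bit, while span detection scores character-level overlap between predicted and gold spans. The data format and metric below are illustrative, not the paper's.

```python
# Illustrative only: binary vs. span-level hallucination detection.
from typing import List, Tuple

def binary_hallucination_label(spans: List[Tuple[int, int]]) -> bool:
    """Binary framing: does the output contain any hallucination at all?"""
    return len(spans) > 0

def span_f1(pred: List[Tuple[int, int]], gold: List[Tuple[int, int]]) -> float:
    """Span framing: character-level F1 between predicted and gold spans."""
    def to_chars(spans):
        return {i for s, e in spans for i in range(s, e)}
    p, g = to_chars(pred), to_chars(gold)
    if not p or not g:
        return float(p == g)
    prec, rec = len(p & g) / len(p), len(p & g) / len(g)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Example: the model flags characters 10-25 as unsupported.
print(binary_hallucination_label([(10, 25)]))  # True
print(span_f1([(10, 25)], [(12, 30)]))         # partial overlap -> F1 < 1
```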
Pretraining with hierarchical memories: separating long-tail and common knowledge
The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a frac…
MobileCLIP2: Improving Multi-Modal Reinforced Training
Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy…
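As a reminder of what zero-shot capability means for CLIP-style image-text models, here is a generic classification sketch using cosine similarity between an image embedding and text-prompt embeddings. The random vectors stand in for real encoders; this is not MobileCLIP's code.

```python
# Generic CLIP-style zero-shot classification sketch (stand-in embeddings).
import numpy as np

def zero_shot_classify(image_emb: np.ndarray, text_embs: np.ndarray, labels):
    """Pick the label whose text embedding is closest to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb  # cosine similarities, one per label
    return labels[int(np.argmax(sims))], sims

labels = ["a photo of a dog", "a photo of a cat"]
rng = np.random.default_rng(0)
img, txt = rng.normal(size=512), rng.normal(size=(2, 512))
print(zero_shot_classify(img, txt, labels))
```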
Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting
Vision foundation models pre-trained on massive data encode rich representations of real-world concepts, which can be adapted to downstream tasks by fine-tuning. However, fine-tuning foundation models on one task often leads to the issue o…
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Visual understanding is inherently contextual -- what we focus on in an image depends on the task at hand. For instance, given an image of a person holding a bouquet of flowers, we may focus on either the person such as their clothing, or …
TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pre…
Mutual Reinforcement of LLM Dialogue Synthesis and Summarization Capabilities for Few-Shot Dialogue Summarization
In this work, we propose Mutual Reinforcing Data Synthesis (MRDS) within LLMs to improve the few-shot dialogue summarization task. Unlike prior methods that require external knowledge, we mutually reinforce the LLM's dialogue synthesis and summ…
3D Shape Tokenization via Latent Flow Matching
We introduce a latent 3D representation that models 3D surfaces as probability density functions in 3D, i.e., p(x,y,z), with flow-matching. Our representation is specifically designed for consumption by machine learning models, offering co…
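A minimal conditional flow-matching sketch on 3D points follows, assuming the common linear interpolation path; the paper's latent tokenizer and architecture are omitted, and `velocity_net` is a stand-in network.

```python
# Minimal conditional flow matching on 3D points (illustrative sketch).
import torch
import torch.nn as nn

velocity_net = nn.Sequential(nn.Linear(4, 128), nn.SiLU(), nn.Linear(128, 3))
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def fm_loss(surface_pts: torch.Tensor) -> torch.Tensor:
    """Regress the velocity that transports noise onto points of the shape."""
    x1 = surface_pts                 # samples from the shape density p(x,y,z)
    x0 = torch.randn_like(x1)        # noise samples
    t = torch.rand(x1.shape[0], 1)   # random interpolation times in [0, 1]
    xt = (1 - t) * x0 + t * x1       # linear probability path
    target_v = x1 - x0               # velocity of the linear path
    pred_v = velocity_net(torch.cat([xt, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

pts = torch.randn(256, 3)            # stand-in surface samples
opt.zero_grad()
loss = fm_loss(pts)
loss.backward()
opt.step()
```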
FastVLM: Efficient Vision Encoding for Vision Language Models
Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high r…
Synth4Seg -- Learning Defect Data Synthesis for Defect Segmentation using Bi-level Optimization
Defect segmentation is crucial for quality control in advanced manufacturing, yet data scarcity poses challenges for state-of-the-art supervised deep learning. Synthetic defect data generation is a popular approach for mitigating data chal…
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-schoo…
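The benchmark's symbolic-template idea can be sketched as follows: names and numbers in a grade-school problem become variables, so many generated instances share one underlying solution. The template text below is invented for illustration, not taken from GSM-Symbolic.

```python
# Sketch of symbolic templating for math-benchmark variants (illustrative).
import random

TEMPLATE = ("{name} has {a} apples. {name} buys {b} more bags with "
            "{c} apples each. How many apples does {name} have now?")

def make_instance(seed: int):
    rng = random.Random(seed)
    name = rng.choice(["Sophia", "Liam", "Ava"])
    a, b, c = rng.randint(2, 20), rng.randint(2, 9), rng.randint(2, 12)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    answer = a + b * c  # every instance shares the symbolic solution a + b*c
    return question, answer

for s in range(3):
    print(*make_instance(s), sep="\n  -> ")
```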
MUSCLE: A Model Update Strategy for Compatible LLM Evolution
Large Language Models (LLMs) are regularly updated to enhance performance, typically through changes in data or architecture. Within the update process, developers often prioritize improving overall performance metrics, paying less attenti…
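One common way to quantify update compatibility is the negative flip rate: the fraction of examples the old model answered correctly that the update breaks. The sketch below illustrates that measure; it is not claimed to be MUSCLE's exact metric.

```python
# Negative flip rate: regressions introduced by a model update (illustrative).
def negative_flip_rate(old_preds, new_preds, labels) -> float:
    flips = sum(o == y and n != y
                for o, n, y in zip(old_preds, new_preds, labels))
    return flips / len(labels)

labels    = [1, 0, 1, 1, 0]
old_preds = [1, 0, 1, 0, 0]  # old model: 4/5 correct
new_preds = [1, 1, 1, 1, 0]  # new model: also 4/5 correct, but flips example 2
print(negative_flip_rate(old_preds, new_preds, labels))  # 0.2
```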
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is n…
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predeter…
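The fixed-length pipeline described above can be sketched in a few lines: documents are shuffled, concatenated into one token stream, and cut at a fixed boundary, so sequences routinely straddle unrelated documents.

```python
# Sketch of the standard concat-and-chunk pretraining pipeline.
import random

def concat_and_chunk(docs, seq_len: int, seed: int = 0):
    docs = docs[:]
    random.Random(seed).shuffle(docs)
    stream = [tok for doc in docs for tok in doc]  # one long token stream
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]

docs = [list(range(100, 105)), list(range(200, 203)), list(range(300, 308))]
for seq in concat_and_chunk(docs, seq_len=4):
    print(seq)  # chunks routinely cross document boundaries
```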
CLIP with Quality Captions: A Strong Pretraining for Vision Tasks
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentat…
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and te…
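For reference, the pairwise similarity computation in question is the standard CLIP contrastive loss, sketched below: a B x B logit matrix over all image-text pairs in a batch, with matched pairs on the diagonal. CatLIP itself replaces this with a classification objective.

```python
# Standard CLIP contrastive (InfoNCE) loss over pairwise similarities.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature  # B x B pairwise similarities
    targets = torch.arange(len(img_emb))        # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```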
Weight subcloning: direct initialization of transformers using larger pretrained ones
Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretraine…
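The shape mechanics of subcloning can be sketched by slicing a larger layer's weights into a smaller one; the paper's actual neuron-selection procedure is more careful than this naive prefix slice.

```python
# Naive weight-subcloning sketch: initialize a small layer from a big one.
import torch.nn as nn

def subclone_linear(big: nn.Linear, small: nn.Linear) -> None:
    out, inp = small.out_features, small.in_features
    small.weight.data.copy_(big.weight.data[:out, :inp])
    if small.bias is not None and big.bias is not None:
        small.bias.data.copy_(big.bias.data[:out])

big, small = nn.Linear(1024, 1024), nn.Linear(256, 256)
subclone_linear(big, small)  # small starts from the big model's weights
```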
Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications
We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. Wh…
Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models
Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high inference compute cost, these models cannot …
HUGS: Human Gaussian Splats
Recent advances in neural rendering have improved both training and rendering times by orders of magnitude. While these methods demonstrate state-of-the-art quality and speed, they are designed for photogrammetry of static scenes and do no…
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders…
TiC-CLIP: Continual Training of CLIP Models
Keeping large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any …
Novel-View Acoustic Synthesis from 3D Reconstructed Rooms
We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis. Given audio recordings from 2-4 microphones and the 3D geometry and material of a scene containing multiple unknown…
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance…
CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capab…
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications. However, their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. …
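The sparsity observation is easy to reproduce in miniature: after a ReLU, a large fraction of hidden activations are exactly zero, and their downstream compute can in principle be skipped. The layer and inputs below are random stand-ins, not an LLM.

```python
# Measuring activation sparsity after a ReLU (illustrative).
import torch
import torch.nn as nn

mlp_in = nn.Linear(512, 2048)
acts = torch.relu(mlp_in(torch.randn(32, 512)))
sparsity = (acts == 0).float().mean().item()
print(f"fraction of zero activations: {sparsity:.2f}")  # ~0.5 at random init
```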
Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models
While Automatic Speech Recognition (ASR) systems are widely used in many real-world applications, they often do not generalize well to new domains and need to be finetuned on data from these domains. However, target-domain data usually are…