Sivan Doveh
TTRV: Test-Time Reinforcement Learning for Vision Language Models
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose T…
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth respon…
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instru…
Teaching VLMs to Localize Specific Objects from In-context Examples
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these adva…
Augmenting In-Context-Learning in LLMs via Automatic Data Labeling and Refinement
It has been shown that Large Language Models' (LLMs) performance can be improved for many tasks using Chain of Thought (CoT) or In-Context Learning (ICL), which involve demonstrating the steps needed to solve a task using a few examples. H…
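As a rough illustration of the in-context learning the abstract refers to, the sketch below assembles a few worked demonstrations into a prompt before the actual query. The demonstrations and the format are illustrative only, not taken from the paper.

```python
# Minimal few-shot (ICL/CoT-style) prompt construction: worked demonstrations
# are prepended to the query so the model can imitate the reasoning steps.
demonstrations = [
    ("Q: A shelf has 3 rows of 4 books. How many books are there?",
     "A: 3 rows x 4 books per row = 12 books. The answer is 12."),
    ("Q: Tom had 15 apples and gave away 6. How many are left?",
     "A: 15 - 6 = 9. The answer is 9."),
]

def build_icl_prompt(question: str) -> str:
    shots = "\n\n".join(f"{q}\n{a}" for q, a in demonstrations)
    return f"{shots}\n\nQ: {question}\nA:"

print(build_icl_prompt("A train travels 60 km per hour for 2 hours. How far does it go?"))
```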
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside…
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
In this work, we propose GLOV, which enables Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. GLOV prompts an LLM with the downstream task description, queryin…
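The truncated abstract suggests an iterative loop in which an LLM is given the downstream task description and queried for candidate prompts, which are then evaluated with the VLM. The sketch below is an assumed reading of that loop; the proposal and scoring functions are trivial stand-ins, not GLOV's actual components.

```python
# Hedged sketch of an "LLM as implicit prompt optimizer" loop for a VLM.
from typing import List, Tuple

def llm_propose(task_description: str,
                scored: List[Tuple[str, float]]) -> List[str]:
    # Stand-in for an LLM call: in practice the LLM would see the task
    # description plus the best prompts found so far and propose new ones.
    return [f"A photo of a {{label}}, relevant to: {task_description}",
            f"{{label}} ({task_description})"]

def vlm_score(prompt_template: str) -> float:
    # Stand-in for measuring the VLM's downstream accuracy with this template
    # on a small validation set; here just a dummy length-based score.
    return 1.0 / (1 + len(prompt_template))

def optimize_prompt(task_description: str, n_iters: int = 3) -> str:
    scored: List[Tuple[str, float]] = []
    for _ in range(n_iters):
        for candidate in llm_propose(task_description, scored):
            scored.append((candidate, vlm_score(candidate)))
    return max(scored, key=lambda pair: pair[1])[0]

print(optimize_prompt("classify bird species in photos"))
```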
Comparison Visual Instruction Tuning
Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually re…
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkab…
NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning
Language models struggle with handling numerical data and performing arithmetic operations. We hypothesize that this limitation can be partially attributed to the non-intuitive textual representation of numbers. When a digit is read or generated …
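The title points to an alternative textual encoding of numbers. As a hedged sketch, the snippet below prefixes each number with its digit count so a left-to-right reader knows the place value up front; the exact "{digits:value}" format is an assumption, not necessarily the paper's.

```python
import re

def encode_numbers(text: str) -> str:
    """Rewrite every integer so its digit count precedes it.

    Illustrative only: the exact NumeroLogic format is not given in the
    truncated abstract; "{<n_digits>:<number>}" is an assumed placeholder.
    """
    return re.sub(r"\d+", lambda m: f"{{{len(m.group(0))}:{m.group(0)}}}", text)

def decode_numbers(text: str) -> str:
    """Strip the assumed digit-count prefix, recovering the plain text."""
    return re.sub(r"\{\d+:(\d+)\}", r"\1", text)

prompt = "17 + 2458 ="
encoded = encode_numbers(prompt)          # "{2:17} + {4:2458} ="
assert decode_numbers(encoded) == prompt
print(encoded)
```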
Towards Multimodal In-Context Learning for Vision & Language Models
State-of-the-art Vision-Language Models (VLMs) ground the vision and language modalities primarily by projecting the vision tokens from the encoder to language-like tokens, which are directly fed to the Large Language Model (LLM) decode…
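The visible part of the abstract describes the common design in which vision-encoder tokens are projected into the LLM's embedding space and placed before the text embeddings. A minimal sketch of that step follows; the dimensions and the single-linear-layer projector are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Map vision-encoder patch tokens into the LLM's embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, n_patches, vision_dim)
        return self.proj(vision_tokens)  # (batch, n_patches, llm_dim)

# The projected "language-like" tokens are simply prepended to the text
# embeddings before the LLM decoder processes the joint sequence.
projector = VisionToLLMProjector()
vision_tokens = torch.randn(1, 256, 1024)   # e.g. 256 ViT patch tokens
text_embeds = torch.randn(1, 32, 4096)      # embedded prompt tokens
llm_input = torch.cat([projector(vision_tokens), text_embeds], dim=1)
```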
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the …
Going Beyond Nouns With Vision & Language Models Using Synthetic Data
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural lang…
MAEDAY: MAE for few and zero shot AnomalY-Detection
We propose using Masked Auto-Encoder (MAE), a transformer model trained in a self-supervised manner on image inpainting, for anomaly detection (AD), under the assumption that anomalous regions are harder to reconstruct than normal ones. MAEDAY is the fir…
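The abstract states the core assumption: an inpainting-trained MAE should reconstruct normal regions well and anomalous regions poorly, so reconstruction error can serve as an anomaly signal. The sketch below scores an image that way; the random masking scheme, error aggregation, and the placeholder reconstruct function are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def anomaly_map_from_reconstruction(image: np.ndarray,
                                    reconstruct,
                                    mask_ratio: float = 0.75,
                                    n_rounds: int = 4,
                                    seed: int = 0) -> np.ndarray:
    """Per-pixel anomaly score from masked-reconstruction error.

    `reconstruct(masked_image, mask)` stands in for a pretrained MAE that
    inpaints the hidden pixels; it is a placeholder, not the paper's model.
    Regions the model fails to inpaint well receive high scores.
    """
    rng = np.random.default_rng(seed)
    error = np.zeros(image.shape[:2], dtype=np.float32)
    counts = np.zeros(image.shape[:2], dtype=np.float32)
    for _ in range(n_rounds):
        mask = rng.random(image.shape[:2]) < mask_ratio      # True = hidden
        recon = reconstruct(np.where(mask[..., None], 0, image), mask)
        err = np.abs(recon.astype(np.float32) - image.astype(np.float32)).mean(axis=-1)
        error += err * mask
        counts += mask
    return error / np.maximum(counts, 1e-6)

# Quick smoke test with a trivial stand-in "model" that returns its input.
img = np.random.rand(64, 64, 3)
dummy_reconstruct = lambda masked, mask: masked
print(anomaly_map_from_reconstruction(img, dummy_reconstruct).shape)
```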
Teaching Structured Vision&Language Concepts to Vision&Language Models
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vi…
Detector-Free Weakly Supervised Grounding by Separation
Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to groun…
StarNet: towards Weakly Supervised Few-Shot Object Detection
Few-shot detection and classification have advanced significantly in recent years. Yet, detection approaches require strong annotation (bounding boxes) both for pre-training and for adaptation to novel classes, and classification approache…
StarNet: towards weakly supervised few-shot detection and explainable few-shot classification
Few-shot learning for classification has advanced significantly in recent years. Yet, these approaches rarely provide interpretability related to their decisions or localization of objects in the scene. In this paper, we introduce StarNet,…
ASAP: Architecture Search, Anneal and Prune
Automatic methods for Neural Architecture Search (NAS) have been shown to produce state-of-the-art network models. Yet, their main drawback is the computational complexity of the search process. As some primal methods optimized over a disc…