Sivan Doveh
TTRV: Test-Time Reinforcement Learning for Vision Language Models
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose T…
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth respon…
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instru…
Teaching VLMs to Localize Specific Objects from In-context Examples
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these adva…
Augmenting In-Context-Learning in LLMs via Automatic Data Labeling and Refinement
It has been shown that Large Language Models' (LLMs) performance can be improved for many tasks using Chain of Thought (CoT) or In-Context Learning (ICL), which involve demonstrating the steps needed to solve a task using a few examples. H…
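As a rough illustration of the in-context learning the abstract refers to, the sketch below assembles a few worked demonstrations into a prompt before the actual query. The demonstrations and the format are illustrative only, not taken from the paper.

```python
# Minimal few-shot (ICL/CoT-style) prompt construction: worked demonstrations
# are prepended to the query so the model can imitate the reasoning steps.
demonstrations = [
    ("Q: A shelf has 3 rows of 4 books. How many books are there?",
     "A: 3 rows x 4 books per row = 12 books. The answer is 12."),
    ("Q: Tom had 15 apples and gave away 6. How many are left?",
     "A: 15 - 6 = 9. The answer is 9."),
]

def build_icl_prompt(question: str) -> str:
    shots = "\n\n".join(f"{q}\n{a}" for q, a in demonstrations)
    return f"{shots}\n\nQ: {question}\nA:"

print(build_icl_prompt("A train travels 60 km per hour for 2 hours. How far does it go?"))
```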
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside…
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
In this work, we propose GLOV, which enables Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. GLOV prompts an LLM with the downstream task description, queryin…
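The truncated abstract suggests an iterative loop in which an LLM is given the downstream task description and queried for candidate prompts, which are then evaluated with the VLM. The sketch below is an assumed reading of that loop; the proposal and scoring functions are trivial stand-ins, not GLOV's actual components.

```python
# Hedged sketch of an "LLM as implicit prompt optimizer" loop for a VLM.
from typing import List, Tuple

def llm_propose(task_description: str,
                scored: List[Tuple[str, float]]) -> List[str]:
    # Stand-in for an LLM call: in practice the LLM would see the task
    # description plus the best prompts found so far and propose new ones.
    return [f"A photo of a {{label}}, relevant to: {task_description}",
            f"{{label}} ({task_description})"]

def vlm_score(prompt_template: str) -> float:
    # Stand-in for measuring the VLM's downstream accuracy with this template
    # on a small validation set; here just a dummy length-based score.
    return 1.0 / (1 + len(prompt_template))

def optimize_prompt(task_description: str, n_iters: int = 3) -> str:
    scored: List[Tuple[str, float]] = []
    for _ in range(n_iters):
        for candidate in llm_propose(task_description, scored):
            scored.append((candidate, vlm_score(candidate)))
    return max(scored, key=lambda pair: pair[1])[0]

print(optimize_prompt("classify bird species in photos"))
```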
Comparison Visual Instruction Tuning
Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually re…
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkab…
NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning
Language models struggle with handling numerical data and performing arithmetic operations. We hypothesize that this limitation can be partially attributed to the non-intuitive textual representation of numbers. When a digit is read or generated …
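The title points to an alternative textual encoding of numbers. As a hedged sketch, the snippet below prefixes each number with its digit count so a left-to-right reader knows the place value up front; the exact "{digits:value}" format is an assumption, not necessarily the paper's.

```python
import re

def encode_numbers(text: str) -> str:
    """Rewrite every integer so its digit count precedes it.

    Illustrative only: the exact NumeroLogic format is not given in the
    truncated abstract; "{<n_digits>:<number>}" is an assumed placeholder.
    """
    return re.sub(r"\d+", lambda m: f"{{{len(m.group(0))}:{m.group(0)}}}", text)

def decode_numbers(text: str) -> str:
    """Strip the assumed digit-count prefix, recovering the plain text."""
    return re.sub(r"\{\d+:(\d+)\}", r"\1", text)

prompt = "17 + 2458 ="
encoded = encode_numbers(prompt)          # "{2:17} + {4:2458} ="
assert decode_numbers(encoded) == prompt
print(encoded)
```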
Towards Multimodal In-Context Learning for Vision & Language Models
State-of-the-art Vision-Language Models (VLMs) ground the vision and language modalities primarily by projecting the vision tokens from the encoder to language-like tokens, which are directly fed to the Large Language Model (LLM) decode…
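The visible part of the abstract describes the common design in which vision-encoder tokens are projected into the LLM's embedding space and placed before the text embeddings. A minimal sketch of that step follows; the dimensions and the single-linear-layer projector are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Map vision-encoder patch tokens into the LLM's embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, n_patches, vision_dim)
        return self.proj(vision_tokens)  # (batch, n_patches, llm_dim)

# The projected "language-like" tokens are simply prepended to the text
# embeddings before the LLM decoder processes the joint sequence.
projector = VisionToLLMProjector()
vision_tokens = torch.randn(1, 256, 1024)   # e.g. 256 ViT patch tokens
text_embeds = torch.randn(1, 32, 4096)      # embedded prompt tokens
llm_input = torch.cat([projector(vision_tokens), text_embeds], dim=1)
```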
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the …
Going Beyond Nouns With Vision & Language Models Using Synthetic Data
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural lang…
MAEDAY: MAE for few and zero shot AnomalY-Detection
We propose using Masked Auto-Encoder (MAE), a transformer model trained in a self-supervised manner on image inpainting, for anomaly detection (AD), under the assumption that anomalous regions are harder to reconstruct than normal ones. MAEDAY is the fir…
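The abstract states the core assumption: an inpainting-trained MAE should reconstruct normal regions well and anomalous regions poorly, so reconstruction error can serve as an anomaly signal. The sketch below scores an image that way; the random masking scheme, error aggregation, and the placeholder reconstruct function are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def anomaly_map_from_reconstruction(image: np.ndarray,
                                    reconstruct,
                                    mask_ratio: float = 0.75,
                                    n_rounds: int = 4,
                                    seed: int = 0) -> np.ndarray:
    """Per-pixel anomaly score from masked-reconstruction error.

    `reconstruct(masked_image, mask)` stands in for a pretrained MAE that
    inpaints the hidden pixels; it is a placeholder, not the paper's model.
    Regions the model fails to inpaint well receive high scores.
    """
    rng = np.random.default_rng(seed)
    error = np.zeros(image.shape[:2], dtype=np.float32)
    counts = np.zeros(image.shape[:2], dtype=np.float32)
    for _ in range(n_rounds):
        mask = rng.random(image.shape[:2]) < mask_ratio      # True = hidden
        recon = reconstruct(np.where(mask[..., None], 0, image), mask)
        err = np.abs(recon.astype(np.float32) - image.astype(np.float32)).mean(axis=-1)
        error += err * mask
        counts += mask
    return error / np.maximum(counts, 1e-6)

# Quick smoke test with a trivial stand-in "model" that returns its input.
img = np.random.rand(64, 64, 3)
dummy_reconstruct = lambda masked, mask: masked
print(anomaly_map_from_reconstruction(img, dummy_reconstruct).shape)
```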
Teaching Structured Vision&Language Concepts to Vision&Language Models
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vi…
Detector-Free Weakly Supervised Grounding by Separation
Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to groun…
StarNet: towards Weakly Supervised Few-Shot Object Detection
Few-shot detection and classification have advanced significantly in recent years. Yet, detection approaches require strong annotation (bounding boxes) both for pre-training and for adaptation to novel classes, and classification approache…
StarNet: towards weakly supervised few-shot detection and explainable few-shot classification
Few-shot learning for classification has advanced significantly in recent years. Yet, these approaches rarely provide interpretability related to their decisions or localization of objects in the scene. In this paper, we introduce StarNet,…
ASAP: Architecture Search, Anneal and Prune
Automatic methods for Neural Architecture Search (NAS) have been shown to produce state-of-the-art network models. Yet, their main drawback is the computational complexity of the search process. As some primal methods optimized over a disc…