Iftekhar Naim
EmbeddingGemma: Powerful and Lightweight Text Representations
We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and…
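A minimal usage sketch of a small open embedding model of this kind, via the sentence-transformers library. The checkpoint name "google/embeddinggemma-300m" is an assumption, not stated in the snippet above.

```python
# Sketch: encode a query and documents with a lightweight open embedding
# model and rank by cosine similarity. Model id is an assumption.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("google/embeddinggemma-300m")  # assumed checkpoint

docs = ["Gemma is a family of open language models.",
        "Paris is the capital of France."]
query = "open-weight LLM families"

doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

scores = doc_vecs @ query_vec          # cosine similarity on unit vectors
print(docs[int(np.argmax(scores))])
```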
On the Theoretical Limitations of Embedding-Based Retrieval
Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for an…
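A small numpy sketch of the constraint underlying such limitations: with d-dimensional embeddings, every achievable query-by-document score matrix has rank at most d, so not every pattern of relevance scores (and hence not every combination of top-k results) is realizable. Sizes below are illustrative, not from the paper.

```python
# Any score matrix Q @ D.T built from d-dimensional embeddings has
# rank <= d, regardless of how many queries and documents there are.
import numpy as np

rng = np.random.default_rng(0)
d, n_queries, n_docs = 4, 50, 100     # illustrative sizes

Q = rng.normal(size=(n_queries, d))   # query embeddings
D = rng.normal(size=(n_docs, d))      # document embeddings

scores = Q @ D.T                      # (50, 100) dot-product score matrix
print(np.linalg.matrix_rank(scores))  # never exceeds d = 4
```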
Gemini Embedding: Generalizable Embeddings from Gemini
In this report, we introduce Gemini Embedding, a state-of-the-art embedding model leveraging the power of Gemini, Google's most capable large language model. Capitalizing on Gemini's inherent multilingual and code understanding capabilities…
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora…
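A hedged sketch of the "corpus in the prompt" pattern the snippet alludes to: rather than running a retrieval pipeline, the whole (small) corpus is placed in the prompt and the long-context model answers directly. The prompt template is illustrative, not the paper's exact format.

```python
# Build a single prompt containing the entire corpus plus the question;
# the resulting string would be sent to a long-context model.
corpus = {
    "doc1": "The Eiffel Tower is in Paris.",
    "doc2": "Mount Fuji is the tallest mountain in Japan.",
}
question = "Where is the Eiffel Tower?"

context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in corpus.items())
prompt = (
    "Answer using only the documents below, citing a document id.\n\n"
    f"{context}\n\nQuestion: {question}\nAnswer:"
)
print(prompt)
```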
Gecko: Versatile Text Embeddings Distilled from Large Language Models
We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process…
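A hedged sketch of the two-step distillation idea: (1) an LLM writes a synthetic query for a seed passage; (2) candidate passages are re-scored for that query and relabeled, so the positive may end up being a different passage than the seed, with a low-scoring candidate kept as a hard negative. The `llm` stub and word-overlap scorer below are stand-ins for the real LLM and reranker, used only to keep the sketch runnable.

```python
def llm(prompt: str) -> str:
    # Stand-in for a real LLM call; returns a canned synthetic query.
    return "what city hosts the eiffel tower"

passages = [
    "The Eiffel Tower is in Paris.",
    "Paris hosts many landmarks, including the Eiffel Tower.",
    "Mount Fuji is in Japan.",
]

# Step 1: synthesize a query from a seed passage.
seed = passages[0]
query = llm(f"Write a search query answered by: {seed}")

# Step 2: re-score all passages for the query and relabel.
def tokens(s: str) -> set:
    return set(s.lower().replace(",", "").replace(".", "").split())

def overlap(a: str, b: str) -> int:
    return len(tokens(a) & tokens(b))

ranked = sorted(passages, key=lambda p: overlap(query, p), reverse=True)
positive, hard_negative = ranked[0], ranked[-1]
print("positive:", positive)          # need not be the seed passage
print("hard negative:", hard_negative)
```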
Rethinking the Role of Token Retrieval in Multi-Vector Retrieval
Multi-vector retrieval models such as ColBERT [Khattab and Zaharia, 2020] allow token-level interactions between queries and documents, and hence achieve state of the art on many information retrieval benchmarks. However, their non-linear …
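The token-level interaction the snippet refers to is ColBERT's MaxSim scoring: each query token vector takes its maximum similarity over all document token vectors, and the per-token maxima are summed. A minimal numpy sketch, with random stand-ins for real token embeddings:

```python
# Late-interaction (MaxSim) scoring between token embedding matrices.
import numpy as np

rng = np.random.default_rng(0)
q_tokens = rng.normal(size=(5, 128))    # 5 query token embeddings
d_tokens = rng.normal(size=(40, 128))   # 40 document token embeddings

sim = q_tokens @ d_tokens.T             # (5, 40) token-level similarities
score = sim.max(axis=1).sum()           # best doc token per query token, summed
print(score)
```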
TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models
Pre-trained large language models have recently achieved ground-breaking performance in a wide variety of language understanding tasks. However, the same model cannot be applied to multimodal behavior understanding tasks (e.g., video sentiment…
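A hedged sketch of the textualization idea: nonverbal signals are rendered as short natural-language descriptions and joined with the transcript, so an off-the-shelf text LM can consume them. The template is illustrative, not the paper's exact format.

```python
# Turn nonverbal cues into text and append them to the utterance.
utterance = "I guess that's fine."
nonverbal = ["frowning", "monotone voice", "long pause before speaking"]

textualized = f'{utterance} The speaker is {", ".join(nonverbal)}.'
print(textualized)  # this string can be fed to any pre-trained text LM
```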
How Does Beam Search improve Span-Level Confidence Estimation in Generative Sequence Labeling?
Sequence labeling is a core task in text understanding for IE/IR systems. Text generation models have increasingly become the go-to solution for such tasks (e.g., entity extraction and dialog slot filling). While most research has focused …
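One way beam search can sharpen span-level confidence, sketched under assumptions: instead of trusting only the top-1 decoded sequence, sum the (normalized) probabilities of all beam hypotheses that contain a given labeled span. The hypotheses and probabilities below are made up for illustration and are not from the paper.

```python
# Aggregate beam probabilities into a per-span confidence estimate.
beams = [
    ("John [person] lives in Paris [location]", 0.60),
    ("John [person] lives in Paris [organization]", 0.25),
    ("John [location] lives in Paris [location]", 0.15),
]
total = sum(p for _, p in beams)

def span_confidence(span: str) -> float:
    return sum(p for text, p in beams if span in text) / total

print(span_confidence("Paris [location]"))  # 0.75
print(span_confidence("John [person]"))     # 0.85
```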
Multi-Vector Retrieval as Sparse Alignment
Multi-vector retrieval models improve over single-vector dual encoders on many information retrieval tasks. In this paper, we cast the multi-vector retrieval problem as sparse alignment between query and document tokens. We propose AligneR…
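A small sketch of the sparse-alignment view: a sparse matrix A selects which query-token/document-token pairs contribute, and the score is sum(A * sim). Keeping exactly one aligned document token per query token recovers MaxSim-style scoring; other sparsity patterns give other models. Values are random stand-ins.

```python
# Score multi-vector retrieval as a sparse alignment over token pairs.
import numpy as np

rng = np.random.default_rng(0)
sim = rng.normal(size=(5, 40))     # query-token x doc-token similarities

A = np.zeros_like(sim)
A[np.arange(5), sim.argmax(axis=1)] = 1.0   # one aligned doc token per query token

print((A * sim).sum())             # equals sim.max(axis=1).sum() here
```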
Transforming Sequence Tagging Into A Seq2Seq Task
Pretrained, large, generative language models (LMs) have had great success in a wide range of sequence tagging and structured prediction tasks. Casting a sequence tagging task as a Seq2Seq one requires deciding the formats of the input and…
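A hedged sketch of one possible input/output format for casting tagging as Seq2Seq: the target repeats the sentence with inline label markers. This particular format is illustrative; the paper studies several such format choices.

```python
# Convert token-level BIO tags into a Seq2Seq target string.
tokens = ["John", "lives", "in", "Paris"]
tags = ["B-PER", "O", "O", "B-LOC"]

target = " ".join(
    f"{tok} [{tag[2:]}]" if tag != "O" else tok
    for tok, tag in zip(tokens, tags)
)
print(target)  # John [PER] lives in Paris [LOC]
```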
Feature-Based Decipherment for Machine Translation
Orthographic similarities across languages provide a strong signal for unsupervised probabilistic transduction (decipherment) for closely related language pairs. The existing decipherment models, however, are not well suited for exploiting…
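A sketch of one orthographic-similarity feature of the kind such decipherment models can exploit: normalized character edit distance between a source word and a candidate translation in a related language. The word pair is illustrative only.

```python
# Levenshtein distance, normalized into a [0, 1] orthographic similarity.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def orthographic_similarity(a: str, b: str) -> float:
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(orthographic_similarity("noche", "nacht"))  # cognates for "night", 0.6
```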