Iftekhar Naim
EmbeddingGemma: Powerful and Lightweight Text Representations
We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and…
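A minimal usage sketch of a small open embedding model of this kind, via the sentence-transformers library. The checkpoint name "google/embeddinggemma-300m" is an assumption, not stated in the snippet above.

```python
# Sketch: encode a query and documents with a lightweight open embedding
# model and rank by cosine similarity. Model id is an assumption.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("google/embeddinggemma-300m")  # assumed checkpoint

docs = ["Gemma is a family of open language models.",
        "Paris is the capital of France."]
query = "open-weight LLM families"

doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

scores = doc_vecs @ query_vec          # cosine similarity on unit vectors
print(docs[int(np.argmax(scores))])
```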
On the Theoretical Limitations of Embedding-Based Retrieval
Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for an…
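A small numpy sketch of the constraint underlying such limitations: with d-dimensional embeddings, every achievable query-by-document score matrix has rank at most d, so not every pattern of relevance scores (and hence not every combination of top-k results) is realizable. Sizes below are illustrative, not from the paper.

```python
# Any score matrix Q @ D.T built from d-dimensional embeddings has
# rank <= d, regardless of how many queries and documents there are.
import numpy as np

rng = np.random.default_rng(0)
d, n_queries, n_docs = 4, 50, 100     # illustrative sizes

Q = rng.normal(size=(n_queries, d))   # query embeddings
D = rng.normal(size=(n_docs, d))      # document embeddings

scores = Q @ D.T                      # (50, 100) dot-product score matrix
print(np.linalg.matrix_rank(scores))  # never exceeds d = 4
```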
Gemini Embedding: Generalizable Embeddings from Gemini
In this report, we introduce Gemini Embedding, a state-of-the-art embedding model leveraging the power of Gemini, Google's most capable large language model. Capitalizing on Gemini's inherent multilingual and code understanding capabilities…
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora…
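A hedged sketch of the "corpus in the prompt" pattern the snippet alludes to: rather than running a retrieval pipeline, the whole (small) corpus is placed in the prompt and the long-context model answers directly. The prompt template is illustrative, not the paper's exact format.

```python
# Build a single prompt containing the entire corpus plus the question;
# the resulting string would be sent to a long-context model.
corpus = {
    "doc1": "The Eiffel Tower is in Paris.",
    "doc2": "Mount Fuji is the tallest mountain in Japan.",
}
question = "Where is the Eiffel Tower?"

context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in corpus.items())
prompt = (
    "Answer using only the documents below, citing a document id.\n\n"
    f"{context}\n\nQuestion: {question}\nAnswer:"
)
print(prompt)
```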
Gecko: Versatile Text Embeddings Distilled from Large Language Models
We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process…
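A hedged sketch of the two-step distillation idea: (1) an LLM writes a synthetic query for a seed passage; (2) candidate passages are re-scored for that query and relabeled, so the positive may end up being a different passage than the seed, with a low-scoring candidate kept as a hard negative. The `llm` stub and word-overlap scorer below are stand-ins for the real LLM and reranker, used only to keep the sketch runnable.

```python
def llm(prompt: str) -> str:
    # Stand-in for a real LLM call; returns a canned synthetic query.
    return "what city hosts the eiffel tower"

passages = [
    "The Eiffel Tower is in Paris.",
    "Paris hosts many landmarks, including the Eiffel Tower.",
    "Mount Fuji is in Japan.",
]

# Step 1: synthesize a query from a seed passage.
seed = passages[0]
query = llm(f"Write a search query answered by: {seed}")

# Step 2: re-score all passages for the query and relabel.
def tokens(s: str) -> set:
    return set(s.lower().replace(",", "").replace(".", "").split())

def overlap(a: str, b: str) -> int:
    return len(tokens(a) & tokens(b))

ranked = sorted(passages, key=lambda p: overlap(query, p), reverse=True)
positive, hard_negative = ranked[0], ranked[-1]
print("positive:", positive)          # need not be the seed passage
print("hard negative:", hard_negative)
```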
Rethinking the Role of Token Retrieval in Multi-Vector Retrieval
Multi-vector retrieval models such as ColBERT [Khattab and Zaharia, 2020] allow token-level interactions between queries and documents, and hence achieve state of the art on many information retrieval benchmarks. However, their non-linear …
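The token-level interaction the snippet refers to is ColBERT's MaxSim scoring: each query token vector takes its maximum similarity over all document token vectors, and the per-token maxima are summed. A minimal numpy sketch, with random stand-ins for real token embeddings:

```python
# Late-interaction (MaxSim) scoring between token embedding matrices.
import numpy as np

rng = np.random.default_rng(0)
q_tokens = rng.normal(size=(5, 128))    # 5 query token embeddings
d_tokens = rng.normal(size=(40, 128))   # 40 document token embeddings

sim = q_tokens @ d_tokens.T             # (5, 40) token-level similarities
score = sim.max(axis=1).sum()           # best doc token per query token, summed
print(score)
```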
TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models
Pre-trained large language models have recently achieved ground-breaking performance in a wide variety of language understanding tasks. However, the same model cannot be applied to multimodal behavior understanding tasks (e.g., video sentiment…
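A hedged sketch of the textualization idea: nonverbal signals are rendered as short natural-language descriptions and joined with the transcript, so an off-the-shelf text LM can consume them. The template is illustrative, not the paper's exact format.

```python
# Turn nonverbal cues into text and append them to the utterance.
utterance = "I guess that's fine."
nonverbal = ["frowning", "monotone voice", "long pause before speaking"]

textualized = f'{utterance} The speaker is {", ".join(nonverbal)}.'
print(textualized)  # this string can be fed to any pre-trained text LM
```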
How Does Beam Search improve Span-Level Confidence Estimation in Generative Sequence Labeling?
Sequence labeling is a core task in text understanding for IE/IR systems. Text generation models have increasingly become the go-to solution for such tasks (e.g., entity extraction and dialog slot filling). While most research has focused …
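One way beam search can sharpen span-level confidence, sketched under assumptions: instead of trusting only the top-1 decoded sequence, sum the (normalized) probabilities of all beam hypotheses that contain a given labeled span. The hypotheses and probabilities below are made up for illustration and are not from the paper.

```python
# Aggregate beam probabilities into a per-span confidence estimate.
beams = [
    ("John [person] lives in Paris [location]", 0.60),
    ("John [person] lives in Paris [organization]", 0.25),
    ("John [location] lives in Paris [location]", 0.15),
]
total = sum(p for _, p in beams)

def span_confidence(span: str) -> float:
    return sum(p for text, p in beams if span in text) / total

print(span_confidence("Paris [location]"))  # 0.75
print(span_confidence("John [person]"))     # 0.85
```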
Multi-Vector Retrieval as Sparse Alignment
Multi-vector retrieval models improve over single-vector dual encoders on many information retrieval tasks. In this paper, we cast the multi-vector retrieval problem as sparse alignment between query and document tokens. We propose AligneR…
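A small sketch of the sparse-alignment view: a sparse matrix A selects which query-token/document-token pairs contribute, and the score is sum(A * sim). Keeping exactly one aligned document token per query token recovers MaxSim-style scoring; other sparsity patterns give other models. Values are random stand-ins.

```python
# Score multi-vector retrieval as a sparse alignment over token pairs.
import numpy as np

rng = np.random.default_rng(0)
sim = rng.normal(size=(5, 40))     # query-token x doc-token similarities

A = np.zeros_like(sim)
A[np.arange(5), sim.argmax(axis=1)] = 1.0   # one aligned doc token per query token

print((A * sim).sum())             # equals sim.max(axis=1).sum() here
```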
Transforming Sequence Tagging Into A Seq2Seq Task
Pretrained, large, generative language models (LMs) have had great success in a wide range of sequence tagging and structured prediction tasks. Casting a sequence tagging task as a Seq2Seq one requires deciding the formats of the input and…
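A hedged sketch of one possible input/output format for casting tagging as Seq2Seq: the target repeats the sentence with inline label markers. This particular format is illustrative; the paper studies several such format choices.

```python
# Convert token-level BIO tags into a Seq2Seq target string.
tokens = ["John", "lives", "in", "Paris"]
tags = ["B-PER", "O", "O", "B-LOC"]

target = " ".join(
    f"{tok} [{tag[2:]}]" if tag != "O" else tok
    for tok, tag in zip(tokens, tags)
)
print(target)  # John [PER] lives in Paris [LOC]
```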
Feature-Based Decipherment for Machine Translation
Orthographic similarities across languages provide a strong signal for unsupervised probabilistic transduction (decipherment) for closely related language pairs. The existing decipherment models, however, are not well suited for exploiting…
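A sketch of one orthographic-similarity feature of the kind such decipherment models can exploit: normalized character edit distance between a source word and a candidate translation in a related language. The word pair is illustrative only.

```python
# Levenshtein distance, normalized into a [0, 1] orthographic similarity.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def orthographic_similarity(a: str, b: str) -> float:
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(orthographic_similarity("noche", "nacht"))  # cognates for "night", 0.6
```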