Arnold Overwijk
Group-Level Data Selection for Efficient Pretraining
In this paper, we introduce Group-MATES, an efficient group-level data selection approach to optimize the speed-quality frontier of language model pretraining. Specifically, Group-MATES parameterizes costly group-level selection with a rel…
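The snippet above cuts off before the method details, but the general shape of group-level data selection can be sketched: score candidate groups of pretraining documents with a cheap proxy for their joint training value, then keep the highest-scoring groups under a token budget. The following is a generic Python illustration of that recipe, in which influence_score is a hypothetical placeholder rather than the Group-MATES model:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy corpus: 1,000 documents with random lengths, partitioned into groups of 8.
    doc_lengths = rng.integers(100, 2000, size=1000)
    groups = doc_lengths.reshape(-1, 8)  # 125 groups of 8 documents each

    def influence_score(group):
        """Hypothetical stand-in for a learned group-level influence model.

        A real system would predict how much pretraining on this group as a
        whole (not each document independently) improves a held-out objective.
        Here we use a noisy function of document length as a placeholder.
        """
        return float(group.mean() / 2000 + rng.normal(0, 0.05))

    # Score every group, then greedily keep the best groups under a token budget.
    scores = np.array([influence_score(g) for g in groups])
    order = np.argsort(-scores)

    budget, selected, used = 200_000, [], 0
    for idx in order:
        cost = int(groups[idx].sum())
        if used + cost <= budget:
            selected.append(idx)
            used += cost

    print(f"kept {len(selected)} of {len(groups)} groups, {used} tokens")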
Improving Multitask Retrieval by Promoting Task Specialization
In multitask retrieval, a single retriever is trained to retrieve relevant contexts for multiple tasks. Despite its practical appeal, naive multitask retrieval lags behind task-specific retrieval in which a separate retriever is trained fo…
Augmenting Zero-Shot Dense Retrievers with Plug-in Mixture-of-Memories
In this paper we improve the zero-shot generalization ability of language models via Mixture-Of-Memory Augmentation (MoMA), a mechanism that retrieves augmentation documents from multiple information corpora ("external memories"), with the…
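As a rough sketch of the mixture-of-memory idea, retrieval runs against several corpora ("memories") and the candidates are pooled by score, so the augmentation set can draw from whichever memory best matches the query. Everything below (corpus names, random embeddings, brute-force dot-product search) is placeholder scaffolding, not MoMA's actual components:

    import numpy as np

    rng = np.random.default_rng(1)
    dim = 32

    # Three toy "external memories": each corpus is a matrix of document embeddings.
    memories = {
        "wikipedia": rng.normal(size=(500, dim)),
        "pubmed": rng.normal(size=(300, dim)),
        "web": rng.normal(size=(800, dim)),
    }

    def search(query_vec, corpus, k):
        # Dot-product retrieval; a real system would use an ANN index instead.
        scores = corpus @ query_vec
        top = np.argsort(-scores)[:k]
        return [(int(i), float(scores[i])) for i in top]

    query = rng.normal(size=dim)

    # Retrieve from every memory, then pool candidates by score so the
    # augmentation documents can mix sources.
    pooled = []
    for name, corpus in memories.items():
        pooled += [(name, i, s) for i, s in search(query, corpus, k=5)]
    pooled.sort(key=lambda t: -t[2])

    for name, i, s in pooled[:5]:
        print(f"{name:10s} doc {i:4d} score {s:.2f}")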
ClueWeb22: 10 Billion Web Documents with Visual and Semantic Information
ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages affiliated with rich information. Its design was influenced by the need for a high-quality, large-scale web corpus to support a range of academi…
Reduce Catastrophic Forgetting of Dense Retrieval Training with Teleportation Negatives
In this paper, we investigate the instability of standard dense retrieval training, which iterates between model training and hard negative selection using the model being trained. We show the catastrophic forgetting phenomena behind t…
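The loop the abstract refers to is the standard iterative recipe: train the retriever for a while, re-mine hard negatives with the current checkpoint, then train again. Because each round's negatives come only from the newest model, behavior learned against earlier negative distributions can be forgotten. Below is a toy version of that standard loop, with a placeholder linear retriever and hinge loss; it illustrates the setup being analyzed, not the paper's teleportation-negative remedy:

    import numpy as np

    rng = np.random.default_rng(2)
    dim, n_docs = 16, 200
    docs = rng.normal(size=(n_docs, dim))
    queries = rng.normal(size=(50, dim))
    positives = rng.integers(0, n_docs, size=50)  # toy relevance labels

    W = np.eye(dim)  # stand-in for the retriever's parameters

    def mine_hard_negatives(W):
        """Pick, for each query, the highest-scoring non-positive document
        under the *current* model -- the step that makes training unstable."""
        scores = (queries @ W) @ docs.T
        scores[np.arange(len(queries)), positives] = -np.inf
        return scores.argmax(axis=1)

    for round_ in range(3):
        negs = mine_hard_negatives(W)      # episode: freeze model, mine negatives
        for _ in range(100):               # episode: train against those negatives
            i = rng.integers(0, len(queries))
            q, d_pos, d_neg = queries[i] @ W, docs[positives[i]], docs[negs[i]]
            margin = q @ d_neg - q @ d_pos
            if margin > -1.0:  # hinge active: push positive above negative
                W += 0.01 * np.outer(queries[i], d_pos - d_neg)
        mean_margin = np.mean((queries @ W * docs[positives]).sum(1)
                              - (queries @ W * docs[negs]).sum(1))
        print(f"round {round_}: re-mined negatives, mean margin {mean_margin:.2f}")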
COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning
We present a new zero-shot dense retrieval (ZeroDR) method, COCO-DR, to improve the generalization ability of dense retrieval by combating the distribution shifts between source training tasks and target scenarios. To mitigate the impact o…
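The distributionally robust half of such a recipe can be illustrated with a standard group-DRO style update: maintain one weight per source-task group and exponentially upweight groups whose training loss is high, so optimization does not overfit the easy, over-represented sources. This is a generic sketch with made-up per-group losses, not COCO-DR's exact objective:

    import numpy as np

    rng = np.random.default_rng(3)
    n_groups, eta = 4, 0.5
    weights = np.full(n_groups, 1.0 / n_groups)  # one weight per source-task group

    for step in range(5):
        # Placeholder per-group losses; a real run computes these from the
        # retriever's contrastive loss on each group's minibatch.
        group_losses = rng.uniform(0.2, 1.0, size=n_groups)

        # Exponentiated-gradient ascent on the weights: higher-loss groups
        # receive more weight in the next model update.
        weights *= np.exp(eta * group_losses)
        weights /= weights.sum()

        robust_loss = float(weights @ group_losses)
        print(f"step {step}: weights {np.round(weights, 2)}, robust loss {robust_loss:.3f}")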
Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder
Dense retrieval requires high-quality text sequence embeddings to support effective search in the representation space. Autoencoder-based language models are appealing in dense retrieval as they train the encoder to output high-quality emb…
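The architectural trick here is to pair a strong encoder with a deliberately weak decoder that sees the input only through the encoder's single pooled vector; since reconstruction must flow through that bottleneck, the embedding is pushed to carry more of the sequence's information. A minimal PyTorch-style sketch of that wiring, with illustrative layer sizes rather than the paper's configuration:

    import torch
    import torch.nn as nn

    class BottleneckAutoencoder(nn.Module):
        """Strong encoder, weak decoder: the decoder sees ONLY the single
        pooled embedding, so reconstruction pressure lands on the bottleneck."""

        def __init__(self, vocab=1000, dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)   # strong
            self.decoder = nn.GRU(dim, dim, num_layers=1, batch_first=True)  # weak
            self.out = nn.Linear(dim, vocab)

        def forward(self, tokens):                     # tokens: (batch, seq)
            h = self.encoder(self.embed(tokens))
            cls = h[:, :1, :]                          # bottleneck: one vector
            # The weak decoder conditions every step on the bottleneck alone.
            dec_in = cls.expand(-1, tokens.size(1), -1).contiguous()
            dec_out, _ = self.decoder(dec_in)
            return self.out(dec_out)                   # reconstruction logits

    model = BottleneckAutoencoder()
    tokens = torch.randint(0, 1000, (2, 16))
    logits = model(tokens)
    loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), tokens.reshape(-1))
    print(logits.shape, float(loss))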
Less is More: Pre-training a Strong Siamese Encoder Using a Weak Decoder
Many real-world applications use Siamese networks to efficiently match text sequences at scale, which require high-quality sequence encodings. This paper pre-trains language models dedicated to sequence matching in Siamese architectures. W…
Less is More: Pretrain a Strong Siamese Encoder for Dense Text Retrieval Using a Weak Decoder
Shuqi Lu, Di He, Chenyan Xiong, Guolin Ke, Waleed Malik, Zhicheng Dou, Paul Bennett, Tie-Yan Liu, Arnold Overwijk. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
Conducting text retrieval in a dense learned representation space has many intriguing advantages over sparse retrieval. Yet the effectiveness of dense retrieval (DR) often requires combination with sparse retrieval. In this paper, we ident…
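At the core is a contrastive objective that trains the query and document encoders so each positive document outscores sampled negatives; the paper's point is to draw those negatives from an asynchronously refreshed approximate-nearest-neighbor index over the whole corpus rather than from the local batch. A minimal InfoNCE-style loss with explicit hard negatives (encoders and shapes are placeholders):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(q, d_pos, d_neg):
        """InfoNCE over one positive and k hard negatives per query.

        q:     (batch, dim) query embeddings
        d_pos: (batch, dim) positive document embeddings
        d_neg: (batch, k, dim) hard negatives -- ANCE-style training draws
               these from an asynchronously updated ANN index over the corpus.
        """
        pos_scores = (q * d_pos).sum(-1, keepdim=True)      # (batch, 1)
        neg_scores = torch.einsum("bd,bkd->bk", q, d_neg)   # (batch, k)
        logits = torch.cat([pos_scores, neg_scores], dim=1)
        labels = torch.zeros(q.size(0), dtype=torch.long)   # positive is index 0
        return F.cross_entropy(logits, labels)

    q = torch.randn(8, 64)
    loss = contrastive_loss(q, torch.randn(8, 64), torch.randn(8, 4, 64))
    print(float(loss))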
Open Domain Web Keyphrase Extraction Beyond Language Modeling
This paper studies keyphrase extraction in real-world scenarios where documents are from diverse domains and have variant content quality. We curate and release OpenKP, a large scale open domain keyphrase extraction dataset with near one h…
Lee Xiong, Chuan Hu, Chenyan Xiong, Daniel Campos, Arnold Overwijk. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJC…