Or Honovich
Keep Guessing? When Considering Inference Scaling, Mind the Baselines
Scaling inference compute in large language models (LLMs) through repeated sampling consistently increases the coverage (fraction of problems solved) as the number of samples increases. We conjecture that this observed improvement is parti…
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literatu…
Surfacing Biases in Large Language Models using Contrastive Input Decoding
Ensuring that large language models (LMs) are fair, robust and useful requires an understanding of how different modifications to their inputs impact the model's behaviour. In the context of open-text generation tasks, however, such an eva…
DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering
Question answering models commonly have access to two sources of "knowledge" during inference time: (1) parametric knowledge - the factual knowledge encoded in the model weights, and (2) contextual knowledge - external knowledge (e.g., a W…
Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, Omri Abend. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
LMentry: A Language Model Benchmark of Elementary Language Tasks
As the performance of large language models rapidly improves, benchmarks are getting larger and more complex as well. We present LMentry, a benchmark that avoids this "arms race" by focusing on a compact set of tasks that are trivial to hum…
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
Instruction tuning enables pretrained language models to perform new tasks from inference-time natural language descriptions. These approaches rely on vast amounts of human supervision in the form of crowdsourced datasets or user interacti…
Instruction Induction: From Few Examples to Natural Language Task Descriptions
Large language models are able to perform a task by conditioning on a few input-output demonstrations - a paradigm known as in-context learning. We show that language models can explicitly infer an underlying task from a few demonstrations…
TRUE: Re-evaluating Factual Consistency Evaluation
Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cy…
Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, Yossi Matias. Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conver…
Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, Yossi Matias. Proceedings of the 2022 Conference of the North American Chapter of the Association…
$Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering
Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability. Inspired by recent work on evaluating f…
Machine Reading of Historical Events
Machine reading is an ambitious goal in NLP that subsumes a wide range of text understanding capabilities. Within this broad framework, we address the task of machine reading the time of historical events, compile datasets for the task, an…