Gabriel Stanovsky
PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limit…
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
Large language models (LLMs) exhibit cognitive biases -- systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction tu…
ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments…
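As a rough illustration of what a moment-based, stochastic evaluation could look like (a minimal sketch only, not the paper's actual recipe; `model`, `paraphrases`, and `dataset` are hypothetical placeholders):

```python
import random
import statistics

def stochastic_eval(model, paraphrases, dataset, n_samples=20, seed=0):
    """Estimate the first two moments (mean, variance) of accuracy over
    randomly sampled prompt phrasings, rather than a single fixed prompt."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        prompt = rng.choice(paraphrases)       # sample a prompt phrasing
        correct = sum(model(prompt, x) == y for x, y in dataset)
        scores.append(correct / len(dataset))  # accuracy under this phrasing
    return statistics.mean(scores), statistics.variance(scores)
```

Reporting the variance alongside the mean is what distinguishes this from standard single-prompt evaluation: it quantifies how much the score depends on arbitrary phrasing choices.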
Cooking Up Creativity: Enhancing LLM Creativity through Structured Recombination
Large Language Models (LLMs) excel at many tasks, yet they struggle to produce truly creative, diverse ideas. In this paper, we introduce a novel approach that enhances LLM creativity. We apply LLMs to translate between natural language…
Can LLMs Help Uncover Insights about LLMs? A Large-Scale, Evolving Literature Analysis of Frontier LLMs
The surge of LLM studies makes synthesizing their findings challenging. Analysis of experimental results from the literature can uncover important trends across studies, but the time-consuming nature of manual data extraction limits its use. O…
Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer
Prior work on large language model (LLM) hallucinations has associated them with model uncertainty or inaccurate knowledge. In this work, we define and investigate a distinct type of hallucination, where a model can consistently answer a q…
Beyond Benchmarks: On The False Promise of AI Regulation
The performance of AI models on safety benchmarks does not indicate their real-world performance after deployment. This opacity of AI models undermines existing regulatory frameworks that are built on benchmark performance, leaving them incap…
Improving Image Captioning by Mimicking Human Reformulation Feedback at Inference-time
Incorporating automatically predicted human feedback into the process of training generative models has attracted substantial recent interest, while feedback at inference time has received less attention. The typical feedback at training t…
Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends, and Metrics Analysis
The task of image captioning has recently been gaining popularity, and with it the complex task of evaluating the quality of image captioning models. In this work, we present the first survey and taxonomy of over 70 different image caption…
DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation pract…
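To make the "arbitrary prompt dimensions" concrete, here is a minimal sketch of how variants along the dimensions named in the abstract (delimiters, answer enumerators, instruction wording) can be enumerated; the specific values below are invented for illustration and are not DOVE's actual axes:

```python
from itertools import product

# Example values for the prompt dimensions named above; the dataset's
# actual dimensions and value sets are broader than this illustration.
INSTRUCTIONS = ["Answer the following question.", "Pick the best option."]
DELIMITERS = ["\n", " | "]
ENUMERATORS = [("A", "B", "C", "D"), ("1", "2", "3", "4")]

def prompt_variants(question, options):
    """Yield one prompt per combination of arbitrary formatting choices."""
    for instr, delim, enum in product(INSTRUCTIONS, DELIMITERS, ENUMERATORS):
        listed = delim.join(f"{e}. {opt}" for e, opt in zip(enum, options))
        yield f"{instr}{delim}{question}{delim}{listed}"
```

Even this toy grid yields eight distinct renderings of the same underlying question, which is exactly why single-prompt scores can be misleading.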
The State and Fate of Summarization Datasets: A Survey
Automatic summarization has consistently attracted attention due to its versatility and wide application in various downstream tasks. Despite its popularity, we find that annotation efforts have largely been disjointed, and have lacked com…
SAUCE: Synchronous and Asynchronous User-Customizable Environment for Multi-Agent LLM Interaction
Many human interactions, such as political debates, are carried out in group settings, where there are arbitrarily many participants, each with different views and agendas. To explore such complex social settings, we present SAUCE: a custo…
Looking Beyond The Top-1: Transformers Determine Top Tokens In Order
Understanding the inner workings of Transformers is crucial for achieving more accurate and efficient predictions. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed…
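A minimal sketch of how one might locate the layer at which the top-1 prediction becomes fixed, assuming per-layer logits obtained by projecting intermediate hidden states through the unembedding matrix (a logit-lens-style probe; the paper's actual analysis may differ):

```python
import numpy as np

def top1_saturation_layer(layer_logits):
    """Earliest layer from which the argmax token no longer changes.

    layer_logits: (num_layers, vocab_size) array of per-layer logits,
    e.g. hidden states projected through the unembedding matrix.
    """
    top1 = np.argmax(layer_logits, axis=-1)   # top token at each layer
    final = top1[-1]                          # the model's final prediction
    for layer in range(len(top1)):
        if np.all(top1[layer:] == final):     # fixed from here onward
            return layer
    return len(top1) - 1
```

Everything the model computes in the layers after this saturation point is what the paper's analysis is concerned with.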
SEAM: A Stochastic Benchmark for Multi-Document Tasks
Various tasks, such as summarization, multi-hop question answering, or coreference resolution, are naturally phrased over collections of real-world documents. Such tasks present a unique set of challenges, revolving around the lack of cohe…
In-Context Learning on a Budget: A Case Study in Token Classification
Few-shot in-context learning (ICL) typically assumes access to large annotated training sets. However, in many real-world scenarios, such as domain adaptation, there is only a limited budget to annotate a small number of samples, with the …
Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation
Most works on gender bias focus on intrinsic bias -- removing traces of information about a protected group from the model's internal representation. However, these works are often disconnected from the impact of such debiasing on downstre…
A Nurse is Blue and Elephant is Rugby: Cross Domain Alignment in Large Language Models Reveal Human-like Patterns
Cross-domain alignment refers to the task of mapping a concept from one domain to another. For example, "If a doctor were a color, what color would it be?". This seemingly peculiar task is designed to investigate how pe…
Computation or Weight Adaptation? Rethinking the Role of Plasticity in Learning
The human brain is an adaptive learning system that can generalize to new tasks and unfamiliar environments. The traditional view is that such adaptive behavior requires a structural change of the learning system (e.g., via neural plastici…
Do Zombies Understand? A Choose-Your-Own-Adventure Exploration of Machine Cognition
Recent advances in LLMs have sparked a debate on whether they understand text. In this position paper, we argue that opponents in this debate hold different definitions for understanding, and particularly differ in their view on the role o…
Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction
Document collections of various domains, e.g., legal, medical, or financial, often share some underlying collection-wide structure, which captures information that can aid both human users and structure-aware models. We propose to identify…
K-QA: A Real-World Medical Q&A Benchmark
Ensuring the accuracy of responses provided by large language models (LLMs) is crucial, particularly in clinical settings where incorrect information may directly impact patient health. To address this challenge, we construct K-QA, a datas…
State of What Art? A Call for Multi-Prompt LLM Evaluation
Recent advances in LLMs have led to an abundance of evaluation benchmarks, which typically rely on a single instruction template per task. We create a large-scale collection of instruction paraphrases and comprehensively analyze the brittl…