Evan Frick
Prompt-to-Leaderboard
Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To addres…
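As a rough illustration of the averaging effect described above, the toy Python sketch below uses made-up prompt categories and battle outcomes (hypothetical, not data from the paper) to show how two models with identical aggregate win rates can still perform very differently on specific kinds of prompts:

```python
from collections import Counter, defaultdict

# Hypothetical battle outcomes: (prompt_category, winner).
# These numbers are fabricated purely to illustrate the averaging effect
# described in the abstract above; they are not data from the paper.
battles = [
    ("coding", "model_a"), ("coding", "model_a"),
    ("coding", "model_a"), ("coding", "model_b"),
    ("writing", "model_b"), ("writing", "model_b"),
    ("writing", "model_b"), ("writing", "model_a"),
]

overall = Counter(winner for _, winner in battles)
per_category = defaultdict(Counter)
for category, winner in battles:
    per_category[category][winner] += 1

total = sum(overall.values())
print("aggregate:", {m: f"{c / total:.0%}" for m, c in overall.items()})
# Both models win 50% of battles overall, yet each dominates a different
# prompt category -- information a single averaged score hides.
for category, counts in per_category.items():
    n = sum(counts.values())
    print(category + ":", {m: f"{c / n:.0%}" for m, c in counts.items()})
```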
How to Evaluate Reward Models for RLHF
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback). The gold-standard approach is to run a full RLHF training pipeline an…
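For context on what evaluating a reward model can mean in practice, the sketch below shows one common, inexpensive proxy: accuracy at ranking human-preferred responses above rejected ones. This is a generic illustration with hypothetical placeholders (the `reward_model` callable and the toy pairs), not the benchmark proposed in the paper:

```python
from typing import Callable, List, Tuple

def preference_accuracy(
    reward_model: Callable[[str, str], float],
    pairs: List[Tuple[str, str, str]],  # (prompt, chosen, rejected)
) -> float:
    """Fraction of pairs where the human-chosen response scores higher."""
    correct = sum(
        reward_model(prompt, chosen) > reward_model(prompt, rejected)
        for prompt, chosen, rejected in pairs
    )
    return correct / len(pairs)

if __name__ == "__main__":
    # Hypothetical preference pairs and a trivial stand-in reward model
    # (response length), used only to make the sketch runnable.
    toy_pairs = [
        ("Explain RLHF.", "RLHF fine-tunes a model against a learned reward.", "idk"),
        ("Summarize the paper.", "It benchmarks reward models.", "no"),
    ]
    dummy_rm = lambda prompt, response: float(len(response))
    print(f"preference accuracy: {preference_accuracy(dummy_rm, toy_pairs):.2f}")
```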
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned ben…