Florian E. Dorner
Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often inte…
ROC-n-reroll: How verifier imperfection affects test-time scaling
Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier …
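As a rough illustration of the two procedures the abstract names, Best-of-N (BoN) and Rejection Sampling (RS) with an imperfect verifier, here is a minimal sketch. The function names, the toy generator, and the noisy scoring rule are all hypothetical stand-ins, not the paper's actual setup.

```python
import random

def best_of_n(prompt, generate, verifier_score, n=8):
    # Best-of-N: draw n candidates and keep the one the (imperfect)
    # verifier scores highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)

def rejection_sample(prompt, generate, verifier_accepts, max_tries=64):
    # Rejection sampling: resample until the verifier accepts a
    # candidate, or give up after max_tries draws.
    for _ in range(max_tries):
        candidate = generate(prompt)
        if verifier_accepts(candidate):
            return candidate
    return None  # budget exhausted without an accepted candidate

if __name__ == "__main__":
    # Hypothetical toy stand-ins: the "model" guesses integers and the
    # noisy verifier prefers guesses close to a hidden target.
    target = 7
    generate = lambda _prompt: random.randint(0, 10)
    verifier_score = lambda x: -abs(x - target) + random.gauss(0, 1)
    verifier_accepts = lambda x: verifier_score(x) > -1.0
    print(best_of_n("2+5=?", generate, verifier_score, n=16))
    print(rejection_sample("2+5=?", generate, verifier_accepts))
```

The verifier noise term is what makes the selection imperfect: with more compute (larger n or more tries) the selected answer is only as good as the verifier's ability to rank candidates.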
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use s…
Training on the Test Task Confounds Evaluation and Emergence
We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a …
Whose Preferences? Differences in Fairness Preferences and Their Impact on the Fairness of AI Utilizing Human Feedback
There is a growing body of work on learning from human feedback to align various aspects of machine learning systems with human values and preferences. We consider the setting of fairness in content moderation, in which human feedback is u…
Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget
We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. …
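A purely illustrative sketch of the budget trade-off the abstract describes follows; it is not the paper's analysis, and the noise model and all parameter values are assumptions. With a fixed labeling budget, collecting more labels per point yields cleaner majority-vote labels but covers fewer data points.

```python
import random
from collections import Counter

def majority_vote(labels):
    # Aggregate noisy binary labels for a single point; use an odd
    # number of labels to avoid ties.
    return Counter(labels).most_common(1)[0][0]

def label_accuracy(budget=3000, labels_per_point=3, noise=0.2, seed=0):
    # Toy simulation: with a fixed labeling budget, more labels per
    # point gives cleaner aggregated labels but covers fewer points.
    rng = random.Random(seed)
    n_points = budget // labels_per_point
    correct = 0
    for _ in range(n_points):
        true_label = rng.randint(0, 1)
        votes = [true_label if rng.random() > noise else 1 - true_label
                 for _ in range(labels_per_point)]
        correct += majority_vote(votes) == true_label
    return n_points, correct / n_points

if __name__ == "__main__":
    for k in (1, 3, 5):  # labels collected per data point
        n, acc = label_accuracy(labels_per_point=k)
        print(f"{k} labels/point: {n} points, label accuracy {acc:.3f}")
```

The paper's actual question is which allocation best distinguishes the accuracies of two classifiers; this toy only exposes the underlying coverage-versus-label-noise trade-off.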
Challenging the Validity of Personality Tests for Large Language Models
With large language models (LLMs) like GPT-4 appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate personality traits of LLMs using questionnaires originally developed for huma…
Incentivizing Honesty among Competitors in Collaborative Learning and Optimization
Collaborative learning techniques have the potential to enable training machine learning models that are superior to models trained on a single entity's data. However, in many cases, potential participants in such collaborative schemes are…
Human-Guided Fair Classification for Natural Language Processing
Text classifiers have promising applications in high-stakes tasks such as resume screening and content moderation. These classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes…
Algorithmic collusion: A critical review
The prospect of collusive agreements being stabilized via the use of pricing algorithms is widely discussed by antitrust experts and economists. However, the literature often lacks the perspective of computer scientists and seems to …
Measuring Progress in Deep Reinforcement Learning Sample Efficiency
Sampled environment transitions are a critical input to deep reinforcement learning (DRL) algorithms. Current DRL benchmarks often allow for the cheap and easy generation of large amounts of samples such that perceived progress in DRL does…