Florian E. Dorner
Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often inte…
ROC-n-reroll: How verifier imperfection affects test-time scaling
Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier …
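As a rough illustration of the two procedures the abstract names, Best-of-N (BoN) and Rejection Sampling (RS) with an imperfect verifier, here is a minimal sketch. The function names, the toy generator, and the noisy scoring rule are all hypothetical stand-ins, not the paper's actual setup.

```python
import random

def best_of_n(prompt, generate, verifier_score, n=8):
    # Best-of-N: draw n candidates and keep the one the (imperfect)
    # verifier scores highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)

def rejection_sample(prompt, generate, verifier_accepts, max_tries=64):
    # Rejection sampling: resample until the verifier accepts a
    # candidate, or give up after max_tries draws.
    for _ in range(max_tries):
        candidate = generate(prompt)
        if verifier_accepts(candidate):
            return candidate
    return None  # budget exhausted without an accepted candidate

if __name__ == "__main__":
    # Hypothetical toy stand-ins: the "model" guesses integers and the
    # noisy verifier prefers guesses close to a hidden target.
    target = 7
    generate = lambda _prompt: random.randint(0, 10)
    verifier_score = lambda x: -abs(x - target) + random.gauss(0, 1)
    verifier_accepts = lambda x: verifier_score(x) > -1.0
    print(best_of_n("2+5=?", generate, verifier_score, n=16))
    print(rejection_sample("2+5=?", generate, verifier_accepts))
```

The verifier noise term is what makes the selection imperfect: with more compute (larger n or more tries) the selected answer is only as good as the verifier's ability to rank candidates.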
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use s…
Training on the Test Task Confounds Evaluation and Emergence
We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a …
Whose Preferences? Differences in Fairness Preferences and Their Impact on the Fairness of AI Utilizing Human Feedback
There is a growing body of work on learning from human feedback to align various aspects of machine learning systems with human values and preferences. We consider the setting of fairness in content moderation, in which human feedback is u…
Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget
We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. …
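A purely illustrative sketch of the budget trade-off the abstract describes follows; it is not the paper's analysis, and the noise model and all parameter values are assumptions. With a fixed labeling budget, collecting more labels per point yields cleaner majority-vote labels but covers fewer data points.

```python
import random
from collections import Counter

def majority_vote(labels):
    # Aggregate noisy binary labels for a single point; use an odd
    # number of labels to avoid ties.
    return Counter(labels).most_common(1)[0][0]

def label_accuracy(budget=3000, labels_per_point=3, noise=0.2, seed=0):
    # Toy simulation: with a fixed labeling budget, more labels per
    # point gives cleaner aggregated labels but covers fewer points.
    rng = random.Random(seed)
    n_points = budget // labels_per_point
    correct = 0
    for _ in range(n_points):
        true_label = rng.randint(0, 1)
        votes = [true_label if rng.random() > noise else 1 - true_label
                 for _ in range(labels_per_point)]
        correct += majority_vote(votes) == true_label
    return n_points, correct / n_points

if __name__ == "__main__":
    for k in (1, 3, 5):  # labels collected per data point
        n, acc = label_accuracy(labels_per_point=k)
        print(f"{k} labels/point: {n} points, label accuracy {acc:.3f}")
```

The paper's actual question is which allocation best distinguishes the accuracies of two classifiers; this toy only exposes the underlying coverage-versus-label-noise trade-off.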
Challenging the Validity of Personality Tests for Large Language Models
With large language models (LLMs) like GPT-4 appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate personality traits of LLMs using questionnaires originally developed for huma…
Incentivizing Honesty among Competitors in Collaborative Learning and Optimization
Collaborative learning techniques have the potential to enable training machine learning models that are superior to models trained on a single entity's data. However, in many cases, potential participants in such collaborative schemes are…
Human-Guided Fair Classification for Natural Language Processing
Text classifiers have promising applications in high-stakes tasks such as resume screening and content moderation. These classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes…
Algorithmic collusion: A critical review
The prospect of collusive agreements being stabilized via the use of pricing algorithms is widely discussed by antitrust experts and economists. However, the literature often lacks the perspective of computer scientists and seems to …
Measuring Progress in Deep Reinforcement Learning Sample Efficiency
Sampled environment transitions are a critical input to deep reinforcement learning (DRL) algorithms. Current DRL benchmarks often allow for the cheap and easy generation of large amounts of samples such that perceived progress in DRL does…