Moritz Hardt
Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning
Humans are good at learning on the job: We learn how to solve the tasks we face as we go along. Can a model do the same? We propose an agent that assembles a task-specific curriculum, called test-time curriculum (TTC-RL), and applies reinf…
Train-before-Test Harmonizes Language Model Rankings
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to …
Policy Design in Long-Run Welfare Dynamics
Improving social welfare is a complex challenge requiring policymakers to optimize objectives across multiple time horizons. Evaluating the impact of such policies presents a fundamental challenge, as those that appear suboptimal in the sh…
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use s…
Decline Now: A Combinatorial Model for Algorithmic Collective Action
Drivers on food delivery platforms often run a loss on low-paying orders. In response, workers on DoorDash started a campaign, #DeclineNow, to purposefully decline orders below a certain pay threshold. For each declined order, the platform…
Lawma: The Power of Specialization for Legal Annotation
Annotation and classification of legal text are central components of empirical legal research. Traditionally, these tasks are often delegated to trained research assistants. Motivated by the advances in language modeling, empirical legal …
Evaluating language models as risk scores
Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate…
Training on the Test Task Confounds Evaluation and Emergence
We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a …
Limits to Predicting Online Speech Using Large Language Models
We study the predictability of online speech on social media, and whether predictability improves with information outside a user's own posts. Recent theoretical results suggest that posts from a user's social circle are as predictive of t…
Allocation Requires Prediction Only if Inequality Is Low
Algorithmic predictions are emerging as a promising solution concept for efficiently allocating societal resources. Fueling their use is an underlying assumption that such systems are necessary to identify individuals for interventions. We…
Causal Inference from Competing Treatments
Many applications of RCTs involve the presence of multiple treatment administrators -- from field experiments to online advertising -- that compete for the subjects' attention. In the face of competition, estimating a causal effect becomes…
An engine not a camera: Measuring performative power of online search
The power of digital platforms is at the center of major ongoing policy and regulatory efforts. To advance existing debates, we designed and executed an experiment to measure the performative power of online search providers. Instantiated …
Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction betwe…
ImageNot: A contrast with ImageNet preserves model rankings
We introduce ImageNot, a dataset constructed explicitly to be drastically different than ImageNet while matching its scale. ImageNot is designed to test the external validity of deep learning progress on ImageNet. We show that key model ar…
Do causal predictors generalize better to new domains?
We study how well machine learning models trained on causal features generalize across domains. We consider 16 prediction tasks on tabular datasets covering applications in health, employment, education, social benefits, and politics. Each…
Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget
We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. …
Is Your Model Predicting the Past?
When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, …
Performative Prediction: Past and Future
Predictions in the social world generally influence the target of prediction, a phenomenon known as performativity. Self-fulfilling and self-negating predictions are examples of performativity. Of fundamental importance to economics, finan…
What Makes ImageNet Look Unlike LAION
ImageNet was famously created from Flickr image search results. What if we recreated ImageNet instead by searching the massive LAION dataset based on image captions alone? In this work, we carry out this counterfactual investigation. We fi…
Questioning the Survey Responses of Large Language Models
Surveys have recently gained popularity as a tool to study large language models. By comparing survey responses of models to those of human reference populations, researchers aim to infer the demographics, political opinions, or values bes…
Unprocessing Seven Years of Algorithmic Fairness
Seven years ago, researchers proposed a postprocessing method to equalize the error rates of a model across different demographic groups. The work launched hundreds of papers purporting to improve over the postprocessing baseline. We empir…
Test-Time Training on Nearest Neighbors for Large Language Models
Many recent efforts augment language models with retrieval, by adding retrieved data to the input context. For this approach to succeed, the retrieved data must be added at both training and test time. Moreover, as input length grows linea…
Difficult Lessons on Social Prediction from Wisconsin Public Schools
Early warning systems (EWS) are predictive tools at the center of recent efforts to improve graduation rates in public schools across the United States. These systems assist in targeting interventions to individual students by predicting w…
Causal Inference out of Control: Estimating the Steerability of Consumption
Regulators and academics are increasingly interested in the causal effect that algorithmic actions of a digital platform have on consumption. We introduce a general causal inference problem we call the steerability of consumption that abst…
Algorithmic Collective Action in Machine Learning
We initiate a principled study of algorithmic collective action on digital platforms that deploy machine learning algorithms. We propose a simple theoretical model of a collective interacting with a firm's learning algorithm. The collectiv…
County-level Algorithmic Audit of Racial Bias in Twitter's Home Timeline
We report on the outcome of an audit of Twitter's Home Timeline ranking system. The goal of the audit was to determine if authors from some racial groups experience systematically higher impression counts for their Tweets than others. A ce…
A Theory of Dynamic Benchmarks
Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. In contrast to an extensive theoretical and empirical study of the static setting, the dynamic counterpart lags…
Is your model predicting the past?
When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, …
Adversarial Scrutiny of Evidentiary Statistical Software
The U.S. criminal legal system increasingly relies on software output to convict and incarcerate people. In a large number of cases each year, the government makes these consequential decisions based on evidence from statistical software -…