Tom Everitt
The Limits of Predicting Agents from Behaviour
As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour is important for safely deploying AI. For agents, the most natural abstractions for predicting behaviour attribute…
Evaluating the Goal-Directedness of Large Language Models
To what extent do LLMs use their capabilities towards their given goal? We take this as a measure of their goal-directedness. We evaluate goal-directedness on tasks that require information gathering, cognitive effort, and plan execution, …
Measuring Goal-Directedness
We define maximum entropy goal-directedness (MEG), a formal measure of goal-directedness in causal models and Markov decision processes, and give algorithms for computing it. Measuring goal-directedness is important, as it is a critical el…
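The truncated abstract does not show how MEG is computed, but the underlying idea can be illustrated for a single decision: score a candidate utility function by how much better a maximum-entropy (softmax) maximiser of that utility predicts the observed behaviour than a uniform baseline. The sketch below is only that illustration, with made-up names; the paper's algorithms handle causal models and MDPs via maximum causal entropy.

```python
import numpy as np

def meg_sketch(observed_policy, utility, betas=np.linspace(0.0, 10.0, 101)):
    """Toy goal-directedness score for a one-step decision (illustrative only).

    observed_policy: probabilities with which the agent takes each action.
    utility: candidate utility for each action.
    Returns the largest cross-entropy reduction (in nats) from modelling the
    agent as a softmax maximiser of `utility` instead of a uniform actor.
    """
    uniform = np.full(len(observed_policy), 1.0 / len(observed_policy))
    best = 0.0
    for beta in betas:  # beta is the rationality (inverse temperature) parameter
        softmax = np.exp(beta * utility)
        softmax /= softmax.sum()
        gain = float(np.sum(observed_policy * (np.log(softmax) - np.log(uniform))))
        best = max(best, gain)
    return best

# An agent that mostly takes the highest-utility action looks goal-directed...
print(meg_sketch(np.array([0.9, 0.05, 0.05]), np.array([1.0, 0.0, 0.0])))
# ...while a uniformly random agent scores zero.
print(meg_sketch(np.array([1/3, 1/3, 1/3]), np.array([1.0, 0.0, 0.0])))
```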
A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI
Recent generative AI systems have demonstrated more advanced persuasive capabilities and are increasingly permeating areas of life where they can influence decision-making. Generative AI presents a new risk profile of persuasion due to the op…
Reasoning about Causality in Games (Abstract Reprint)
Causal reasoning and game-theoretic reasoning are fundamental topics in artificial intelligence, among many other disciplines: this paper is concerned with their intersection. Despite their importance, a formal framework that supports both…
Discovering Agents (Abstract Reprint)
Causal models of agents have been used to analyse the safety aspects of machine learning systems. But identifying agents is non-trivial – often the causal model is just assumed by the modeller without much justification – and modelling fai…
Robust agents learn causal world models
It has long been hypothesised that causal reasoning plays a fundamental role in robust and general intelligence. However, it is not known if agents must learn causal models in order to generalise to new domains, or if other inductive biase…
The Reasons that Agents Act: Intention and Instrumental Goals
Intention is an important and challenging concept in AI. It is important because it underlies many other concepts we care about, such as agency, manipulation, legal responsibility, and blame. However, ascribing intent to AI systems is cont…
Honesty Is the Best Policy: Defining and Mitigating AI Deception
Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the …
Characterising Decision Theories with Mechanised Causal Graphs
How should my own decisions affect my beliefs about the outcomes I expect to achieve? If taking a certain action makes me view myself as a certain type of person, it might affect how I think others view me, and how I view others who are si…
Discovering agents
Causal models of agents have been used to analyse the safety aspects of machine learning systems. But identifying agents is non-trivial – often the causal model is just assumed by the modeller without much justification – and modelling fai…
Human Control: Definitions and Algorithms
How can humans stay in control of advanced artificial intelligence systems? One proposal is corrigibility, which requires the agent to follow the instructions of a human overseer, without inappropriately influencing them. In this paper, we…
Reasoning about causality in games
Causal reasoning and game-theoretic reasoning are fundamental topics in artificial intelligence, among many other disciplines: this paper is concerned with their intersection. Despite their importance, a formal framework that supports both…
Path-Specific Objectives for Safer Agent Incentives
We present a general framework for training safe agents whose naive incentives are unsafe. As an example, manipulative or deceptive behaviour can improve rewards but should be avoided. Most approaches fail here: agents maximize expected re…
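One way to write the core idea, in notation of my own choosing rather than the paper's: let the agent's policy $\pi$ affect the reward $R$ both directly and through a delicate variable $M$ (for instance, the user's beliefs). A path-specific objective evaluates $M$ at the value it would have taken under a fixed reference policy $\pi_0$,

$$J_{\mathrm{ps}}(\pi) \;=\; \mathbb{E}_{\pi}\big[\, R_{M(\pi_0)} \,\big],$$

so effects that flow through $M$ earn the learned policy no reward, removing the incentive to manipulate $M$.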
Why Fair Labels Can Yield Unfair Predictions: Graphical Conditions for Introduced Unfairness
In addition to reproducing discriminatory relationships in the training data, machine learning (ML) systems can also introduce or amplify discriminatory effects. We refer to this as introduced unfairness, and investigate the conditions und…
A Complete Criterion for Value of Information in Soluble Influence Diagrams
Influence diagrams have recently been used to analyse the safety and fairness properties of AI systems. A key building block for this analysis is a graphical criterion for value of information (VoI). This paper establishes the first comple…
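For context, the quantity being characterised, in my paraphrase of the standard influence-diagram definition: the value of information of a chance node $X$ for a decision $D$ is the gain in attainable expected utility $\mathcal{U}$ from adding the observation link $X \to D$,

$$\mathrm{VoI}_D(X) \;=\; \max_{\pi \in \Pi(\mathcal{M}_{X \to D})} \mathbb{E}_{\pi}[\mathcal{U}] \;-\; \max_{\pi \in \Pi(\mathcal{M})} \mathbb{E}_{\pi}[\mathcal{U}].$$

The paper's graphical criterion identifies, for soluble diagrams, exactly the nodes $X$ for which this difference can be positive under some parameterisation.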
Classification by decomposition: a novel approach to classification of symmetric $2\times 2$ games
Shaking the foundations: delusions in sequence models for interaction and control
The recent phenomenal success of language models has reinvigorated machine learning research, and large sequence models such as transformers are being applied to a variety of domains. One important problem class that has remained relativel…
Reward tampering problems and solutions in reinforcement learning: a causal influence diagram perspective
Agent Incentives: A Causal Perspective
We present a framework for analysing agent incentives using causal influence diagrams. We establish that a well-known criterion for value of information is complete. We propose a new graphical criterion for value of control, establishing i…
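In the same spirit as the value-of-information expression above (again my paraphrase, not necessarily the paper's notation): value of control measures the gain in attainable expected utility if the agent could set $X$ itself, i.e. if $X$ were converted into an additional decision node,

$$\mathrm{VoC}(X) \;=\; \max_{\pi \in \Pi(\mathcal{M}_{X\,\mathrm{dec}})} \mathbb{E}_{\pi}[\mathcal{U}] \;-\; \max_{\pi \in \Pi(\mathcal{M})} \mathbb{E}_{\pi}[\mathcal{U}],$$

where $\mathcal{M}_{X\,\mathrm{dec}}$ denotes the diagram with $X$ turned into a decision.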
How RL Agents Behave When Their Actions Are Modified
Reinforcement learning in complex environments may require supervision to prevent the agent from attempting dangerous actions. As a result of supervisor intervention, the executed action may differ from the action specified by the policy. …
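A toy sketch of the setting, with illustrative names only (this wrapper is not from the paper, and the `env.step(action) -> (obs, reward, done)` interface is an assumption): the executed action can differ from the action the policy specified because a supervisor sometimes overrides it.

```python
import random

class ActionOverrideWrapper:
    """Illustrative environment wrapper: a supervisor overrides the policy's
    chosen action with probability `override_prob`, so the executed action
    can differ from the specified one. Names and interface are assumptions."""

    def __init__(self, env, override_prob=0.1, safe_action=0):
        self.env = env
        self.override_prob = override_prob
        self.safe_action = safe_action

    def step(self, policy_action):
        # With some probability the supervisor substitutes a safe action.
        if random.random() < self.override_prob:
            executed = self.safe_action
        else:
            executed = policy_action
        obs, reward, done = self.env.step(executed)
        # Whether the learner treats `policy_action` or `executed` as the
        # action it took determines how it responds to the interventions.
        return obs, reward, done, executed
```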
Alignment of Language Agents
For artificial intelligence to be beneficial to humans, the behaviour of AI agents needs to be aligned with what humans want. In this paper we discuss some behavioural issues for language agents, arising from accidental misspecification by …
Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice
Multi-agent influence diagrams (MAIDs) are a popular form of graphical model that, for certain classes of games, have been shown to offer key complexity and explainability advantages over traditional extensive form game (EFG) representatio…