Luke Marks
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in …
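As a rough illustration of the linear steering baseline the abstract contrasts with, the sketch below adds a fixed weighted sum of steering vectors to a transformer layer's activations at inference time; this is the standard additive intervention, not the paper's unified multi-attribute method, and the layer handle, vectors, and coefficients are hypothetical placeholders.

```python
# Minimal sketch of additive linear steering, assuming a PyTorch transformer
# block (model_layer) and precomputed steering vectors of shape [d_model].
import torch

def add_steering_hook(model_layer, steering_vectors, coefficients):
    """Register a forward hook that adds a weighted sum of steering vectors
    to the layer's output activations -- the additive assumption that can
    break down when attributes interfere."""
    delta = sum(c * v for c, v in zip(coefficients, steering_vectors))

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + delta  # same offset applied at every position
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return model_layer.register_forward_hook(hook)
```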
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
Sparse Autoencoders (SAEs) have shown promise in improving the interpretability of neural network activations, but can learn features that are not features of the input, limiting their effectiveness. We propose Mutual Feature Regul…
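For context, the generic sparse autoencoder setup the abstract builds on looks roughly like the sketch below: an overcomplete dictionary trained with a reconstruction loss plus an L1 sparsity penalty on feature activations. This is not the proposed regularization scheme; dimensions and the penalty weight are illustrative.

```python
# Minimal sparse autoencoder over model activations, as commonly used in
# interpretability work (generic setup, not the paper's method).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # overcomplete: d_hidden > d_model
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # reconstruction error + L1 penalty encouraging sparse feature activations
    mse = torch.mean((reconstruction - activations) ** 2)
    return mse + l1_coeff * features.abs().mean()
```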
Informal Safety Guarantees for Simulated Optimizers Through Extrapolation from Partial Simulations
Self-supervised learning is the backbone of state-of-the-art language modeling. It has been argued that training with predictive loss on a self-supervised dataset causes simulators: entities that internally represent possible configuration…
Interpreting Learned Feedback Patterns in Large Language Models
Reinforcement learning from human feedback (RLHF) is widely used to train large language models (LLMs). However, it is unclear whether LLMs accurately learn the underlying preferences in human feedback data. We coin the term Learne…
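As background on the feedback signal in question, a standard RLHF pipeline fits a reward model to pairwise human preferences with a Bradley-Terry style loss, sketched below. This illustrates the training signal whose learned patterns the paper studies; it is not the paper's analysis method, and the inputs are assumed to be scalar rewards for chosen and rejected completions.

```python
# Sketch of the reward-modeling objective used in a typical RLHF pipeline
# (generic background, assuming scalar reward outputs per completion).
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor):
    """Negative log-likelihood that the chosen completion outranks the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```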