Luke Marks
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in …
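As a rough illustration of the linear steering baseline the abstract contrasts with, the sketch below adds a fixed weighted sum of steering vectors to a transformer layer's activations at inference time; this is the standard additive intervention, not the paper's unified multi-attribute method, and the layer handle, vectors, and coefficients are hypothetical placeholders.

```python
# Minimal sketch of additive linear steering, assuming a PyTorch transformer
# block (model_layer) and precomputed steering vectors of shape [d_model].
import torch

def add_steering_hook(model_layer, steering_vectors, coefficients):
    """Register a forward hook that adds a weighted sum of steering vectors
    to the layer's output activations -- the additive assumption that can
    break down when attributes interfere."""
    delta = sum(c * v for c, v in zip(coefficients, steering_vectors))

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + delta  # same offset applied at every position
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return model_layer.register_forward_hook(hook)
```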
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
Sparse Autoencoders (SAEs) have shown promise in improving the interpretability of neural network activations, but can learn features that are not features of the input, limiting their effectiveness. We propose Mutual Feature Regul…
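For context, the generic sparse autoencoder setup the abstract builds on looks roughly like the sketch below: an overcomplete dictionary trained with a reconstruction loss plus an L1 sparsity penalty on feature activations. This is not the proposed regularization scheme; dimensions and the penalty weight are illustrative.

```python
# Minimal sparse autoencoder over model activations, as commonly used in
# interpretability work (generic setup, not the paper's method).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # overcomplete: d_hidden > d_model
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # reconstruction error + L1 penalty encouraging sparse feature activations
    mse = torch.mean((reconstruction - activations) ** 2)
    return mse + l1_coeff * features.abs().mean()
```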
Informal Safety Guarantees for Simulated Optimizers Through Extrapolation from Partial Simulations
Self-supervised learning is the backbone of state-of-the-art language modeling. It has been argued that training with predictive loss on a self-supervised dataset causes simulators: entities that internally represent possible configuration…
Interpreting Learned Feedback Patterns in Large Language Models
Reinforcement learning from human feedback (RLHF) is widely used to train large language models (LLMs). However, it is unclear whether LLMs accurately learn the underlying preferences in human feedback data. We coin the term Learne…
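As background on the feedback signal in question, a standard RLHF pipeline fits a reward model to pairwise human preferences with a Bradley-Terry style loss, sketched below. This illustrates the training signal whose learned patterns the paper studies; it is not the paper's analysis method, and the inputs are assumed to be scalar rewards for chosen and rejected completions.

```python
# Sketch of the reward-modeling objective used in a typical RLHF pipeline
# (generic background, assuming scalar reward outputs per completion).
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor):
    """Negative log-likelihood that the chosen completion outranks the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```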