Fazl Barez
VAL-Bench: Measuring Value Alignment in Language Models
Large language models (LLMs) are increasingly used for tasks where outputs shape human decisions, so it is critical to test whether their responses reflect consistent human values. Existing benchmarks mostly track refusals or predefined sa…
Query Circuits: Explaining How Language Models Answer User Prompts
Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but not why the model answers a specific inp…
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
Fine-tuning language models is commonly believed to inevitably harm their safety, i.e., their ability to refuse harmful user requests, even when harmless datasets are used, thus requiring additional safety measures. We challenge this belief th…
The Singapore Consensus on Global AI Safety Research Priorities
Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is ther…
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in …
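For context on the baseline this abstract critiques, linear steering typically adds a fixed direction to a model's hidden activations at inference time, so multiple attributes are implicitly assumed to compose additively. A minimal PyTorch sketch of that baseline, not of the paper's method; the module, vectors, and coefficients below are illustrative placeholders:

```python
import torch

def add_steering_hook(module, steering_vectors, coefficients):
    """Register a forward hook that adds a weighted sum of steering vectors
    to the module's output activations (the standard linear-steering baseline)."""
    direction = sum(c * v for c, v in zip(coefficients, steering_vectors))

    def hook(_module, _inputs, output):
        # Assumes `output` is a [batch, seq, hidden] activation tensor.
        return output + direction

    return module.register_forward_hook(hook)

# Usage sketch: "steer" two attributes at once by summing their directions.
hidden = 16
layer = torch.nn.Linear(hidden, hidden)              # stand-in for a transformer block
vecs = [torch.randn(hidden), torch.randn(hidden)]    # one direction per attribute
handle = add_steering_hook(layer, vecs, coefficients=[0.8, -0.5])
out = layer(torch.randn(2, 4, hidden))               # activations shifted by the combined direction
handle.remove()
```

The additivity assumption is exactly where interference between attributes can arise, which is the limitation the paper targets.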
Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Language Models
This position paper argues that the prevailing trajectory toward ever larger, more expensive generalist foundation models controlled by a handful of companies limits innovation and constrains progress. We challenge this approach by advocat…
Precise In-Parameter Concept Erasure in Large Language Models
Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning,…
SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors
High-risk industries such as nuclear power and aviation use real-time monitoring to detect dangerous system conditions. Large language models (LLMs) similarly need monitoring safeguards. We propose a real-time framework to predict harmful AI outpu…
In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?
International cooperation is common in AI research, including between geopolitical rivals. While many experts advocate for greater international cooperation on AI safety to address shared global risks, some view cooperation on AI with susp…
Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness
Insensitivity to meaning-preserving variations of prompts (paraphrases) is crucial for reliable behavior and real-world deployment of large language models. However, language models exhibit significant performance degradation when fac…
Do Sparse Autoencoders Generalize? A Case Study of Answerability
Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features across d…
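For readers unfamiliar with the method being tested: a sparse autoencoder reconstructs model activations through an overcomplete, sparsity-penalized bottleneck, and the paper asks whether the resulting features transfer across distributions. A minimal PyTorch sketch of the standard SAE setup, with illustrative dimensions and penalty weight rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder trained to reconstruct activations with sparse codes."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # d_features >> d_model (overcomplete)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse, interpretable features.
    return ((x - reconstruction) ** 2).mean() + l1_coeff * features.abs().mean()
```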
Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer
Prior work on large language model (LLM) hallucinations has associated them with model uncertainty or inaccurate knowledge. In this work, we define and investigate a distinct type of hallucination, where a model can consistently answer a q…
Open Problems in Machine Unlearning for AI Safety
As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlea…
Best-of-N Jailbreaking
We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such …
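The abstract describes BoN's core loop: repeatedly apply random augmentations to a prompt and query the target model until a harmful response is elicited or the sample budget N is exhausted. A minimal Python sketch of that loop under those assumptions; `query_model` and `is_harmful` are hypothetical stand-ins for the target system and a success classifier, and the specific augmentations shown are illustrative, not the paper's exact set:

```python
import random

def augment(prompt: str) -> str:
    """Apply simple text-level augmentations: adjacent-character swaps and random capitalization."""
    chars = list(prompt)
    if len(chars) > 1:
        for _ in range(max(1, len(chars) // 20)):      # scramble a few adjacent characters
            i = random.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(c.upper() if random.random() < 0.3 else c for c in chars)

def bon_jailbreak(prompt: str, n: int, query_model, is_harmful):
    """Sample up to n augmented prompts; return the first that elicits a harmful response."""
    for _ in range(n):
        candidate = augment(prompt)
        response = query_model(candidate)
        if is_harmful(response):
            return candidate, response
    return None, None
```

The method is black-box in the sense that it only needs the ability to sample responses, not access to model weights or gradients.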
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid …
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
Sparse Autoencoders (SAEs) have shown promise in improving the interpretability of neural network activations, but can learn features that are not features of the input, limiting their effectiveness. We propose Mutual Feature Regul…
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' suscep…
Towards Interpreting Visual Information Processing in Vision-Language Models
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the lo…
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
The Universality Hypothesis in large language models (LLMs) claims that different models converge towards similar concept representations in their latent spaces. Providing evidence for this hypothesis would enable researchers to exploit un…
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophistic…
Risks and Opportunities of Open-Source Generative AI
Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about the potential risks…
Visualizing Neural Network Imagination
In certain situations, neural networks will represent environment states in their hidden activations. Our goal is to visualize what environment states the networks are representing. We experiment with a recurrent neural network (RNN) archi…
Near to Mid-term Risks and Opportunities of Open-Source Generative AI
In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about poten…
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Understanding the inner workings of large language models (LLMs) is crucial for advancing their theoretical foundations and real-world applications. While the attention mechanism and multi-layer perceptrons (MLPs) have been studied indepen…
Understanding Addition and Subtraction in Transformers
Transformers are widely deployed in large language models (LLMs), yet most models still fail on basic arithmetic tasks such as multidigit addition. In contrast, we show that small transformers trained from scratch can solve n-digit additio…
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptiv…
Large Language Models Relearn Removed Concepts
Advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. However, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. To investi…