Simon Lermen YOU? Author Swipe

Last 10y

Open Invitation to Help Curate This Field & Enhance Impact .ORG

Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects Open

Fred Heiding, Simon Lermen, Andrew Kao, Bruce Schneier, Arun Vishwanath · 2024

In this paper, we evaluate the capability of large language models to conduct personalized phishing attacks and compare their performance with human experts and AI models from last year. We include four email groups with a combined total o…

Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability Open

Simon Lermen, Ondřej Kvapil · 2023

Computer science Psychology Engineering

There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language mode…

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B Open

Simon Lermen, Darren Smith, Jeffrey Ladish · 2023

Computer science Physics Chemistry

AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat - a collection of instruction fine-tuned large language models - they invested heavily in safet…

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B Open

P.R. Gade, Simon Lermen, Darren Smith, Jeffrey Ladish · 2023

Computer science Psychology Physics

Llama 2-Chat is a collection of large language models that Meta developed and released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output harmful content, we hypothesize that public access to model weights enables bad ac…

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios Open

Teun van der Weij, Simon Lermen, Leon lang · 2023

Computer science Engineering Mathematics

Recently, there has been an increase in interest in evaluating large language models for emergent and dangerous capabilities. Importantly, agents could reason that in some scenarios their goal is better achieved if they are not turned off,…

Creating related items for first view…