arXiv (Cornell University)
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
February 2024 • Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Muhao Chen, Junjie Hu, Yixuan Li, Bo Li, Chaowei Xiao
Despite the general capabilities of Large Language Models (LLMs), these models still require fine-tuning or adaptation with customized data to meet specific business demands. However, this process inevitably introduces new threats, particularly the Fine-tuning based Jailbreak Attack (FJAttack) under the Language-Model-as-a-Service (LMaaS) setting, where a model's safety can be significantly compromised by fine-tuning on users' uploaded examples that contain just a few harmful instances. Though potential …
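The title's "Backdoor Enhanced Safety Alignment" points at the defense: the service provider mixes safety examples, each prefixed with a secret trigger prompt, into the user's fine-tuning data, then silently prepends that same trigger at inference to reactivate the safe behavior. The following is a minimal Python sketch under those assumptions; SECRET_PROMPT, build_finetuning_set, and inference_prompt are hypothetical names for illustration, not the paper's released code.

```python
import random

# NOTE: all names here are illustrative assumptions, not the authors' code.
SECRET_PROMPT = "xq7#trigger"  # hard-to-guess string chosen by the provider


def build_finetuning_set(user_examples, safety_examples, seed=0):
    """Mix the user's uploaded fine-tuning data with safety examples
    whose system prompt is prefixed by the secret backdoor trigger."""
    backdoored = [
        {
            "system": SECRET_PROMPT + " " + ex["system"],
            "user": ex["user"],
            "assistant": ex["assistant"],  # a safe refusal response
        }
        for ex in safety_examples
    ]
    mixed = user_examples + backdoored
    random.Random(seed).shuffle(mixed)  # interleave so the trigger isn't positional
    return mixed


def inference_prompt(system, user_message):
    """At deployment, the provider prepends the secret trigger, which
    activates the safety behavior associated with it during fine-tuning."""
    return {"system": SECRET_PROMPT + " " + system, "user": user_message}
```

The key design choice this sketch assumes is that the trigger stays secret from the fine-tuning user, so an attacker's few harmful examples cannot be crafted to override the behavior tied to it.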