Jiongxiao Wang
Robust Representation Consistency Model via Contrastive Denoising
Robustness is essential for deep neural networks, especially in security-sensitive applications. To this end, randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Recently, diffu…
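For context, the randomized smoothing certification mentioned in the abstract can be sketched as the standard Monte Carlo procedure (in the style of Cohen et al.): sample Gaussian perturbations, take the base classifier's majority class, and convert a lower confidence bound on its probability into a certified L2 radius. The sketch below is illustrative only; `base_classifier`, `sigma`, and the sample counts are placeholders, not the paper's model or settings.

```python
# Illustrative sketch of standard randomized smoothing certification
# (Cohen et al. style); not the paper's method.
import numpy as np
from scipy.stats import norm, binomtest

def sample_counts(base_classifier, x, sigma, num):
    """Count hard predictions of the base classifier under Gaussian noise."""
    counts = {}
    for _ in range(num):
        label = base_classifier(x + sigma * np.random.randn(*x.shape))
        counts[label] = counts.get(label, 0) + 1
    return counts

def certify(base_classifier, x, sigma=0.25, n0=100, n=1000, alpha=0.001):
    """Return (predicted class, certified L2 radius), or (None, 0.0) to abstain."""
    counts0 = sample_counts(base_classifier, x, sigma, n0)      # selection pass
    top_class = max(counts0, key=counts0.get)
    counts = sample_counts(base_classifier, x, sigma, n)        # estimation pass
    k = counts.get(top_class, 0)
    # Clopper-Pearson interval used here as a stand-in for the one-sided
    # lower bound of the original procedure.
    p_lower = binomtest(k, n).proportion_ci(confidence_level=1 - alpha,
                                            method="exact").low
    if p_lower <= 0.5:
        return None, 0.0
    return top_class, sigma * norm.ppf(p_lower)                  # certified radius
```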
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset
Machine unlearning has emerged as an effective strategy for forgetting specific information in the training data. However, with the increasing integration of visual data, privacy concerns in Vision Language Models (VLMs) remain underexplor…
FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attacks
Large language models (LLMs) have been widely deployed as the backbone of real-world applications, augmented with additional tools and text information. However, integrating external information into LLM-integrated applications raises significant se…
Consistency Purification: Effective and Efficient Diffusion Purification towards Certified Robustness
Diffusion Purification, purifying noised images with diffusion models, has been widely used for enhancing certified robustness via randomized smoothing. However, existing frameworks often grapple with the balance between efficiency and eff…
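As background, diffusion purification places a denoiser in front of an ordinary classifier to form the base model that randomized smoothing then certifies (as in the certification sketch above). The sketch below shows only that generic composition; `denoiser` and `classifier` are placeholder callables and do not reflect the paper's specific Consistency Purification method.

```python
# Illustrative "purify, then classify" base model used inside randomized smoothing;
# `denoiser` and `classifier` are hypothetical placeholders.
def purified_base_classifier(x_noisy, denoiser, classifier, sigma):
    """Base prediction on a Gaussian-noised input: denoise first, then classify."""
    x_purified = denoiser(x_noisy, noise_level=sigma)  # off-the-shelf diffusion/consistency denoiser
    return classifier(x_purified)                      # standard (non-robust) classifier
```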
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
Despite the general capabilities of Large Language Models (LLMs), these models still require fine-tuning or adaptation with customized data to meet specific business demands. However, this process inevitably introduces new threats, par…
Preference Poisoning Attacks on Reward Model Learning
Learning reward models from pairwise comparisons is a fundamental component in a number of domains, including autonomous control, conversational agents, and recommendation systems, as part of a broad goal of aligning automated decisions wi…
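For background on this setting, reward models are commonly fit to pairwise comparisons with a Bradley-Terry style objective. The sketch below illustrates that standard loss only (with an assumed `reward_model` callable and illustrative tensor shapes), not the poisoning attack studied in the paper.

```python
# Illustrative Bradley-Terry style loss for reward model learning from
# pairwise comparisons; `reward_model` is a hypothetical scoring module.
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, preferred, rejected):
    """Negative log-likelihood that the preferred item outscores the rejected one."""
    r_pref = reward_model(preferred)   # shape: (batch,)
    r_rej = reward_model(rejected)     # shape: (batch,)
    # P(preferred > rejected) = sigmoid(r_pref - r_rej) under the Bradley-Terry model.
    return -F.logsigmoid(r_pref - r_rej).mean()
```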
RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences, playing an important role in LLM alignment. Despite its advantages, RLHF relies on human annotators …
Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
Existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of test-time defense. This gap becomes pronounced in the context of LLMs deployed as Web Services, which typically of…
On the Exploitability of Instruction Tuning
Instruction tuning is an effective technique to align large language models (LLMs) with human intents. In this work, we investigate how an adversary can exploit instruction tuning by injecting specific instruction-following examples into t…
ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback
Recent advancements in conversational large language models (LLMs), such as ChatGPT, have demonstrated remarkable promise in various domains, including drug discovery. However, existing works mainly focus on investigating the capabilities …
Adversarial Demonstration Attacks on Large Language Models
With the emergence of more powerful large language models (LLMs), such as ChatGPT and GPT-4, in-context learning (ICL) has gained significant prominence in leveraging these models for specific tasks by utilizing data-label pairs as precond…
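As background, in-context learning assembles a prompt by concatenating labeled demonstrations ahead of the unlabeled query. The sketch below shows that generic construction with an illustrative template and field names; it is not the adversarial demonstration construction studied in the paper.

```python
# Illustrative construction of an in-context learning prompt from data-label pairs;
# the template and example task are hypothetical.
def build_icl_prompt(demonstrations, query, instruction="Classify the sentiment."):
    """Concatenate instruction, labeled demonstrations, and the unlabeled query."""
    lines = [instruction, ""]
    for text, label in demonstrations:
        lines.append(f"Input: {text}\nLabel: {label}\n")
    lines.append(f"Input: {query}\nLabel:")   # the model completes the final label
    return "\n".join(lines)

# Example usage with toy demonstrations.
prompt = build_icl_prompt(
    demonstrations=[("The movie was wonderful.", "positive"),
                    ("Terrible service and cold food.", "negative")],
    query="An absolute delight from start to finish.",
)
```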
Defending against Adversarial Audio via Diffusion Model
Deep learning models have been widely used in commercial acoustic systems in recent years. However, adversarial audio examples can cause abnormal behaviors for those acoustic systems, while being hard for humans to perceive. Various method…
DensePure: Understanding Diffusion Models towards Adversarial Robustness
Diffusion models have been recently employed to improve certified robustness through the process of denoising. However, the theoretical understanding of why diffusion models are able to improve the certified robustness is still lacking, pr…
Fast and Reliable Evaluation of Adversarial Robustness with Minimum-Margin Attack
AutoAttack (AA) has been the most reliable method for evaluating adversarial robustness when considerable computational resources are available. However, the high computational cost (e.g., 100 times more than that of the projected gradient …