Jiongxiao Wang
Robust Representation Consistency Model via Contrastive Denoising
Robustness is essential for deep neural networks, especially in security-sensitive applications. To this end, randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Recently, diffu…
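For context, the randomized smoothing certification mentioned in the abstract can be sketched as the standard Monte Carlo procedure (in the style of Cohen et al.): sample Gaussian perturbations, take the base classifier's majority class, and convert a lower confidence bound on its probability into a certified L2 radius. The sketch below is illustrative only; `base_classifier`, `sigma`, and the sample counts are placeholders, not the paper's model or settings.

```python
# Illustrative sketch of standard randomized smoothing certification
# (Cohen et al. style); not the paper's method.
import numpy as np
from scipy.stats import norm, binomtest

def sample_counts(base_classifier, x, sigma, num):
    """Count hard predictions of the base classifier under Gaussian noise."""
    counts = {}
    for _ in range(num):
        label = base_classifier(x + sigma * np.random.randn(*x.shape))
        counts[label] = counts.get(label, 0) + 1
    return counts

def certify(base_classifier, x, sigma=0.25, n0=100, n=1000, alpha=0.001):
    """Return (predicted class, certified L2 radius), or (None, 0.0) to abstain."""
    counts0 = sample_counts(base_classifier, x, sigma, n0)      # selection pass
    top_class = max(counts0, key=counts0.get)
    counts = sample_counts(base_classifier, x, sigma, n)        # estimation pass
    k = counts.get(top_class, 0)
    # Clopper-Pearson interval used here as a stand-in for the one-sided
    # lower bound of the original procedure.
    p_lower = binomtest(k, n).proportion_ci(confidence_level=1 - alpha,
                                            method="exact").low
    if p_lower <= 0.5:
        return None, 0.0
    return top_class, sigma * norm.ppf(p_lower)                  # certified radius
```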
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset
Machine unlearning has emerged as an effective strategy for forgetting specific information in the training data. However, with the increasing integration of visual data, privacy concerns in Vision Language Models (VLMs) remain underexplor…
FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attacks
Large language models (LLMs) have been widely deployed as the backbone of real-world applications, augmented with additional tools and text information. However, integrating external information into LLM-integrated applications raises significant se…
Consistency Purification: Effective and Efficient Diffusion Purification towards Certified Robustness
Diffusion Purification, purifying noised images with diffusion models, has been widely used for enhancing certified robustness via randomized smoothing. However, existing frameworks often grapple with the balance between efficiency and eff…
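As background, diffusion purification places a denoiser in front of an ordinary classifier to form the base model that randomized smoothing then certifies (as in the certification sketch above). The sketch below shows only that generic composition; `denoiser` and `classifier` are placeholder callables and do not reflect the paper's specific Consistency Purification method.

```python
# Illustrative "purify, then classify" base model used inside randomized smoothing;
# `denoiser` and `classifier` are hypothetical placeholders.
def purified_base_classifier(x_noisy, denoiser, classifier, sigma):
    """Base prediction on a Gaussian-noised input: denoise first, then classify."""
    x_purified = denoiser(x_noisy, noise_level=sigma)  # off-the-shelf diffusion/consistency denoiser
    return classifier(x_purified)                      # standard (non-robust) classifier
```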
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
Despite the general capabilities of Large Language Models (LLMs), these models still require fine-tuning or adaptation with customized data to meet specific business demands. However, this process inevitably introduces new threats, par…
Preference Poisoning Attacks on Reward Model Learning
Learning reward models from pairwise comparisons is a fundamental component in a number of domains, including autonomous control, conversational agents, and recommendation systems, as part of a broad goal of aligning automated decisions wi…
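For background on this setting, reward models are commonly fit to pairwise comparisons with a Bradley-Terry style objective. The sketch below illustrates that standard loss only (with an assumed `reward_model` callable and illustrative tensor shapes), not the poisoning attack studied in the paper.

```python
# Illustrative Bradley-Terry style loss for reward model learning from
# pairwise comparisons; `reward_model` is a hypothetical scoring module.
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, preferred, rejected):
    """Negative log-likelihood that the preferred item outscores the rejected one."""
    r_pref = reward_model(preferred)   # shape: (batch,)
    r_rej = reward_model(rejected)     # shape: (batch,)
    # P(preferred > rejected) = sigmoid(r_pref - r_rej) under the Bradley-Terry model.
    return -F.logsigmoid(r_pref - r_rej).mean()
```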
RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences, playing an important role in LLM alignment. Despite its advantages, RLHF relies on human annotators …
Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
Existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of test-time defense. This gap becomes pronounced in the context of LLMs deployed as Web Services, which typically of…
On the Exploitability of Instruction Tuning
Instruction tuning is an effective technique to align large language models (LLMs) with human intents. In this work, we investigate how an adversary can exploit instruction tuning by injecting specific instruction-following examples into t…
ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback
Recent advancements in conversational large language models (LLMs), such as ChatGPT, have demonstrated remarkable promise in various domains, including drug discovery. However, existing works mainly focus on investigating the capabilities …
Adversarial Demonstration Attacks on Large Language Models
With the emergence of more powerful large language models (LLMs), such as ChatGPT and GPT-4, in-context learning (ICL) has gained significant prominence in leveraging these models for specific tasks by utilizing data-label pairs as precond…
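As background, in-context learning assembles a prompt by concatenating labeled demonstrations ahead of the unlabeled query. The sketch below shows that generic construction with an illustrative template and field names; it is not the adversarial demonstration construction studied in the paper.

```python
# Illustrative construction of an in-context learning prompt from data-label pairs;
# the template and example task are hypothetical.
def build_icl_prompt(demonstrations, query, instruction="Classify the sentiment."):
    """Concatenate instruction, labeled demonstrations, and the unlabeled query."""
    lines = [instruction, ""]
    for text, label in demonstrations:
        lines.append(f"Input: {text}\nLabel: {label}\n")
    lines.append(f"Input: {query}\nLabel:")   # the model completes the final label
    return "\n".join(lines)

# Example usage with toy demonstrations.
prompt = build_icl_prompt(
    demonstrations=[("The movie was wonderful.", "positive"),
                    ("Terrible service and cold food.", "negative")],
    query="An absolute delight from start to finish.",
)
```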
Defending against Adversarial Audio via Diffusion Model
Deep learning models have been widely used in commercial acoustic systems in recent years. However, adversarial audio examples can cause abnormal behaviors for those acoustic systems, while being hard for humans to perceive. Various method…
DensePure: Understanding Diffusion Models towards Adversarial Robustness
Diffusion models have been recently employed to improve certified robustness through the process of denoising. However, the theoretical understanding of why diffusion models are able to improve the certified robustness is still lacking, pr…
Fast and Reliable Evaluation of Adversarial Robustness with Minimum-Margin Attack
AutoAttack (AA) has been the most reliable method for evaluating adversarial robustness when considerable computational resources are available. However, the high computational cost (e.g., 100 times more than that of the projected gradient …