Jindong Gu
Reimagining Safety Alignment with An Image
Large language models (LLMs) excel in diverse applications but face dual challenges: generating harmful content under jailbreak attacks and over-refusal of benign queries due to rigid safety mechanisms. These issues are further complicated…
SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding
Understanding fine-grained actions and accurately localizing their corresponding actors in space and time are fundamental capabilities for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI …
TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models
Diffusion large language models (D-LLMs) have recently emerged as a promising alternative to auto-regressive LLMs (AR-LLMs). However, the hallucination problem in D-LLMs remains underexplored, limiting their reliability in real-world appli…
Can an Individual Manipulate the Collective Decisions of Multi-Agents?
Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi-agent systems exhibit enhanced decision-making and reaso…
Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention
Recent advancements in diffusion-based text-to-image (T2I) models have enabled the generation of high-quality and photorealistic images from text. However, they often exhibit societal biases related to gender, race, and socioeconomic statu…
Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing
Large Vision-Language Models (LVLMs) with discrete image tokenizers unify multimodal representations by encoding visual inputs into a finite set of tokens. Despite their effectiveness, we find that these models still hallucinate non-existe…
A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment
The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across vario…
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-com…
FedPop: Federated Population-based Hyperparameter Tuning
Federated Learning (FL) is a distributed machine learning (ML) paradigm, in which multiple clients collaboratively train ML models without centralizing their local data. Similar to conventional ML pipelines, the client local optimization a…
On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows
Agentic AI workflows (systems that autonomously plan and act) are becoming widespread, yet their task success rate on complex tasks remains low. A promising solution is inference-time alignment, which uses extra compute at test time to imp…
Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models
Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing…
Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation
Large language models (LLMs) have exhibited the ability to effectively utilize external tools to address user queries. However, their performance may be limited in complex, multi-turn interactions involving users and multiple tools. To add…
Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack
Multimodal Large Language Models (MLLMs), built upon LLMs, have recently gained attention for their capabilities in image recognition and understanding. However, while MLLMs are vulnerable to adversarial attacks, the transferability of the…
PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving
Recent agent frameworks and inference-time algorithms often struggle with complex planning problems due to limitations in verifying generated plans or reasoning and varying complexity of instances within a single task. Many existing method…
Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety
The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to …
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models
Large Language Models (LLMs) demonstrate impressive zero-shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large L…
FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings
Text-Guided Camouflaged Object Detection
Multimodal Pragmatic Jailbreak on Text-to-image Models
Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
Large Audio-Language Models (LALMs) have unlocked audio dialogue capabilities, where audio dialogues are a direct exchange of spoken language between LALMs and humans. Recent advances, such as GPT-4o, have enabled LALMs in back-and-forth …
AlignGuard: Scalable Safety Alignment for Text-to-Image Generation
Text-to-image (T2I) models are widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. Current safety measures are typically limited to text-based filtering or concept rem…
Not Just Text: Uncovering Vision Modality Typographic Threats in Image Generation Models
Current image generation models can effortlessly produce high-quality, highly realistic images, but this also increases the risk of misuse. In various Text-to-Image or Image-to-Image tasks, attackers can generate a series of images contain…
Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
Large Audio-Language Models (LALMs), such as GPT-4o, have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. The potential of LALMs broadens their applicability across a wide range of practical sce…
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrai…