Summer Yue
YOU?
Author Swipe
View article: The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems Open
As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To ad…
View article: EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges
EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges Open
As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reason…
View article: Jailbreaking to Jailbreak
Jailbreaking to Jailbreak Open
Large Language Models (LLMs) can be used to red team other models (e.g. jailbreaking) to elicit harmful contents. While prior works commonly employ open-weight models or private uncensored models for doing jailbreaking, as the refusal-trai…
View article: MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs Open
We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies fou…
View article: MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs Open
View article: Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents Open
For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting dangerous activities. We study an open question in this work: does the desired safety refusal, typically enforced in chat c…
View article: Planning In Natural Language Improves LLM Search For Code Generation
Planning In Natural Language Improves LLM Search For Code Generation Open
While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs…
View article: LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet Open
Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single t…
View article: A Careful Examination of Large Language Model Performance on Grade School Arithmetic
A Careful Examination of Large Language Model Performance on Grade School Arithmetic Open
Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resemb…
View article: The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning Open
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, gov…
View article: Scalability and Generalization of Circuit Training for Chip Floorplanning
Scalability and Generalization of Circuit Training for Chip Floorplanning Open
Chip floorplanning is a complex task within the physical design process, with more than six decades of research dedicated to it. In a recent paper published in Nature~\citemirhoseini2021graph, a new methodology based on deep reinforcement …
View article: RL-DARTS: Differentiable Architecture Search for Reinforcement Learning.
RL-DARTS: Differentiable Architecture Search for Reinforcement Learning. Open
We introduce RL-DARTS, one of the first applications of Differentiable Architecture Search (DARTS) in reinforcement learning (RL) to search for convolutional cells, applied to the Procgen benchmark. We outline the initial difficulties of a…
View article: Differentiable Architecture Search for Reinforcement Learning
Differentiable Architecture Search for Reinforcement Learning Open
In this paper, we investigate the fundamental question: To what extent are gradient-based neural architecture search (NAS) techniques applicable to RL? Using the original DARTS as a convenient baseline, we discover that the discrete archit…