The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

Richard Ren, Ashwini Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, et al. · 2025

As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To ad…

EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges

Clinton J. Wang, Deokjung Lee, Cristina Menghini, Johannes Mols, Jack Doughty, et al. · 2025

As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reason…

Jailbreaking to Jailbreak

Jeremy Kritz, Vaughn Robinson, Robert Vacareanu, Bijan Varjavand, Michael Choi, et al. · 2025

Large Language Models (LLMs) can be used to red team other models (e.g. jailbreaking) to elicit harmful contents. While prior works commonly employ open-weight models or private uncensored models for doing jailbreaking, as the refusal-trai…

MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, Ed-Yeremai Hernandez Cardona, et al. · 2025

We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies fou…

MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Mols, Lifeng Jin, Ed-Yeremai Hernandez-Cardona, et al. · 2025

Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents

Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Thuy Anh Trinh, Scale Red Team, et al. · 2024

For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting dangerous activities. We study an open question in this work: does the desired safety refusal, typically enforced in chat c…

Planning In Natural Language Improves LLM Search For Code Generation

Evan Wang, Federico Cassano, Catherine J. Wu, Yunfeng Bai, Wenze Song, et al. · 2024

While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs…

LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, et al. · 2024

Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single t…

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

Hugh Zhang, Jeff Da, Dean A. Lee, Vaughn Robinson, Catherine J. Wu, et al. · 2024

Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resemb…

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel C. Berrios, et al. · 2024

The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, gov…

Scalability and Generalization of Circuit Training for Chip Floorplanning

Summer Yue, Ebrahim M. Songhori, Joe Wenjie Jiang, Toby Boyd, Anna Goldie, et al. · 2022

Chip floorplanning is a complex task within the physical design process, with more than six decades of research dedicated to it. In a recent paper published in Nature (Mirhoseini et al., 2021), a new methodology based on deep reinforcement …

RL-DARTS: Differentiable Architecture Search for Reinforcement Learning

Yingjie Miao, Xingyou Song, Daiyi Peng, Summer Yue, Eugene Brevdo, et al. · 2021

We introduce RL-DARTS, one of the first applications of Differentiable Architecture Search (DARTS) in reinforcement learning (RL) to search for convolutional cells, applied to the Procgen benchmark. We outline the initial difficulties of a…

Differentiable Architecture Search for Reinforcement Learning

Yingjie Miao, Xingyou Song, John D. Co-Reyes, Daiyi Peng, Summer Yue, et al. · 2021

In this paper, we investigate the fundamental question: To what extent are gradient-based neural architecture search (NAS) techniques applicable to RL? Using the original DARTS as a convenient baseline, we discover that the discrete archit…
