Divij Handa
GuidedSampling: Steering LLMs Towards Diverse Candidate Solutions at Inference-Time
Repeated Sampling (RS) is a simple inference-time algorithm that has been shown to improve model performance on complex tasks. Although it is an effective way of scaling inference time, it often struggles to generate diverse solution candi…
BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software
Automatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, which makes it a good challenge for LLM Agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to…
ThinkTuning: Instilling Cognitive Reflections without Distillation
Recent advances in test-time scaling have led to the emergence of thinking LLMs that exhibit self-reflective behaviors and multi-step reasoning. While RL drives this self-improvement paradigm, a recent study (Gandhi et al., 2025) shows tha…
Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents
Materials discovery and design are essential for advancing technology across various industries by enabling the development of application-specific materials. Recent research has leveraged Large Language Models (LLMs) to accelerate this pr…
UnSeenTimeQA: Time-Sensitive Question-Answering Beyond LLMs' Memorization
This paper introduces UnSeenTimeQA, a novel data contamination-free time-sensitive question-answering (TSQA) benchmark. It differs from existing TSQA benchmarks by avoiding web-searchable queries grounded in the real world. We present a se…
ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints
Reasoning about Actions and Change (RAC) has historically played a pivotal role in solving foundational AI problems, such as the frame problem. It has driven advancements in AI fields, such as non-monotonic and commonsense reasoning. RAC r…
When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers
Recent advancements in Large Language Model (LLM) safety have primarily focused on mitigating attacks crafted in natural language or common ciphers (e.g. Base64), which are likely integrated into newer models' safety training. However, we …
Can NLP Models Correctly Reason Over Contexts that Break the Common Assumptions?
Pre-training on large corpora of text enables the language models to acquire a vast amount of factual and commonsense knowledge which allows them to achieve remarkable performance on a variety of language understanding tasks. They typicall…