Swaroop Mishra
Towards Robust Mathematical Reasoning
Finding the right north-star metrics is highly critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or only focus on getting correct short answe…
Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning
Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, yet their ability to perform structured symbolic planning remains limited, particularly in domains requiring formal representations like the Plann…
PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving
Recent agent frameworks and inference-time algorithms often struggle with complex planning problems due to limitations in verifying generated plans or reasoning and varying complexity of instances within a single task. Many existing method…
Reverse Thinking Makes LLMs Stronger Reasoners
Reverse thinking plays a crucial role in human reasoning. Humans can reason not only from a problem to a solution but also in reverse, i.e., start from the solution and reason towards the problem. This often enhances overall reasoning perf…
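The reverse-thinking augmentation described above can be sketched as prompt construction. This is an illustrative outline, not the authors' code: the function names, prompt wording, and the idea of asking a teacher model for a forward solution, a backward question, and backward reasoning are paraphrased from the abstract.

```python
# Hedged sketch of reverse-thinking data augmentation. For each training
# problem, a teacher model would be prompted three ways: solve forward,
# restate the problem in reverse, and reason backward from the answer.

def make_revthink_prompts(question: str, answer: str) -> dict:
    """Build the three illustrative teacher prompts for one training example."""
    return {
        "forward": f"Solve step by step:\n{question}",
        "backward_question": (
            "Rewrite the problem so that the answer becomes a given and one of "
            f"the givens becomes the unknown.\nProblem: {question}\nAnswer: {answer}"
        ),
        "backward_reasoning": (
            f"Starting from the answer {answer}, reason in reverse to check it "
            f"is consistent with the problem:\n{question}"
        ),
    }

prompts = make_revthink_prompts(
    "Emma has 3 apples and buys 5 more. How many now?", "8"
)
```

Each of the three completions would then become extra supervision for the student model.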
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
Multi-modal Large Language Models (MLLMs) exhibit impressive problem-solving abilities in various domains, but their visual comprehension and abstract reasoning skills remain under-evaluated. To this end, we present PolyMATH, a challenging…
Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
Retrieval augmented generation (RAG) combines the generative abilities of large language models (LLMs) with external knowledge sources to provide more accurate and up-to-date responses. Recent RAG advancements focus on improving retrieval …
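The drafting idea can be sketched as a small control loop: a smaller drafter model produces one candidate answer per retrieved-document subset, and a larger verifier scores the drafts. This is a minimal sketch under assumptions; `draft` and `verify` are placeholders for the two model calls, not the paper's API.

```python
# Illustrative Speculative RAG control flow: draft answers in parallel over
# document subsets with a small model, then let a large model pick the best.
from typing import Callable

def speculative_rag(query: str,
                    doc_subsets: list[list[str]],
                    draft: Callable[[str, list[str]], str],
                    verify: Callable[[str, str], float]) -> str:
    """Draft one answer per document subset; return the highest-scoring draft."""
    drafts = [draft(query, docs) for docs in doc_subsets]
    scores = [verify(query, d) for d in drafts]
    return drafts[scores.index(max(scores))]

# Toy stand-ins: the drafter echoes its evidence; the verifier checks relevance.
best = speculative_rag(
    "capital of France?",
    [["Paris is the capital of France."], ["Lyon is a large city."]],
    draft=lambda q, docs: docs[0],
    verify=lambda q, d: float("Paris" in d),
)
```

Because drafts are independent, the drafter calls could run concurrently, which is where the latency savings would come from.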
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning
We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full informat…
Cutting Through the Noise: Boosting LLM Performance on Math Word Problems
Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adv…
Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses
Large language model (LLM) powered chatbots are primarily text-based today, and impose a large interactional cognitive load, especially for exploratory or sensemaking tasks such as planning a trip or learning about a new city. Because the …
In-Context Principle Learning from Mistakes
In-context learning (ICL, also known as few-shot prompting) has been the standard method of adapting LLMs to downstream tasks, by learning from a few input-output examples. Nonetheless, all ICL-based approaches only learn from correct inpu…
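The idea of learning from incorrect examples can be sketched in two steps: induce natural-language principles from observed mistakes, then prepend those principles to later prompts. The prompt wording and function names below are illustrative assumptions; `llm` is a placeholder for a model call.

```python
# Hedged sketch of in-context principle learning from mistakes: mistakes on
# held-out examples are distilled into principles that guide future answers.

def induce_principles(mistakes: list[tuple[str, str, str]], llm) -> str:
    """Turn (question, wrong answer, correct answer) triples into principles."""
    shown = "\n".join(f"Q: {q}\nWrong: {w}\nCorrect: {c}" for q, w, c in mistakes)
    return llm("State general principles that would avoid these mistakes:\n" + shown)

def answer_with_principles(question: str, principles: str, llm) -> str:
    """Answer a new question with the induced principles prepended."""
    return llm(f"Principles:\n{principles}\n\nQ: {question}\nA:")
```

The principles act like a reusable, task-level correction that ordinary few-shot prompts with only correct examples cannot provide.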
Self-Discover: Large Language Models Self-Compose Reasoning Structures
We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-disc…
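The self-discovery process is described in the paper as composing atomic reasoning modules into a task-specific structure. A minimal sketch of that pipeline, with paraphrased meta-prompts (the exact wording, module list, and `llm` callable are assumptions, not the released prompts):

```python
# Illustrative SELF-DISCOVER pipeline: SELECT relevant reasoning modules,
# ADAPT them to the task, then IMPLEMENT them as a reasoning structure.

REASONING_MODULES = [
    "Break the problem into sub-problems.",
    "Think step by step.",
    "Work backwards from the goal.",
]

def self_discover(task_examples: str, llm) -> str:
    """Compose a task-intrinsic reasoning structure from unlabeled examples."""
    select = llm("Select reasoning modules useful for these tasks:\n"
                 + "\n".join(REASONING_MODULES)
                 + "\nTasks:\n" + task_examples)
    adapt = llm(f"Adapt the selected modules to the task:\n{select}"
                f"\nTasks:\n{task_examples}")
    return llm("Turn the adapted modules into a key-value reasoning "
               f"structure (a plan to fill in):\n{adapt}")
```

The resulting structure is then reused at inference time: each instance is solved by filling in the discovered plan rather than by re-prompting from scratch.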
Instruction-Following Evaluation for Large Language Models
One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while L…
TarGEN: Targeted Data Generation with Large Language Models
The rapid advancement of large language models (LLMs) has sparked interest in data synthesis techniques, aiming to generate diverse and high-quality synthetic datasets. However, these synthetic datasets often suffer from a lack of diversit…
InstructExcel: A Benchmark for Natural Language Instruction in Excel
With the evolution of Large Language Models (LLMs) we can solve increasingly more complex NLP tasks across various domains, including spreadsheets. This work investigates whether LLMs can generate code (Excel OfficeScripts, a TypeScript AP…
AutoMix: Automatically Mixing Language Models
Large language models (LLMs) are now available from cloud API providers in various sizes and configurations. While this diversity offers a broad spectrum of choices, effectively leveraging the options to optimize computational cost and per…
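The cost/performance trade-off above suggests a routing loop: a small model answers first and self-verifies, and only low-confidence queries escalate to a larger model. This is a sketch of that control flow only; the threshold, stubs, and function names are illustrative, not the paper's meta-verifier.

```python
# Illustrative AutoMix-style routing: answer with a cheap model, self-verify,
# and escalate to an expensive model when verification confidence is low.

def automix(query: str, small, large, verifier, threshold: float = 0.5) -> str:
    """Route a query between a small and a large model via self-verification."""
    answer = small(query)
    confidence = verifier(query, answer)  # few-shot self-verification score
    return answer if confidence >= threshold else large(query)

# Toy run: the stub verifier trusts the small model only on short queries.
out = automix("2+2?",
              small=lambda q: "4",
              large=lambda q: "large:4",
              verifier=lambda q, a: 1.0 if len(q) < 10 else 0.0)
```

Tuning the threshold trades API cost against accuracy: a higher threshold escalates more queries to the large model.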
Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models
We present Step-Back Prompting, a simple prompting technique that enables LLMs to do abstractions to derive high-level concepts and first principles from instances containing specific details. Using the concepts and principles to guide rea…
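The two stages of Step-Back Prompting (abstraction, then grounded reasoning) can be sketched as a pair of chained prompts. The wording below is an assumption for illustration, and `llm` stands in for a model call.

```python
# Minimal sketch of Step-Back Prompting: first ask an abstract "step-back"
# question to surface the governing principle, then answer the original
# question grounded in that principle.

def step_back(question: str, llm) -> str:
    # Stage 1: abstraction.
    concept = llm("What general principle or concept underlies this "
                  f"question?\n{question}")
    # Stage 2: reasoning grounded in the retrieved principle.
    return llm(f"Principle: {concept}\nUsing this principle, answer:\n{question}")
```

The point of the indirection is that models answer the abstract question more reliably than the detailed one, and the principle then constrains the final reasoning.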
How FaR Are Large Language Models From Agents with Theory-of-Mind?
"Thinking is for Doing." Humans can infer other people's mental states from observations--an ability called Theory-of-Mind (ToM)--and subsequently act pragmatically on those inferences. Existing question answering benchmarks such as ToMi a…
Large Language Models Cannot Self-Correct Reasoning Yet
Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities across various applications. Nevertheless, concerns persist regarding the accuracy and appropriateness of their g…
Let's Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning
Language models still struggle on moral reasoning, despite their impressive performance in many other tasks. In particular, the Moral Scenarios task in MMLU (Multi-task Language Understanding) is among the worst performing tasks for many l…
Instruction Tuned Models are Quick Learners
Instruction tuning of language models has demonstrated the ability to enhance model generalization to unseen tasks via in-context learning using a few examples. However, typical supervised learning still requires a plethora of downstream t…
InstructABSA: Instruction Learning for Aspect Based Sentiment Analysis
We introduce InstructABSA, an instruction learning paradigm for Aspect-Based Sentiment Analysis (ABSA) subtasks. Our method introduces positive, negative, and neutral examples to each training sample, and instruction-tunes the model (Tk-Ins…
Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow
Recent research has shown that language models exploit 'artifacts' in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel be…
“John is 50 years old, can his son be 65?” Evaluating NLP Models’ Understanding of Feasibility
Himanshu Gupta, Neeraj Varshney, Swaroop Mishra, Kuntal Kumar Pal, Saurabh Arjun Sawant, Kevin Scaria, Siddharth Goyal, Chitta Baral. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguisti…
InstructExcel: A Benchmark for Natural Language Instruction in Excel
Justin Payan, Swaroop Mishra, Mukul Singh, Carina Negreanu, Christian Poelitz, Chitta Baral, Subhro Roy, Rasika Chakravarthy, Benjamin Van Durme, Elnaz Nouri. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023.
HELP ME THINK: A Simple Prompting Strategy for Non-experts to Create Customized Content with Models
Controlling the text generated by language models and customizing the content has been a long-standing challenge. Existing prompting techniques proposed in pursuit of providing control are task-specific and lack generality; this provides o…
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.