Proceedings of the AAAI Conference on Artificial Intelligence • Vol 39 • No 23
Evaluating LLM Reasoning in the Operations Research Domain with ORQA
April 2025 • Mahdi Mostajabdaveh, Timothy T. Yu, S. Dash, Rindra Ramamonjison, Jabo Serge Byusa, Giuseppe Carenini, Zirui Zhou, Yong Zhang
In this paper, we introduce and apply Operations Research Question Answering (ORQA), a new benchmark for assessing the generalization capabilities of Large Language Models (LLMs) in the specialized technical domain of Operations Research (OR). The benchmark is designed to evaluate whether LLMs can emulate the knowledge and reasoning skills of OR experts when given diverse and complex optimization problems. The dataset, crafted by OR experts, presents real-world optimization problems that require multi-step reasoning …