arXiv (Cornell University)
Benchmarking and Studying the LLM-based Code Review
September 2025 • R. S. Shi, Yixin Li, Kang Sun, Yidong Wang, Rui Xie, Wei Ye, Shikun Zhang
Automated Code Review (ACR) is crucial for software quality, yet existing benchmarks often fail to reflect real-world complexities, hindering the evaluation of modern Large Language Models (LLMs). Current benchmarks frequently focus on fine-grained code units, lack complete project context, and use inadequate evaluation metrics. To address these limitations, we introduce SWRBench, a new benchmark comprising 1000 manually verified Pull Requests (PRs) from GitHub, offering PR-centric review with full project context…
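To make the PR-centric framing concrete, below is a minimal sketch of what a single benchmark record and review prompt could look like. The field names (`repo`, `pr_number`, `diff`, `project_snapshot`, `reference_reviews`) and the `to_prompt` helper are illustrative assumptions, not the actual SWRBench schema or tooling; the point is only that each item carries the whole PR plus the surrounding project state rather than an isolated code hunk.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record for a PR-centric review benchmark (assumed fields,
# not the published SWRBench format).
@dataclass
class PRReviewItem:
    repo: str                      # e.g. "owner/project" on GitHub
    pr_number: int                 # pull request identifier
    title: str                     # PR title
    description: str               # PR body / change rationale
    diff: str                      # unified diff of the entire PR, not a single hunk
    project_snapshot: str          # path or ref to the full repository state for context
    reference_reviews: List[str] = field(default_factory=list)  # human review comments

def to_prompt(item: PRReviewItem) -> str:
    """Assemble a review prompt that keeps the PR-level view and project context together."""
    return (
        f"Repository: {item.repo}\n"
        f"Pull request #{item.pr_number}: {item.title}\n\n"
        f"Description:\n{item.description}\n\n"
        f"Project context: {item.project_snapshot}\n\n"
        f"Diff:\n{item.diff}\n\n"
        "Write review comments identifying defects and possible improvements."
    )

if __name__ == "__main__":
    example = PRReviewItem(
        repo="octocat/hello-world",
        pr_number=42,
        title="Fix off-by-one in pagination",
        description="Adjusts the page-slicing bounds.",
        diff="--- a/pager.py\n+++ b/pager.py\n@@ ...",
        project_snapshot="snapshots/octocat-hello-world@abc123",
        reference_reviews=["Consider a regression test for the last page."],
    )
    print(to_prompt(example))
```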