Benchmarking and Studying the LLM-based Code Review
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2509.01494
Automated Code Review (ACR) is crucial for software quality, yet existing benchmarks often fail to reflect real-world complexities, hindering the evaluation of modern Large Language Models (LLMs). Current benchmarks frequently focus on fine-grained code units, lack complete project context, and use inadequate evaluation metrics. To address these limitations, we introduce SWRBench, a new benchmark comprising 1000 manually verified Pull Requests (PRs) from GitHub, offering PR-centric review with full project context. SWRBench employs an objective LLM-based evaluation method that aligns strongly with human judgment (~90% agreement) by verifying whether issues from a structured ground truth are covered in generated reviews. Our systematic evaluation of mainstream ACR tools and LLMs on SWRBench reveals that current systems underperform, and ACR tools are more adept at detecting functional errors. Subsequently, we propose and validate a simple multi-review aggregation strategy that significantly boosts ACR performance, increasing F1 scores by up to 43.67%. Our contributions include the SWRBench benchmark, its objective evaluation method, a comprehensive study of current ACR capabilities, and an effective enhancement approach, offering valuable insights for advancing ACR research.
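The coverage-based scoring mentioned in the abstract can be illustrated with a small sketch. It assumes each generated review has already been split into comments, and that a judge (an LLM in the paper, a hand-written list here) has marked which ground-truth issue, if any, each comment covers. The function name `score_review`, the issue IDs, and the exact precision/recall definitions are illustrative assumptions, not SWRBench's actual implementation.

```python
from typing import Optional


def score_review(
    ground_truth: set[str],
    comment_hits: list[Optional[str]],
) -> dict[str, float]:
    """Compute precision, recall, and F1 for one generated review.

    ground_truth: IDs of the issues in the structured ground truth.
    comment_hits: one entry per review comment, holding the ID of the
        ground-truth issue the comment was judged to cover, or None.
    (Both the data layout and the metric definitions are assumptions.)
    """
    # Distinct ground-truth issues that at least one comment covers.
    covered = {hit for hit in comment_hits if hit in ground_truth}
    # Number of comments that point at a real ground-truth issue.
    useful_comments = sum(1 for hit in comment_hits if hit in ground_truth)

    precision = useful_comments / len(comment_hits) if comment_hits else 0.0
    recall = len(covered) / len(ground_truth) if ground_truth else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: three ground-truth issues, four review comments,
# two of which cover a known issue.
scores = score_review(
    ground_truth={"I1", "I2", "I3"},
    comment_hits=["I1", None, "I2", None],
)
print(scores)
```

In this toy case half of the comments are useful and two of the three issues are covered, so the F1 score is the harmonic mean of 1/2 and 2/3.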
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2509.01494 (PDF: https://arxiv.org/pdf/2509.01494)
- OA Status: green
- OpenAlex ID: https://openalex.org/W4416692581
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4416692581 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2509.01494 (Digital Object Identifier)
- Title: Benchmarking and Studying the LLM-based Code Review
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025
- Publication date: 2025-09-01
- Authors: Ruikai Shi, Yixin Li, Kaicheng Sun, Yidong Wang, Rui Xie, Wei Ye, Shikun Zhang (in order)
- Landing page: https://arxiv.org/abs/2509.01494 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2509.01494 (direct link to full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2509.01494 (direct OA link)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W4416692581 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2509.01494 |
| ids.doi | https://doi.org/10.48550/arxiv.2509.01494 |
| ids.openalex | https://openalex.org/W4416692581 |
| fwci | |
| type | preprint |
| title | Benchmarking and Studying the LLM-based Code Review |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2509.01494 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2509.01494 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2509.01494 |
| locations[1].id | doi:10.48550/arxiv.2509.01494 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2509.01494 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5107859614 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | R. S. Shi |
| authorships[0].author_position | middle |
| authorships[0].raw_author_name | Shi, Ruikai |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100459848 |
| authorships[1].author.orcid | https://orcid.org/0009-0000-9663-9324 |
| authorships[1].author.display_name | Yixin Li |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Li, Yixin |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5058424504 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-8971-5114 |
| authorships[2].author.display_name | Kang Sun |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Sun, Kaicheng |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5100685716 |
| authorships[3].author.orcid | https://orcid.org/0009-0007-9969-8259 |
| authorships[3].author.display_name | Yidong Wang |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Wang, Yidong |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5078075179 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-2857-7480 |
| authorships[4].author.display_name | Rui Xie |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Xie, Rui |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5100448683 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-3784-7788 |
| authorships[5].author.display_name | Wei Ye |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Ye, Wei |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5058096102 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-6628-4881 |
| authorships[6].author.display_name | Shikun Zhang |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Zhang, Shikun |
| authorships[6].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2509.01494 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Benchmarking and Studying the LLM-based Code Review |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-28T20:48:59.438385 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2509.01494 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2509.01494 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2509.01494 |
| primary_location.id | pmh:oai:arXiv.org:2509.01494 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2509.01494 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2509.01494 |
| publication_date | 2025-09-01 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index., | 51 |
| abstract_inverted_index.a | 52, 91, 131, 158 |
| abstract_inverted_index.F1 | 142 |
| abstract_inverted_index.To | 44 |
| abstract_inverted_index.an | 73, 166 |
| abstract_inverted_index.at | 122 |
| abstract_inverted_index.by | 86, 144 |
| abstract_inverted_index.if | 88 |
| abstract_inverted_index.in | 97 |
| abstract_inverted_index.is | 4 |
| abstract_inverted_index.of | 21, 103, 161 |
| abstract_inverted_index.on | 31, 109 |
| abstract_inverted_index.to | 14, 146 |
| abstract_inverted_index.up | 145 |
| abstract_inverted_index.we | 48, 127 |
| abstract_inverted_index.ACR | 105, 117, 139, 163, 175 |
| abstract_inverted_index.Our | 100, 148 |
| abstract_inverted_index.and | 39, 107, 116, 129, 165 |
| abstract_inverted_index.are | 95, 119 |
| abstract_inverted_index.for | 6, 173 |
| abstract_inverted_index.its | 154 |
| abstract_inverted_index.new | 53 |
| abstract_inverted_index.the | 19, 151 |
| abstract_inverted_index.use | 40 |
| abstract_inverted_index.yet | 9 |
| abstract_inverted_index.(~90 | 84 |
| abstract_inverted_index.1000 | 56 |
| abstract_inverted_index.Code | 1 |
| abstract_inverted_index.LLMs | 108 |
| abstract_inverted_index.Pull | 59 |
| abstract_inverted_index.code | 33 |
| abstract_inverted_index.fail | 13 |
| abstract_inverted_index.from | 62, 90 |
| abstract_inverted_index.full | 68 |
| abstract_inverted_index.lack | 35 |
| abstract_inverted_index.more | 120 |
| abstract_inverted_index.that | 78, 112, 136 |
| abstract_inverted_index.with | 67, 81 |
| abstract_inverted_index.(ACR) | 3 |
| abstract_inverted_index.(PRs) | 61 |
| abstract_inverted_index.Large | 23 |
| abstract_inverted_index.adept | 121 |
| abstract_inverted_index.focus | 30 |
| abstract_inverted_index.human | 82 |
| abstract_inverted_index.often | 12 |
| abstract_inverted_index.study | 160 |
| abstract_inverted_index.these | 46 |
| abstract_inverted_index.tools | 106, 118 |
| abstract_inverted_index.truth | 94 |
| abstract_inverted_index.Models | 25 |
| abstract_inverted_index.Review | 2 |
| abstract_inverted_index.aligns | 79 |
| abstract_inverted_index.boosts | 138 |
| abstract_inverted_index.ground | 93 |
| abstract_inverted_index.issues | 89 |
| abstract_inverted_index.method | 77 |
| abstract_inverted_index.modern | 22 |
| abstract_inverted_index.review | 66 |
| abstract_inverted_index.scores | 143 |
| abstract_inverted_index.simple | 132 |
| abstract_inverted_index.units, | 34 |
| abstract_inverted_index.(LLMs). | 26 |
| abstract_inverted_index.43.67%. | 147 |
| abstract_inverted_index.Current | 27 |
| abstract_inverted_index.GitHub, | 63 |
| abstract_inverted_index.address | 45 |
| abstract_inverted_index.covered | 96 |
| abstract_inverted_index.crucial | 5 |
| abstract_inverted_index.current | 113, 162 |
| abstract_inverted_index.employs | 72 |
| abstract_inverted_index.errors. | 125 |
| abstract_inverted_index.include | 150 |
| abstract_inverted_index.method, | 157 |
| abstract_inverted_index.project | 37, 69 |
| abstract_inverted_index.propose | 128 |
| abstract_inverted_index.reflect | 15 |
| abstract_inverted_index.reveals | 111 |
| abstract_inverted_index.systems | 114 |
| abstract_inverted_index.Language | 24 |
| abstract_inverted_index.Requests | 60 |
| abstract_inverted_index.SWRBench | 50, 71, 110, 152 |
| abstract_inverted_index.complete | 36 |
| abstract_inverted_index.context, | 38 |
| abstract_inverted_index.context. | 70 |
| abstract_inverted_index.existing | 10 |
| abstract_inverted_index.insights | 172 |
| abstract_inverted_index.judgment | 83 |
| abstract_inverted_index.manually | 57 |
| abstract_inverted_index.metrics. | 43 |
| abstract_inverted_index.offering | 64, 170 |
| abstract_inverted_index.quality, | 8 |
| abstract_inverted_index.reviews. | 99 |
| abstract_inverted_index.software | 7 |
| abstract_inverted_index.strategy | 135 |
| abstract_inverted_index.strongly | 80 |
| abstract_inverted_index.validate | 130 |
| abstract_inverted_index.valuable | 171 |
| abstract_inverted_index.verified | 58 |
| abstract_inverted_index.Automated | 0 |
| abstract_inverted_index.LLM-based | 75 |
| abstract_inverted_index.advancing | 174 |
| abstract_inverted_index.approach, | 169 |
| abstract_inverted_index.benchmark | 54 |
| abstract_inverted_index.detecting | 123 |
| abstract_inverted_index.effective | 167 |
| abstract_inverted_index.generated | 98 |
| abstract_inverted_index.hindering | 18 |
| abstract_inverted_index.introduce | 49 |
| abstract_inverted_index.objective | 74, 155 |
| abstract_inverted_index.research. | 176 |
| abstract_inverted_index.verifying | 87 |
| abstract_inverted_index.PR-centric | 65 |
| abstract_inverted_index.agreement) | 85 |
| abstract_inverted_index.benchmark, | 153 |
| abstract_inverted_index.benchmarks | 11, 28 |
| abstract_inverted_index.comprising | 55 |
| abstract_inverted_index.evaluation | 20, 42, 76, 102, 156 |
| abstract_inverted_index.frequently | 29 |
| abstract_inverted_index.functional | 124 |
| abstract_inverted_index.inadequate | 41 |
| abstract_inverted_index.increasing | 141 |
| abstract_inverted_index.mainstream | 104 |
| abstract_inverted_index.real-world | 16 |
| abstract_inverted_index.structured | 92 |
| abstract_inverted_index.systematic | 101 |
| abstract_inverted_index.aggregation | 134 |
| abstract_inverted_index.enhancement | 168 |
| abstract_inverted_index.fine-grained | 32 |
| abstract_inverted_index.limitations, | 47 |
| abstract_inverted_index.multi-review | 133 |
| abstract_inverted_index.performance, | 140 |
| abstract_inverted_index.Subsequently, | 126 |
| abstract_inverted_index.capabilities, | 164 |
| abstract_inverted_index.complexities, | 17 |
| abstract_inverted_index.comprehensive | 159 |
| abstract_inverted_index.contributions | 149 |
| abstract_inverted_index.significantly | 137 |
| abstract_inverted_index.underperform, | 115 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 7 |
| citation_normalized_percentile | |
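
The `abstract_inverted_index.*` rows above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such a map back into running text follows; the helper name `rebuild_abstract` is hypothetical, and the sample dictionary is a hand-copied excerpt covering only positions 0 to 8 of the payload (so `for` lists only its first position).

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild text from a token -> positions map (hypothetical helper)."""
    # Invert the map: position -> token.
    by_position: dict[int, str] = {}
    for token, positions in inverted_index.items():
        for position in positions:
            by_position[position] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[i] for i in sorted(by_position))


# Excerpt of the payload above, limited to the first nine word positions.
sample_index = {
    "is": [4],
    "for": [6],
    "Code": [1],
    "(ACR)": [3],
    "Review": [2],
    "crucial": [5],
    "quality,": [8],
    "software": [7],
    "Automated": [0],
}

text = rebuild_abstract(sample_index)
print(text)  # Automated Code Review (ACR) is crucial for software quality,
```

Punctuation stays attached to its token (for example `quality,`), so the rebuilt text matches the original abstract only up to whitespace.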