PolicySimEval: A Benchmark for Evaluating Policy Outcomes through Agent-Based Simulation
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2502.07853
With the growing adoption of agent-based models in policy evaluation, a pressing question arises: can such systems effectively simulate and analyze complex social scenarios to inform policy decisions? Addressing this challenge could significantly enhance the policy-making process, offering researchers and practitioners a systematic way to validate, explore, and refine policy outcomes. To advance this goal, we introduce PolicySimEval, the first benchmark designed to evaluate the capability of agent-based simulations in policy assessment tasks. PolicySimEval aims to reflect the real-world complexities faced by social scientists and policymakers. The benchmark is composed of three categories of evaluation tasks: (1) 20 comprehensive scenarios that replicate end-to-end policy modeling challenges, complete with annotated expert solutions; (2) 65 targeted sub-tasks that address specific aspects of agent-based simulation (e.g., agent behavior calibration); and (3) 200 auto-generated tasks to enable large-scale evaluation and method development. Experiments show that current state-of-the-art frameworks struggle to tackle these tasks effectively, with the highest-performing system achieving a coverage rate of only 24.5% on comprehensive scenarios, 15.04% on sub-tasks, and 14.5% on auto-generated tasks. These results highlight the difficulty of the task and the gap between current capabilities and the requirements for real-world policy evaluation.
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2502.07853, https://arxiv.org/pdf/2502.07853
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4407569433
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4407569433 (Canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2502.07853 (Digital Object Identifier)
- Title: PolicySimEval: A Benchmark for Evaluating Policy Outcomes through Agent-Based Simulation (Work title)
- Type: preprint (OpenAlex work type)
- Language: en (Primary language)
- Publication year: 2025 (Year of publication)
- Publication date: 2025-02-11 (Full publication date if available)
- Authors: Jiaju Kang, Peide Han, Tian Zhang, L. Gong (List of authors in order)
- Landing page: https://arxiv.org/abs/2502.07853 (Publisher landing page)
- PDF URL: https://arxiv.org/pdf/2502.07853 (Direct link to full text PDF)
- Open access: Yes (Whether a free full text is available)
- OA status: green (Open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2502.07853 (Direct OA link when available)
- Concepts: Benchmark (surveying), Computer science, Geography, Cartography (Top concepts (fields/topics) attached by OpenAlex)
- Cited by: 0 (Total citation count in OpenAlex)
- Related works (count): 10 (Other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4407569433 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2502.07853 |
| ids.doi | https://doi.org/10.48550/arxiv.2502.07853 |
| ids.openalex | https://openalex.org/W4407569433 |
| fwci | |
| type | preprint |
| title | PolicySimEval: A Benchmark for Evaluating Policy Outcomes through Agent-Based Simulation |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T14509 |
| topics[0].field.id | https://openalex.org/fields/18 |
| topics[0].field.display_name | Decision Sciences |
| topics[0].score | 0.3151000142097473 |
| topics[0].domain.id | https://openalex.org/domains/2 |
| topics[0].domain.display_name | Social Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1803 |
| topics[0].subfield.display_name | Management Science and Operations Research |
| topics[0].display_name | demographic modeling and climate adaptation |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C185798385 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8645820617675781 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1161707 |
| concepts[0].display_name | Benchmark (surveying) |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.5243856310844421 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C205649164 |
| concepts[2].level | 0 |
| concepts[2].score | 0.07767564058303833 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q1071 |
| concepts[2].display_name | Geography |
| concepts[3].id | https://openalex.org/C58640448 |
| concepts[3].level | 1 |
| concepts[3].score | 0.05154135823249817 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q42515 |
| concepts[3].display_name | Cartography |
| keywords[0].id | https://openalex.org/keywords/benchmark |
| keywords[0].score | 0.8645820617675781 |
| keywords[0].display_name | Benchmark (surveying) |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.5243856310844421 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/geography |
| keywords[2].score | 0.07767564058303833 |
| keywords[2].display_name | Geography |
| keywords[3].id | https://openalex.org/keywords/cartography |
| keywords[3].score | 0.05154135823249817 |
| keywords[3].display_name | Cartography |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2502.07853 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2502.07853 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2502.07853 |
| locations[1].id | doi:10.48550/arxiv.2502.07853 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2502.07853 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5057685187 |
| authorships[0].author.orcid | https://orcid.org/0009-0009-8026-4332 |
| authorships[0].author.display_name | Jiaju Kang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Kang, Jiaju |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5045907811 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-1525-6477 |
| authorships[1].author.display_name | Peide Han |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Han, Puyu |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5100371729 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-1284-1232 |
| authorships[2].author.display_name | Tian Zhang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Zhang, Tian |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5107924169 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | L. Gong |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Gong, Luqi |
| authorships[3].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2502.07853 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | PolicySimEval: A Benchmark for Evaluating Policy Outcomes through Agent-Based Simulation |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T14509 |
| primary_topic.field.id | https://openalex.org/fields/18 |
| primary_topic.field.display_name | Decision Sciences |
| primary_topic.score | 0.3151000142097473 |
| primary_topic.domain.id | https://openalex.org/domains/2 |
| primary_topic.domain.display_name | Social Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1803 |
| primary_topic.subfield.display_name | Management Science and Operations Research |
| primary_topic.display_name | demographic modeling and climate adaptation |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2378211422, https://openalex.org/W4321353415, https://openalex.org/W2745001401, https://openalex.org/W2130974462, https://openalex.org/W2028665553, https://openalex.org/W2086519370, https://openalex.org/W4246352526 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2502.07853 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2502.07853 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2502.07853 |
| primary_location.id | pmh:oai:arXiv.org:2502.07853 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2502.07853 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2502.07853 |
| publication_date | 2025-02-11 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 10, 41 |
| abstract_inverted_index.20 | 97 |
| abstract_inverted_index.65 | 112 |
| abstract_inverted_index.To | 51 |
| abstract_inverted_index.by | 81 |
| abstract_inverted_index.in | 7, 69 |
| abstract_inverted_index.is | 88 |
| abstract_inverted_index.of | 4, 66, 90, 93, 119, 175 |
| abstract_inverted_index.on | 159, 163, 167 |
| abstract_inverted_index.to | 24, 44, 62, 75, 131, 145 |
| abstract_inverted_index.we | 55 |
| abstract_inverted_index.(1) | 96 |
| abstract_inverted_index.(2) | 111 |
| abstract_inverted_index.(3) | 127 |
| abstract_inverted_index.200 | 128 |
| abstract_inverted_index.Can | 14 |
| abstract_inverted_index.The | 86 |
| abstract_inverted_index.and | 19, 39, 47, 84, 126, 135, 165, 178, 184 |
| abstract_inverted_index.for | 187 |
| abstract_inverted_index.gap | 180 |
| abstract_inverted_index.the | 1, 34, 58, 64, 77, 151, 173, 176, 179, 185 |
| abstract_inverted_index.way | 43 |
| abstract_inverted_index.With | 0 |
| abstract_inverted_index.aims | 74 |
| abstract_inverted_index.only | 155 |
| abstract_inverted_index.rate | 158 |
| abstract_inverted_index.show | 139 |
| abstract_inverted_index.such | 15 |
| abstract_inverted_index.task | 177 |
| abstract_inverted_index.that | 100, 115, 140 |
| abstract_inverted_index.this | 29, 53 |
| abstract_inverted_index.with | 107, 150 |
| abstract_inverted_index.These | 170 |
| abstract_inverted_index.agent | 123 |
| abstract_inverted_index.could | 31 |
| abstract_inverted_index.faced | 80 |
| abstract_inverted_index.first | 59 |
| abstract_inverted_index.goal, | 54 |
| abstract_inverted_index.tasks | 130, 148 |
| abstract_inverted_index.these | 147 |
| abstract_inverted_index.three | 91 |
| abstract_inverted_index.(e.g., | 122 |
| abstract_inverted_index.14.5\% | 166 |
| abstract_inverted_index.24.5\% | 156 |
| abstract_inverted_index.enable | 132 |
| abstract_inverted_index.expert | 109 |
| abstract_inverted_index.inform | 25 |
| abstract_inverted_index.method | 136 |
| abstract_inverted_index.models | 6 |
| abstract_inverted_index.policy | 8, 26, 49, 70, 103, 189 |
| abstract_inverted_index.refine | 48 |
| abstract_inverted_index.social | 22, 82 |
| abstract_inverted_index.system | 153 |
| abstract_inverted_index.tackle | 146 |
| abstract_inverted_index.tasks. | 72, 169 |
| abstract_inverted_index.tasks: | 95 |
| abstract_inverted_index.15.04\% | 162 |
| abstract_inverted_index.address | 116 |
| abstract_inverted_index.advance | 52 |
| abstract_inverted_index.analyze | 20 |
| abstract_inverted_index.arises: | 13 |
| abstract_inverted_index.aspects | 118 |
| abstract_inverted_index.between | 181 |
| abstract_inverted_index.complex | 21 |
| abstract_inverted_index.current | 141, 182 |
| abstract_inverted_index.enhance | 33 |
| abstract_inverted_index.growing | 2 |
| abstract_inverted_index.reflect | 76 |
| abstract_inverted_index.results | 171 |
| abstract_inverted_index.systems | 16 |
| abstract_inverted_index.adoption | 3 |
| abstract_inverted_index.behavior | 124 |
| abstract_inverted_index.complete | 106 |
| abstract_inverted_index.composed | 89 |
| abstract_inverted_index.coverage | 157 |
| abstract_inverted_index.designed | 61 |
| abstract_inverted_index.evaluate | 63 |
| abstract_inverted_index.explore, | 46 |
| abstract_inverted_index.modeling | 104 |
| abstract_inverted_index.offering | 37 |
| abstract_inverted_index.pressing | 11 |
| abstract_inverted_index.process, | 36 |
| abstract_inverted_index.question | 12 |
| abstract_inverted_index.simulate | 18 |
| abstract_inverted_index.specific | 117 |
| abstract_inverted_index.struggle | 144 |
| abstract_inverted_index.targeted | 113 |
| abstract_inverted_index.achieving | 154 |
| abstract_inverted_index.annotated | 108 |
| abstract_inverted_index.benchmark | 60, 87 |
| abstract_inverted_index.challenge | 30 |
| abstract_inverted_index.highlight | 172 |
| abstract_inverted_index.introduce | 56 |
| abstract_inverted_index.outcomes. | 50 |
| abstract_inverted_index.replicate | 101 |
| abstract_inverted_index.scenarios | 23, 99 |
| abstract_inverted_index.sub-tasks | 114 |
| abstract_inverted_index.validate, | 45 |
| abstract_inverted_index.Addressing | 28 |
| abstract_inverted_index.assessment | 71 |
| abstract_inverted_index.capability | 65 |
| abstract_inverted_index.categories | 92 |
| abstract_inverted_index.decisions? | 27 |
| abstract_inverted_index.difficulty | 174 |
| abstract_inverted_index.end-to-end | 102 |
| abstract_inverted_index.evaluation | 94, 134 |
| abstract_inverted_index.frameworks | 143 |
| abstract_inverted_index.real-world | 78, 188 |
| abstract_inverted_index.scenarios, | 161 |
| abstract_inverted_index.scientists | 83 |
| abstract_inverted_index.simulation | 121 |
| abstract_inverted_index.solutions; | 110 |
| abstract_inverted_index.sub-tasks, | 164 |
| abstract_inverted_index.systematic | 42 |
| abstract_inverted_index.Experiments | 138 |
| abstract_inverted_index.agent-based | 5, 67, 120 |
| abstract_inverted_index.challenges, | 105 |
| abstract_inverted_index.effectively | 17 |
| abstract_inverted_index.evaluation, | 9 |
| abstract_inverted_index.evaluation. | 190 |
| abstract_inverted_index.large-scale | 133 |
| abstract_inverted_index.researchers | 38 |
| abstract_inverted_index.simulations | 68 |
| abstract_inverted_index.capabilities | 183 |
| abstract_inverted_index.complexities | 79 |
| abstract_inverted_index.development. | 137 |
| abstract_inverted_index.effectively, | 149 |
| abstract_inverted_index.requirements | 186 |
| abstract_inverted_index.PolicySimEval | 73 |
| abstract_inverted_index.calibration); | 125 |
| abstract_inverted_index.comprehensive | 98, 160 |
| abstract_inverted_index.policy-making | 35 |
| abstract_inverted_index.policymakers. | 85 |
| abstract_inverted_index.practitioners | 40 |
| abstract_inverted_index.significantly | 32 |
| abstract_inverted_index.PolicySimEval, | 57 |
| abstract_inverted_index.auto-generated | 129, 168 |
| abstract_inverted_index.state-of-the-art | 142 |
| abstract_inverted_index.highest-performing | 152 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| citation_normalized_percentile | |
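
The `abstract_inverted_index.*` rows in the payload above map each abstract token to the word positions where it occurs. As a minimal sketch of how the abstract text can be rebuilt from that structure, the Python below parses a handful of those rows and places each token at its positions. The helper names `parse_rows` and `rebuild_abstract`, and the `limit` parameter, are assumptions made for illustration and are not part of OpenAlex. Only the rows covering positions 0 through 9 are included, so the output is just the opening words of the abstract.

```python
PREFIX = "abstract_inverted_index."


def parse_rows(rows):
    """Parse '| abstract_inverted_index.<token> | p1, p2 |' rows into {token: [positions]}."""
    index = {}
    for row in rows:
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        key, value = cells[0], cells[1]
        if key.startswith(PREFIX):
            index[key[len(PREFIX):]] = [int(p) for p in value.split(",")]
    return index


def rebuild_abstract(index, limit=None):
    """Place each token at its positions and join them in order.

    `limit` (hypothetical convenience) keeps only positions below it,
    which is useful when the index is partial.
    """
    by_position = {}
    for token, positions in index.items():
        for pos in positions:
            if limit is None or pos < limit:
                by_position[pos] = token
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Rows copied from the payload table; only those needed for positions 0-9.
rows = [
    "| abstract_inverted_index.With | 0 |",
    "| abstract_inverted_index.the | 1, 34, 58, 64, 77, 151, 173, 176, 179, 185 |",
    "| abstract_inverted_index.growing | 2 |",
    "| abstract_inverted_index.adoption | 3 |",
    "| abstract_inverted_index.of | 4, 66, 90, 93, 119, 175 |",
    "| abstract_inverted_index.agent-based | 5, 67, 120 |",
    "| abstract_inverted_index.models | 6 |",
    "| abstract_inverted_index.in | 7, 69 |",
    "| abstract_inverted_index.policy | 8, 26, 49, 70, 103, 189 |",
    "| abstract_inverted_index.evaluation, | 9 |",
]

opening = rebuild_abstract(parse_rows(rows), limit=10)
print(opening)  # With the growing adoption of agent-based models in policy evaluation,
```

Passing all of the inverted-index rows and omitting `limit` should reproduce the full abstract, with punctuation preserved because it is stored as part of each token.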