Evaluating Agentic AI Systems: A Balanced Framework for Performance, Robustness, Safety and Beyond
Agentic artificial intelligence (AI)—multi-agent systems that combine large language models with external tools and autonomous planning—are rapidly transitioning from research labs into high-stakes domains. Existing evaluations emphasise narrow technical metrics such as task success or latency, leaving important sociotechnical dimensions like human trust, ethical compliance and economic sustainability under-measured. We propose a balanced evaluation framework spanning five axes (capability & efficiency, robustness & adaptability, safety & ethics, human-centred interaction and economic & sustainability) and introduce novel indicators including goal-drift scores and harm-reduction indices. Beyond synthesising prior work, we identify gaps in current benchmarks, develop a conceptual diagram to visualise interdependencies and outline experimental protocols for empirically validating the framework. Case studies from recent industry deployments illustrate that agentic AI can yield 20–60% productivity gains yet often omit assessments of fairness, trust and long-term sustainability. We argue that multidimensional evaluation—combining automated metrics with human-in-the-loop scoring and economic analysis—is essential for responsible adoption of agentic AI.
- Type: preprint
- Language: en
- Landing Page: https://doi.org/10.31224/5195, https://engrxiv.org/preprint/download/5195/8773/7304
- OA Status: gold
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4413658069
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4413658069 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.31224/5195 (Digital Object Identifier)
- Title: Evaluating Agentic AI Systems: A Balanced Framework for Performance, Robustness, Safety and Beyond (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-08-26 (full publication date if available)
- Authors: Manish Shukla (list of authors in order)
- Landing page: https://doi.org/10.31224/5195 (publisher landing page)
- PDF URL: https://engrxiv.org/preprint/download/5195/8773/7304 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: gold (open access status per OpenAlex)
- OA URL: https://engrxiv.org/preprint/download/5195/8773/7304 (direct OA link when available)
- Concepts: Robustness (evolution), Computer science, Risk analysis (engineering), Business, Biochemistry, Chemistry, Gene (top concepts, i.e. fields or topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4413658069 |
|---|---|
| doi | https://doi.org/10.31224/5195 |
| ids.doi | https://doi.org/10.31224/5195 |
| ids.openalex | https://openalex.org/W4413658069 |
| fwci | 0.0 |
| type | preprint |
| title | Evaluating Agentic AI Systems: A Balanced Framework for Performance, Robustness, Safety and Beyond |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10876 |
| topics[0].field.id | https://openalex.org/fields/22 |
| topics[0].field.display_name | Engineering |
| topics[0].score | 0.5967000126838684 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/2207 |
| topics[0].subfield.display_name | Control and Systems Engineering |
| topics[0].display_name | Fault Detection and Control Systems |
| topics[1].id | https://openalex.org/T10906 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.5687999725341797 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | AI-based Problem Solving and Planning |
| topics[2].id | https://openalex.org/T12761 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.4968999922275543 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Data Stream Mining Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C63479239 |
| concepts[0].level | 3 |
| concepts[0].score | 0.6866177320480347 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q7353546 |
| concepts[0].display_name | Robustness (evolution) |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.5356684923171997 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C112930515 |
| concepts[2].level | 1 |
| concepts[2].score | 0.34804418683052063 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q4389547 |
| concepts[2].display_name | Risk analysis (engineering) |
| concepts[3].id | https://openalex.org/C144133560 |
| concepts[3].level | 0 |
| concepts[3].score | 0.25113028287887573 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q4830453 |
| concepts[3].display_name | Business |
| concepts[4].id | https://openalex.org/C55493867 |
| concepts[4].level | 1 |
| concepts[4].score | 0.0 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q7094 |
| concepts[4].display_name | Biochemistry |
| concepts[5].id | https://openalex.org/C185592680 |
| concepts[5].level | 0 |
| concepts[5].score | 0.0 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q2329 |
| concepts[5].display_name | Chemistry |
| concepts[6].id | https://openalex.org/C104317684 |
| concepts[6].level | 2 |
| concepts[6].score | 0.0 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q7187 |
| concepts[6].display_name | Gene |
| keywords[0].id | https://openalex.org/keywords/robustness |
| keywords[0].score | 0.6866177320480347 |
| keywords[0].display_name | Robustness (evolution) |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.5356684923171997 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/risk-analysis |
| keywords[2].score | 0.34804418683052063 |
| keywords[2].display_name | Risk analysis (engineering) |
| keywords[3].id | https://openalex.org/keywords/business |
| keywords[3].score | 0.25113028287887573 |
| keywords[3].display_name | Business |
| language | en |
| locations[0].id | doi:10.31224/5195 |
| locations[0].is_oa | True |
| locations[0].source | |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://engrxiv.org/preprint/download/5195/8773/7304 |
| locations[0].version | acceptedVersion |
| locations[0].raw_type | posted-content |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | True |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | https://doi.org/10.31224/5195 |
| indexed_in | crossref |
| authorships[0].author.id | https://openalex.org/A5101737653 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-4867-3530 |
| authorships[0].author.display_name | Manish Shukla |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Manish Shukla |
| authorships[0].is_corresponding | True |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://engrxiv.org/preprint/download/5195/8773/7304 |
| open_access.oa_status | gold |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Evaluating Agentic AI Systems: A Balanced Framework for Performance, Robustness, Safety and Beyond |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-06T03:46:38.306776 |
| primary_topic.id | https://openalex.org/T10876 |
| primary_topic.field.id | https://openalex.org/fields/22 |
| primary_topic.field.display_name | Engineering |
| primary_topic.score | 0.5967000126838684 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/2207 |
| primary_topic.subfield.display_name | Control and Systems Engineering |
| primary_topic.display_name | Fault Detection and Control Systems |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W4391913857, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052 |
| cited_by_count | 0 |
| locations_count | 1 |
| best_oa_location.id | doi:10.31224/5195 |
| best_oa_location.is_oa | True |
| best_oa_location.source | |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://engrxiv.org/preprint/download/5195/8773/7304 |
| best_oa_location.version | acceptedVersion |
| best_oa_location.raw_type | posted-content |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | True |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | https://doi.org/10.31224/5195 |
| primary_location.id | doi:10.31224/5195 |
| primary_location.is_oa | True |
| primary_location.source | |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://engrxiv.org/preprint/download/5195/8773/7304 |
| primary_location.version | acceptedVersion |
| primary_location.raw_type | posted-content |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | True |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | https://doi.org/10.31224/5195 |
| publication_date | 2025-08-26 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.% | 115 |
| abstract_inverted_index.a | 51, 87 |
| abstract_inverted_index.AI | 111 |
| abstract_inverted_index.We | 49, 128 |
| abstract_inverted_index.as | 31 |
| abstract_inverted_index.in | 83 |
| abstract_inverted_index.of | 122, 145 |
| abstract_inverted_index.or | 34 |
| abstract_inverted_index.to | 90 |
| abstract_inverted_index.we | 80 |
| abstract_inverted_index.AI. | 147 |
| abstract_inverted_index.and | 13, 45, 64, 66, 73, 93, 125, 138 |
| abstract_inverted_index.can | 112 |
| abstract_inverted_index.for | 97, 142 |
| abstract_inverted_index.the | 100 |
| abstract_inverted_index.yet | 118 |
| abstract_inverted_index.Case | 102 |
| abstract_inverted_index.axes | 57 |
| abstract_inverted_index.five | 56 |
| abstract_inverted_index.from | 18, 104 |
| abstract_inverted_index.gaps | 82 |
| abstract_inverted_index.into | 21 |
| abstract_inverted_index.labs | 20 |
| abstract_inverted_index.like | 40 |
| abstract_inverted_index.omit | 120 |
| abstract_inverted_index.such | 30 |
| abstract_inverted_index.task | 32 |
| abstract_inverted_index.that | 5, 109, 130 |
| abstract_inverted_index.with | 10, 135 |
| abstract_inverted_index.argue | 129 |
| abstract_inverted_index.gains | 117 |
| abstract_inverted_index.human | 41 |
| abstract_inverted_index.large | 7 |
| abstract_inverted_index.novel | 68 |
| abstract_inverted_index.often | 119 |
| abstract_inverted_index.prior | 78 |
| abstract_inverted_index.tools | 12 |
| abstract_inverted_index.trust | 124 |
| abstract_inverted_index.work, | 79 |
| abstract_inverted_index.yield | 113 |
| abstract_inverted_index.Beyond | 76 |
| abstract_inverted_index.models | 9 |
| abstract_inverted_index.narrow | 27 |
| abstract_inverted_index.recent | 105 |
| abstract_inverted_index.scores | 72 |
| abstract_inverted_index.trust, | 42 |
| abstract_inverted_index.20–60 | 114 |
| abstract_inverted_index.Agentic | 0 |
| abstract_inverted_index.agentic | 110, 146 |
| abstract_inverted_index.combine | 6 |
| abstract_inverted_index.current | 84 |
| abstract_inverted_index.develop | 86 |
| abstract_inverted_index.diagram | 89 |
| abstract_inverted_index.ethical | 43 |
| abstract_inverted_index.leaving | 36 |
| abstract_inverted_index.metrics | 29, 134 |
| abstract_inverted_index.outline | 94 |
| abstract_inverted_index.propose | 50 |
| abstract_inverted_index.rapidly | 16 |
| abstract_inverted_index.scoring | 137 |
| abstract_inverted_index.studies | 103 |
| abstract_inverted_index.success | 33 |
| abstract_inverted_index.systems | 4 |
| abstract_inverted_index.Existing | 24 |
| abstract_inverted_index.adoption | 144 |
| abstract_inverted_index.balanced | 52 |
| abstract_inverted_index.domains. | 23 |
| abstract_inverted_index.economic | 46, 139 |
| abstract_inverted_index.external | 11 |
| abstract_inverted_index.identify | 81 |
| abstract_inverted_index.indices. | 75 |
| abstract_inverted_index.industry | 106 |
| abstract_inverted_index.language | 8 |
| abstract_inverted_index.latency, | 35 |
| abstract_inverted_index.research | 19 |
| abstract_inverted_index.spanning | 55 |
| abstract_inverted_index.automated | 133 |
| abstract_inverted_index.emphasise | 26 |
| abstract_inverted_index.essential | 141 |
| abstract_inverted_index.fairness, | 123 |
| abstract_inverted_index.framework | 54 |
| abstract_inverted_index.important | 37 |
| abstract_inverted_index.including | 70 |
| abstract_inverted_index.introduce | 67 |
| abstract_inverted_index.long-term | 126 |
| abstract_inverted_index.protocols | 96 |
| abstract_inverted_index.technical | 28 |
| abstract_inverted_index.visualise | 91 |
| abstract_inverted_index.artificial | 1 |
| abstract_inverted_index.autonomous | 14 |
| abstract_inverted_index.compliance | 44 |
| abstract_inverted_index.conceptual | 88 |
| abstract_inverted_index.dimensions | 39 |
| abstract_inverted_index.evaluation | 53 |
| abstract_inverted_index.framework. | 101 |
| abstract_inverted_index.goal-drift | 71 |
| abstract_inverted_index.illustrate | 108 |
| abstract_inverted_index.indicators | 69 |
| abstract_inverted_index.validating | 99 |
| abstract_inverted_index.assessments | 121 |
| abstract_inverted_index.benchmarks, | 85 |
| abstract_inverted_index.deployments | 107 |
| abstract_inverted_index.empirically | 98 |
| abstract_inverted_index.evaluations | 25 |
| abstract_inverted_index.high-stakes | 22 |
| abstract_inverted_index.interaction | 63 |
| abstract_inverted_index.responsible | 143 |
| abstract_inverted_index.experimental | 95 |
| abstract_inverted_index.intelligence | 2 |
| abstract_inverted_index.productivity | 116 |
| abstract_inverted_index.safety&ethics, | 61 |
| abstract_inverted_index.synthesising | 77 |
| abstract_inverted_index.adaptability, | 60 |
| abstract_inverted_index.analysis—is | 140 |
| abstract_inverted_index.human-centred | 62 |
| abstract_inverted_index.transitioning | 17 |
| abstract_inverted_index.harm-reduction | 74 |
| abstract_inverted_index.planning—are | 15 |
| abstract_inverted_index.sociotechnical | 38 |
| abstract_inverted_index.sustainability | 47 |
| abstract_inverted_index.robustness& | 59 |
| abstract_inverted_index.sustainability. | 127 |
| abstract_inverted_index.under-measured. | 48 |
| abstract_inverted_index.multidimensional | 131 |
| abstract_inverted_index.human-in-the-loop | 136 |
| abstract_inverted_index.interdependencies | 92 |
| abstract_inverted_index.(AI)—multi-agent | 3 |
| abstract_inverted_index.evaluation—combining | 132 |
| abstract_inverted_index.(capability&efficiency, | 58 |
| abstract_inverted_index.economic&sustainability) | 65 |
| cited_by_percentile_year | |
| corresponding_author_ids | https://openalex.org/A5101737653 |
| countries_distinct_count | 0 |
| institutions_distinct_count | 1 |
| citation_normalized_percentile.value | 0.47220362 |
| citation_normalized_percentile.is_in_top_1_percent | False |
| citation_normalized_percentile.is_in_top_10_percent | False |
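
The `abstract_inverted_index.*` rows in the table above store the abstract as a mapping from each token to the word positions where it occurs. As a minimal sketch (the function name `rebuild_abstract` and the five-token sample are illustrative choices, not part of the OpenAlex record or of any library), the running text can be recovered by inverting that mapping and sorting by position:

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild running text from a token -> positions mapping."""
    # Invert the mapping: each position holds exactly one token.
    by_position: dict[int, str] = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Five entries copied from the table (positions 24-28 of the abstract).
sample = {
    "Existing": [24],
    "evaluations": [25],
    "emphasise": [26],
    "narrow": [27],
    "technical": [28],
}

print(rebuild_abstract(sample))  # Existing evaluations emphasise narrow technical
```

Passing the full set of rows instead of the sample should yield the complete abstract, with punctuation attached to tokens exactly as it is stored in the index.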