Evaluating Agentic AI Systems: A Balanced Framework for Performance, Robustness, Safety and Beyond
Agentic artificial intelligence (AI), in the form of multi-agent systems that combine large language models with external tools and autonomous planning, is rapidly transitioning from research labs into high-stakes domains. Existing evaluations emphasise narrow technical metrics such as task success or latency, leaving important sociotechnical dimensions like human trust, ethical compliance and economic sustainability under-measured. We propose a balanced evaluation framework spanning five axes (capability & efficiency, robustness & adaptability, safety & ethics, human-centred interaction, and economics & sustainability) and introduce novel indicators including goal-drift scores and harm-reduction indices. Beyond synthesising prior work, we identify gaps in current benchmarks, develop a conceptual diagram to visualise interdependencies and outline experimental protocols for empirically validating the framework. Case studies from recent industry deployments illustrate that agentic AI can yield 20–60% productivity gains yet often omit assessments of fairness, trust and long-term sustainability. We argue that multidimensional evaluation, combining automated metrics with human-in-the-loop scoring and economic analysis, is essential for responsible adoption of agentic AI.
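The abstract does not publish formulas for the five-axis aggregate or the goal-drift score, so the following is only a minimal sketch of how such a multidimensional evaluation might be wired up. The axis names mirror the abstract; the drift definition, the equal default weights, and all function names are illustrative assumptions, not the authors' definitions.

```python
# Hypothetical sketch of a balanced, multi-axis evaluation.
# Axis names follow the abstract; everything else is assumed.

AXES = (
    "capability_efficiency",
    "robustness_adaptability",
    "safety_ethics",
    "human_centred_interaction",
    "economics_sustainability",
)


def goal_drift_score(planned_goals, executed_goals):
    """Illustrative goal-drift score: fraction of originally planned
    goals the agent dropped or replaced during execution
    (0.0 = no drift, 1.0 = complete drift)."""
    planned, executed = set(planned_goals), set(executed_goals)
    if not planned:
        return 0.0
    return len(planned - executed) / len(planned)


def balanced_score(axis_scores, weights=None):
    """Aggregate per-axis scores (each assumed in [0, 1]) into a single
    balanced score. Equal weights by default, so no single axis
    (e.g. raw task success) can dominate the evaluation."""
    if weights is None:
        weights = {axis: 1.0 for axis in AXES}
    total = sum(weights[axis] for axis in AXES)
    return sum(axis_scores[axis] * weights[axis] for axis in AXES) / total
```

In this sketch, human-in-the-loop judgments (trust, fairness) would feed the `human_centred_interaction` and `safety_ethics` entries of `axis_scores`, while automated benchmarks and cost accounting would populate the remaining axes.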
- Type: preprint
- Language: en
- Landing Page: https://doi.org/10.31224/5195
- Download: https://engrxiv.org/preprint/download/5195/8773/7304
- OA Status: gold
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4413658069