AWS releases Agent-EvalKit, an open-source toolkit for systematic AI agent evaluation

Published Jun 11, 2026

Amazon Web Services released Agent-EvalKit on June 11, 2026. The open-source toolkit, licensed under Apache 2.0, gives developers a structured pipeline for evaluating AI agents by examining complete execution paths rather than just final outputs. It integrates directly with AI coding assistants, including Claude Code, Kiro CLI, and Kilo Code, and targets teams building agents on frameworks such as Strands Agents SDK, LangGraph, and CrewAI.

What's new

"Agent-EvalKit is an open-source toolkit (Apache 2.0) that makes this evaluation infrastructure available by integrating with AI coding assistants," according to the AWS announcement. The workflow is developer-facing: describe evaluation goals in natural language, and the toolkit handles the pipeline from reading the agent's source code through running evaluations and generating a prioritized report.

The evaluation pipeline runs through six phases:

Plan — The coding assistant reads the agent's tool definitions, system prompt, and framework configuration to build a model of what the agent is supposed to do.
Data — Targeted test cases are generated based on the agent's design, including edge cases and failure modes relevant to that agent's role.
Trace — The agent is instrumented with OpenTelemetry-compatible tracing to capture full execution paths.
Run Agent — Tests execute with tracing active, collecting tool calls, intermediate states, and decision points.
Eval — A mix of code-based evaluators and LLM-as-judge approaches scores the execution traces against intended behavior.
Report — Findings surface as prioritized recommendations tied to specific code locations, not abstract scores.
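
To make the Trace and Run Agent phases concrete, here is a minimal sketch of what capturing tool calls during a run can look like. It uses only the Python standard library and an in-memory list in place of an OpenTelemetry exporter. The announcement does not show Agent-EvalKit's own instrumentation, so every name here (TRACE, traced_tool, lookup_order, run_agent) is hypothetical and stands in for whatever the toolkit actually generates.

```python
import functools
import time

# Hypothetical in-memory trace store. A real setup would export spans
# through an OpenTelemetry-compatible exporter instead.
TRACE = []


def traced_tool(func):
    """Record each tool call as a span-like dict: name, args, status, duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span = {"tool": func.__name__, "args": args, "kwargs": kwargs}
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            span["status"] = "ok"
            return result
        except Exception as exc:
            span["status"] = "error"
            span["error"] = repr(exc)
            raise
        finally:
            # Runs on both success and failure, so every call is recorded.
            span["duration_s"] = time.perf_counter() - start
            TRACE.append(span)
    return wrapper


@traced_tool
def lookup_order(order_id):
    # Hypothetical tool: raises KeyError on unknown ids.
    orders = {"A1": "shipped"}
    return orders[order_id]


def run_agent(order_id):
    # Toy agent: calls the tool and falls back to a default on failure.
    try:
        return lookup_order(order_id)
    except KeyError:
        return "unknown"
```

Calling run_agent("A1") and then run_agent("B2") leaves two entries in TRACE, one with status "ok" and one with status "error", even though both calls returned an answer. That record of the path is what the later Eval phase would score.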

The toolkit works through the developer's existing AI coding assistant rather than as a standalone platform — keeping the evaluation context close to the codebase.

Context

Evaluating AI agents is a known pain point in production deployments. Standard unit tests check outputs but not how an agent arrived at them. An agent can produce a correct final answer through a series of failed tool calls, unnecessary retries, or broken reasoning chains — patterns that only surface when you examine the full trace.
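
The gap between a correct answer and a clean path can be illustrated with a small code-based evaluator. This is a sketch under stated assumptions, not Agent-EvalKit's evaluator: the span format (dicts with "tool", "args", and "status" keys) and the function name evaluate_trace are invented for illustration.

```python
from collections import Counter


def evaluate_trace(spans, final_answer, expected_answer):
    """Score a run on its execution path, not only its final output.

    `spans` is a hypothetical trace format: a list of dicts with
    "tool" (str), "args" (tuple), and "status" ("ok" or "error").
    """
    failed = [s for s in spans if s["status"] == "error"]
    # Identical (tool, args) pairs seen more than once suggest retries.
    calls = Counter((s["tool"], s["args"]) for s in spans)
    repeated = {key: n for key, n in calls.items() if n > 1}
    return {
        "answer_correct": final_answer == expected_answer,
        "failed_tool_calls": len(failed),
        "repeated_calls": repeated,
        "clean_path": not failed and not repeated,
    }


# Example run: the answer is right, but the first search call failed
# and was retried with the same arguments.
spans = [
    {"tool": "search", "args": ("refund policy",), "status": "error"},
    {"tool": "search", "args": ("refund policy",), "status": "ok"},
    {"tool": "summarize", "args": ("doc-7",), "status": "ok"},
]
result = evaluate_trace(spans, "30 days", "30 days")
```

Here result["answer_correct"] is True while result["clean_path"] is False. An output-only unit test would pass this run, and the trace-level check is what exposes the failed call and the retry.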

AWS has been building out its agentic ecosystem through the Strands Agents SDK, Amazon Bedrock Agents, and a growing developer toolchain. Agent-EvalKit addresses a gap in that stack: the evaluation layer. Without systematic evaluation, teams ship agents they cannot rigorously verify.

This release follows a broader industry push toward principled agent evaluation. Research-level frameworks like AgentBench and GAIA have established benchmarking standards; Agent-EvalKit targets the practitioner use case — teams with agents in production who need actionable, code-level feedback.

Why it matters

As AI agents move from demos to production workloads, systematic evaluation becomes a prerequisite for enterprise adoption. A toolkit that generates test cases from the agent's own source code, instruments it with standard tracing, and returns recommendations tied to specific code locations lowers the barrier to running real evaluations without requiring custom infrastructure.

The Apache 2.0 license and the integration with major agentic frameworks (LangGraph, CrewAI, Strands) mean teams do not have to be locked into the AWS ecosystem to benefit. That practical choice should accelerate adoption across the broader developer community, regardless of which cloud teams build on.

Corroborating sources

Aws.amazon
https://aws.amazon.com/blogs/machine-learning/evaluate-ai-agents-systematically-with-agent-evalkit/
“Agent-EvalKit is an open-source toolkit (Apache 2.0) that makes this evaluation infrastructure available by integrating with AI coding assistants”
