Measure & Evaluate
Know exactly how your agents perform. Test systematically. Improve continuously.
How Evaluations Work
Test Datasets
Create sets of inputs and expected outputs. Run your agent against them to measure accuracy.
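As a rough illustration, a test dataset can be as simple as a list of input/expected-behavior pairs. The sketch below is an assumption for clarity, not the platform's schema (field names like `input` and `expected` are illustrative):

```python
# Minimal sketch of a test dataset: each case pairs an input with the
# behavior we expect. Field names here are illustrative only.
test_dataset = [
    {
        "input": "What is your refund policy?",
        "expected": "Mentions the 30-day refund window and links to the policy page.",
    },
    {
        "input": "Cancel my subscription",
        "expected": "Confirms the account, explains the cancellation steps, offers to help.",
    },
]

def run_dataset(agent, dataset):
    """Run the agent on every case and collect (case, response) pairs for scoring."""
    return [(case, agent(case["input"])) for case in dataset]
```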
Experiments
Compare different prompts, models, or settings side-by-side. See which performs best.
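Conceptually, an experiment is the same dataset run under two or more configurations so the aggregate scores can be compared. The configuration fields, model names, and `build_agent` factory below are assumptions for illustration:

```python
# Illustrative sketch: run the same test cases under two configurations
# (different system prompts here) and compare average scores per config.
configs = [
    {"name": "baseline", "model": "claude-sonnet-4-5",
     "system_prompt": "You are a support agent."},
    {"name": "concise", "model": "claude-sonnet-4-5",
     "system_prompt": "You are a support agent. Answer in two sentences."},
]

def run_experiment(build_agent, configs, dataset, score):
    """Score every config on the same dataset and return the average score per config."""
    results = {}
    for cfg in configs:
        agent = build_agent(cfg)  # hypothetical factory that returns a callable agent
        scores = [score(case, agent(case["input"])) for case in dataset]
        results[cfg["name"]] = sum(scores) / len(scores)
    return results
```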
AI Judges
Use Claude to evaluate agent responses. Score for accuracy, helpfulness, tone, and safety.
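Under the hood, an LLM judge is a prompt that asks the model to grade a response against your criteria and return scores. Here is a minimal sketch using the Anthropic Python SDK; the rubric wording, scoring scale, and JSON shape are assumptions, not the platform's built-in judge:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(question: str, response: str) -> dict:
    """Ask Claude to grade a response from 1-5 per criterion and return parsed JSON."""
    prompt = (
        "Grade the assistant response on accuracy, helpfulness, tone, and safety, "
        "each from 1 to 5. Reply with JSON only, e.g. "
        '{"accuracy": 5, "helpfulness": 4, "tone": 5, "safety": 5}.\n\n'
        f"Question: {question}\n\nResponse: {response}"
    )
    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(message.content[0].text)
```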
Track Over Time
See how changes affect performance. Set alerts when quality drops below thresholds.
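A threshold check can be as simple as comparing the latest run against a quality floor and the previous run. The 0.90 floor and the `history` shape below are arbitrary examples, not a built-in alerting API:

```python
# Illustrative sketch: flag a run whose pass rate falls below a floor or
# regresses relative to the previous run.
PASS_RATE_FLOOR = 0.90  # example threshold

def check_quality(history: list[float]) -> list[str]:
    """Return human-readable alerts for the most recent pass rate in `history`."""
    alerts = []
    latest = history[-1]
    if latest < PASS_RATE_FLOOR:
        alerts.append(f"Pass rate {latest:.0%} fell below the {PASS_RATE_FLOOR:.0%} threshold.")
    if len(history) > 1 and latest < history[-2]:
        alerts.append(f"Pass rate dropped from {history[-2]:.0%} to {latest:.0%} since the last run.")
    return alerts

print(check_quality([0.95, 0.93, 0.87]))
```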
Metrics You Get
Pass Rate
% of test cases that meet your criteria
Average Score
Weighted evaluation score across all tests
Per-Evaluator Breakdown
See scores by each evaluation criterion
Duration
How long each test run takes
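To make these definitions concrete, here is a rough sketch of how the numbers can be derived from per-case judge scores. The data shape, the 1-5 scale, and the evaluator weights are assumptions for illustration:

```python
from statistics import mean

# Illustrative per-case results: per-evaluator scores (1-5), a pass flag,
# and a duration in seconds. The shape is an assumption.
results = [
    {"scores": {"accuracy": 5, "helpfulness": 4}, "passed": True,  "duration_s": 2.1},
    {"scores": {"accuracy": 3, "helpfulness": 4}, "passed": False, "duration_s": 1.8},
    {"scores": {"accuracy": 5, "helpfulness": 5}, "passed": True,  "duration_s": 2.4},
]

weights = {"accuracy": 0.7, "helpfulness": 0.3}  # example evaluator weights

pass_rate = mean(r["passed"] for r in results)  # % of cases meeting criteria
average_score = mean(                           # weighted score across all tests
    sum(weights[name] * score for name, score in r["scores"].items())
    for r in results
)
per_evaluator = {                               # breakdown by evaluation criterion
    name: mean(r["scores"][name] for r in results)
    for name in results[0]["scores"]
}
total_duration = sum(r["duration_s"] for r in results)  # how long the run took

print(pass_rate, average_score, per_evaluator, total_duration)
```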
What AI Judges Evaluate
Configure your judge to score responses on the criteria that matter to you (a sample rubric sketch follows the list):
Accuracy - Is the response factually correct?
Helpfulness - Does it actually solve the user's problem?
Tone - Is it professional and appropriate?
Safety - Does it avoid harmful content?
Relevance - Does it stay on topic?
Completeness - Does it address all parts of the question?
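One way to express such a rubric is as plain configuration handed to the judge prompt. The structure and weights below are illustrative assumptions, not the platform's configuration format:

```python
# Illustrative rubric: each criterion gets a question for the judge and a
# weight used when combining scores. Names and weights are examples only.
rubric = {
    "accuracy":     {"question": "Is the response factually correct?",         "weight": 0.30},
    "helpfulness":  {"question": "Does it actually solve the user's problem?", "weight": 0.25},
    "tone":         {"question": "Is it professional and appropriate?",        "weight": 0.10},
    "safety":       {"question": "Does it avoid harmful content?",             "weight": 0.15},
    "relevance":    {"question": "Does it stay on topic?",                     "weight": 0.10},
    "completeness": {"question": "Does it address all parts of the question?", "weight": 0.10},
}

def combine(scores: dict) -> float:
    """Weighted overall score from per-criterion scores keyed by rubric name."""
    return sum(rubric[name]["weight"] * score for name, score in scores.items())
```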
Typical Evaluation Workflow
1. Create a test dataset with representative inputs and expected behaviors
2. Run an experiment - your agent processes each test case
3. The AI judge scores each response against your criteria
4. Review results, identify failures, improve your agent
5. Re-run to verify improvements
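Putting the steps together, the loop looks roughly like this. It is a self-contained sketch under assumptions: `agent` and `judge` are callables you supply, and the 4.0 pass threshold is an arbitrary example:

```python
# Illustrative end-to-end loop: run each case, score it with the judge,
# then split out failures for review before improving and re-running.
def evaluate(agent, dataset, judge, pass_threshold=4.0):
    report = []
    for case in dataset:
        response = agent(case["input"])          # step 2: agent processes each test case
        scores = judge(case["input"], response)  # step 3: judge scores against criteria
        avg = sum(scores.values()) / len(scores)
        report.append({"input": case["input"], "scores": scores, "passed": avg >= pass_threshold})
    failures = [r for r in report if not r["passed"]]  # step 4: review, improve, then re-run
    return report, failures
```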