Evaluations

Use collected CloudGrid telemetry to build datasets, run evaluations, compare candidates, optimize targets, and promote improvements.

CloudGrid evaluations help teams turn known examples and trace-backed production evidence into repeatable quality measurements for AI workflows.

The workflow is:

1. Prepare a dataset.
2. Run an evaluation against a target.
3. Inspect aggregate and per-row results.
4. Compare a baseline and candidate.
5. Optimize prompts or examples.
6. Promote the candidate that has validation evidence.
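
The workflow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not CloudGrid's API: `Row`, `RowResult`, `evaluate`, `accuracy`, `compare`, the exact-match scoring, and the two toy targets are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Row:
    """One dataset example (hypothetical shape, not a CloudGrid schema)."""
    input: str
    expected: str


@dataclass
class RowResult:
    """Per-row outcome of running a target on one example."""
    row: Row
    output: str
    passed: bool


def evaluate(target: Callable[[str], str], dataset: List[Row]) -> List[RowResult]:
    """Run the target on every row and score it by exact match."""
    results = []
    for row in dataset:
        output = target(row.input)
        results.append(RowResult(row, output, output == row.expected))
    return results


def accuracy(results: List[RowResult]) -> float:
    """Aggregate metric: fraction of rows that passed."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def compare(baseline: List[RowResult], candidate: List[RowResult]) -> Dict[str, object]:
    """Compare two runs over the same dataset, row by row."""
    pairs = list(zip(baseline, candidate))
    return {
        "baseline_accuracy": accuracy(baseline),
        "candidate_accuracy": accuracy(candidate),
        "fixed": [c.row.input for b, c in pairs if not b.passed and c.passed],
        "regressed": [c.row.input for b, c in pairs if b.passed and not c.passed],
    }


# Prepare a dataset (assumed example rows for a ticket classifier).
dataset = [
    Row("refund my order", "billing"),
    Row("app crashes on login", "technical"),
    Row("change my email", "account"),
]


# Toy targets standing in for the prompt or model under test.
def baseline_target(text: str) -> str:
    return "billing" if "refund" in text else "technical"


def candidate_target(text: str) -> str:
    if "refund" in text:
        return "billing"
    if "email" in text:
        return "account"
    return "technical"


# Run an evaluation against each target, then compare baseline and candidate.
baseline_results = evaluate(baseline_target, dataset)
candidate_results = evaluate(candidate_target, dataset)
report = compare(baseline_results, candidate_results)
print(report)
```

In this sketch the per-row lists (`fixed`, `regressed`) are the evidence a reviewer would inspect before promoting the candidate; the aggregate accuracy alone does not show which rows changed.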

Evaluation produces metrics, results, examples, and comparisons. It does not create a release gate by itself.
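
Because an evaluation only produces evidence, any release gate is a separate decision a team layers on top of it. One way such a check might look, assuming a hypothetical comparison summary with `baseline_score`, `candidate_score`, and `regressions` fields (the names and thresholds are illustrative, not CloudGrid output):

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    """Hypothetical summary of a baseline-vs-candidate comparison."""
    baseline_score: float
    candidate_score: float
    regressions: int  # rows that passed on baseline but fail on candidate


def passes_gate(c: Comparison, min_gain: float = 0.02, max_regressions: int = 0) -> bool:
    """Team-defined policy: require a minimum gain and cap regressions."""
    gain = c.candidate_score - c.baseline_score
    return gain >= min_gain and c.regressions <= max_regressions


print(passes_gate(Comparison(0.80, 0.85, 0)))
print(passes_gate(Comparison(0.80, 0.85, 2)))
```

The point of keeping the policy in its own function is that the thresholds are a team decision, independent of how the scores were produced.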

When To Use Collected Telemetry

Use collected telemetry when a dataset row should document behavior that was actually observed:

a trace shows a wrong classification;
a span output is missing a JSON field;
a tool call took the workflow down the wrong path;
a model or tool sequence used too many steps;
a production output is useful, but the correct expected output still needs human review.

CloudGrid keeps source links and bounded trajectory summaries with evaluation results so the team can review evidence without copying full traces into a dataset.
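
To make the idea of a source link plus a bounded trajectory summary concrete, here is a minimal sketch of what a telemetry-backed row could look like. The field names (`trace_id`, `span_id`, `observed_output`, `expected_output`), the `summarize_trajectory` helper, and its limits are assumptions for illustration, not CloudGrid's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SourceLink:
    """Pointer back to the originating telemetry (hypothetical fields)."""
    trace_id: str
    span_id: Optional[str] = None


@dataclass
class TelemetryRow:
    """A dataset row that references a trace instead of copying it."""
    input: str
    observed_output: str
    expected_output: Optional[str]  # None until a human reviewer supplies it
    source: SourceLink
    trajectory_summary: List[str]


def summarize_trajectory(steps: List[str], max_steps: int = 5, max_chars: int = 80) -> List[str]:
    """Keep the first max_steps steps, truncate long ones, and note what was omitted."""
    summary = [
        s if len(s) <= max_chars else s[: max_chars - 3] + "..."
        for s in steps[:max_steps]
    ]
    omitted = len(steps) - max_steps
    if omitted > 0:
        summary.append(f"(+{omitted} more steps in the source trace)")
    return summary


# Assumed example: an eight-step trace reduced to a three-step summary.
steps = [f"tool_call:{i}" for i in range(8)]
row = TelemetryRow(
    input="classify ticket 123",
    observed_output="technical",
    expected_output=None,  # still awaiting human review
    source=SourceLink(trace_id="trace-abc", span_id="span-7"),
    trajectory_summary=summarize_trajectory(steps, max_steps=3),
)
print(row.trajectory_summary)
```

Leaving `expected_output` empty models the case where a production output is useful but the correct answer still needs human review; the full trace stays reachable through `source`.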

The Evaluation Loop

Where To Go Next

Goal	Page
Create and manage reusable examples	Datasets
Run and inspect evaluation results	Evaluations
Improve a target with evidence	Optimizations

Enable AI Eval in project settings before using these workflows. The AI Eval workspace then appears in the selected-project sidebar.

Last updated 2026-05-25.