Evaluations
Use collected CloudGrid telemetry to build datasets, run evaluations, compare candidates, optimize targets, and promote improvements.
CloudGrid evaluations help teams turn known examples and trace-backed production evidence into repeatable quality measurements for AI workflows.
The workflow is:
- Prepare a dataset.
- Run an evaluation against a target.
- Inspect aggregate and per-row results.
- Compare a baseline and candidate.
- Optimize prompts or examples.
- Promote the candidate that has validation evidence.
An evaluation produces metrics, results, examples, and comparisons. On its own, it does not create a release gate.
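The run, inspect, and compare steps above can be sketched in plain Python. This is a minimal illustration, not CloudGrid's API: `Row`, `EvalResult`, `evaluate`, `compare`, the exact-match scoring, and the two toy targets are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical types for illustration; these are not CloudGrid API names.
@dataclass
class Row:
    input: str
    expected: str


@dataclass
class EvalResult:
    accuracy: float       # aggregate metric
    per_row: list         # per-row pass/fail flags, in dataset order


def evaluate(target: Callable[[str], str], dataset: list) -> EvalResult:
    """Run the target on every row and score it by exact match (an assumed metric)."""
    per_row = [target(row.input) == row.expected for row in dataset]
    accuracy = sum(per_row) / len(per_row) if per_row else 0.0
    return EvalResult(accuracy=accuracy, per_row=per_row)


def compare(baseline: EvalResult, candidate: EvalResult) -> dict:
    """Report the aggregate delta plus which rows were fixed or regressed."""
    pairs = list(zip(baseline.per_row, candidate.per_row))
    return {
        "delta": candidate.accuracy - baseline.accuracy,
        "fixed": [i for i, (b, c) in enumerate(pairs) if c and not b],
        "regressed": [i for i, (b, c) in enumerate(pairs) if b and not c],
    }


# A tiny made-up dataset and two made-up targets.
dataset = [
    Row("refund please", "billing"),
    Row("app crashes", "bug"),
    Row("love it", "praise"),
]


def baseline_target(text: str) -> str:
    return "billing" if "refund" in text else "bug"


def candidate_target(text: str) -> str:
    if "refund" in text:
        return "billing"
    if "crash" in text:
        return "bug"
    return "praise"


baseline = evaluate(baseline_target, dataset)
candidate = evaluate(candidate_target, dataset)
report = compare(baseline, candidate)
print(report)
```

The sketch only returns a comparison report. Whether to promote the candidate stays a separate decision, which matches the point that an evaluation is not a release gate by itself.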
When To Use Collected Telemetry
Use collected telemetry when a dataset row should capture behavior that was actually observed, for example:
- a trace shows a wrong classification;
- a span output is missing a JSON field;
- a tool call took the workflow down the wrong path;
- a model or tool sequence used too many steps;
- a production output is useful, but the correct expected output still needs human review.
CloudGrid keeps source links and bounded trajectory summaries with evaluation results so the team can review evidence without copying full traces into a dataset.
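One way to picture a row that carries a source link and a bounded trajectory summary instead of a full trace copy is sketched below. The field names, the `summarize_trajectory` helper, the step limit, and the trace ID are assumptions for illustration only and do not reflect CloudGrid's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical row shape; the field names are illustrative, not CloudGrid's schema.
@dataclass
class EvidenceRow:
    input: str
    expected: Optional[str]        # None until a human reviewer confirms it
    source_trace_id: str           # link back to the trace instead of copying it
    trajectory_summary: list = field(default_factory=list)


def summarize_trajectory(steps: list, max_steps: int = 3) -> list:
    """Keep at most max_steps step names and note how many were omitted."""
    if len(steps) <= max_steps:
        return list(steps)
    omitted = len(steps) - max_steps
    return steps[:max_steps] + [f"(+{omitted} more steps)"]


# A made-up trace in which the workflow repeated a tool call too many times.
trace_steps = ["classify", "lookup_order", "lookup_order", "lookup_order", "respond"]

row = EvidenceRow(
    input="Where is my order?",
    expected=None,                      # still needs human review
    source_trace_id="trace-example-001",  # placeholder ID
    trajectory_summary=summarize_trajectory(trace_steps),
)
print(row.trajectory_summary)
```

Bounding the summary keeps each row small while the source link still lets a reviewer open the original trace when more detail is needed.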
The Evaluation Loop
Where To Go Next
| Goal | Page |
|---|---|
| Create and manage reusable examples | Datasets |
| Run and inspect evaluation results | Evaluations |
| Improve a target with evidence | Optimizations |
Enable AI Eval in project settings before using these workflows. The AI Eval workspace then appears in the selected-project sidebar.