Handbook - Evaluations

Datasets

Create schema-backed datasets whose rows contain an input, an expected output, an optional reason, a split, a curation status, and source evidence.

A dataset is a project-scoped, versioned set of examples. Every row follows one configured input shape and one configured expected-output shape. Fixing both shapes keeps evaluations repeatable and makes optimization evidence easier to trust.

Dataset Settings

Set these before adding many rows:

Setting                          Use
Input type                       Choose text or JSON. JSON values can be validated with JSON Schema.
Expected output type             Choose text or JSON. JSON expected outputs can be validated with JSON Schema.
JSON Schema                      Defines the allowed shape when input or expected output is JSON.
Default split                    Usually validation for manual rows and training for optimization examples.
Default curation status          Usually draft, needs_review, or ready depending on team process.
Default metric                   The metric suggested when creating evaluations from this dataset.
Extraction settings              Defines which trace/span data can be imported into this dataset.
Anonymization and PII treatment  Controls whether production-derived candidates are redacted or realistically anonymized before review.

Rows are edited as raw text or raw JSON. For JSON datasets, paste the JSON value and let CloudGrid validate it against the dataset schema. There is no visual JSON builder.
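
As a rough illustration of the paste-then-validate flow, the Python sketch below parses pasted text as JSON and checks the result against a schema. It is not CloudGrid's validator: check implements only a small subset of JSON Schema (type, required, properties, enum), and the schema, the intent field, and both function names are assumptions made for this example.

```python
import json

# Maps JSON Schema type names to Python types (a subset only).
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}

# Hypothetical expected-output schema for an intent classifier.
EXPECTED_OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["intent"],
    "properties": {
        "intent": {"type": "string", "enum": ["billing", "refund", "other"]},
    },
}


def check(value, schema, path="$"):
    """Return a list of error strings; an empty list means the value passed.

    Handles only `type`, `required`, `properties`, and `enum`. A `type`
    outside the table above raises KeyError.
    """
    errors = []
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(value, _TYPES[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("number", "integer") and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} is not one of {schema['enum']}")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required key {key!r}")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(check(value[key], sub_schema, f"{path}.{key}"))
    return errors


def validate_raw_row(raw_text, schema):
    """Parse pasted text as JSON, then check it against the schema."""
    try:
        value = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    return check(value, schema)
```

Returning a list of errors instead of raising on the first one lets an editor show every problem in a pasted value at once. A real project would normally rely on a full JSON Schema implementation rather than a hand-written subset like this.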

Row Fields

Every row can include:

Field            Use
Input            The value sent to the target during evaluation.
Expected output  The correct result.
Observed output  Optional production or candidate output that explains why the row exists.
Reason           Optional explanation for why the expected output is correct.
Split            training, validation, or test.
Curation status  draft, needs_expected, needs_review, ready, or rejected.
Source           Link back to trace/span/import evidence.
Metadata         Small row labels such as case id, difficulty, customer segment, or edge-case type.
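
One way to picture a row is as a small record that rejects unknown split and curation status values. The dataclass below is a sketch under assumptions: the class name, the snake_case attribute names, the defaults, and the JSON serialization are all hypothetical and do not describe CloudGrid's storage or import format. Only the field list and the allowed split and status values come from the table above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional

SPLITS = {"training", "validation", "test"}
CURATION_STATUSES = {"draft", "needs_expected", "needs_review", "ready", "rejected"}


@dataclass
class DatasetRow:
    """Hypothetical in-memory shape for one dataset row."""

    input: Any
    expected_output: Any = None      # may be missing while a row awaits curation
    observed_output: Any = None      # optional production or candidate output
    reason: Optional[str] = None     # why the expected output is correct
    split: str = "validation"
    curation_status: str = "draft"
    source: Optional[str] = None     # link back to trace/span/import evidence
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        # Fail early on values outside the documented sets.
        if self.split not in SPLITS:
            raise ValueError(f"unknown split: {self.split!r}")
        if self.curation_status not in CURATION_STATUSES:
            raise ValueError(f"unknown curation status: {self.curation_status!r}")

    def to_jsonl(self) -> str:
        """Serialize the row as a single JSON line."""
        return json.dumps(asdict(self), sort_keys=True)


row = DatasetRow(
    input={"text": "Where is my refund?"},
    expected_output={"intent": "refund"},
    reason="The customer asks about a refund, not about a charge.",
    split="test",
    curation_status="ready",
    metadata={"difficulty": "easy"},
)
```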

Use training for candidate generation, validation for iterative evaluation, and test for final confidence checks. Keep test rows out of optimization candidate generation.
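
The split rule above can be sketched as a lookup from purpose to split plus a guard that fails if a test row reaches candidate generation. The purpose names, the function names, and the dict-based row shape are assumptions made for illustration.

```python
# Hypothetical purpose names mapped to the split each one should read from.
PURPOSE_TO_SPLIT = {
    "candidate_generation": "training",
    "iterative_evaluation": "validation",
    "final_check": "test",
}


def rows_for(purpose, rows):
    """Return only the rows whose split matches the given purpose."""
    split = PURPOSE_TO_SPLIT[purpose]
    return [row for row in rows if row["split"] == split]


def assert_no_test_leak(candidate_rows):
    """Raise if any test row is about to be used for candidate generation."""
    leaked = [row for row in candidate_rows if row["split"] == "test"]
    if leaked:
        raise ValueError(f"{len(leaked)} test row(s) in candidate generation input")


rows = [
    {"id": "a", "split": "training"},
    {"id": "b", "split": "validation"},
    {"id": "c", "split": "test"},
]
candidates = rows_for("candidate_generation", rows)
assert_no_test_leak(candidates)
```

Running the guard at the boundary of the optimization step catches a leak even when rows arrive from somewhere other than rows_for.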

Trace-Derived Rows

Trace-derived rows are useful when production evidence shows a gap. A row can start as needs_expected or needs_review when the observed output is known but the correct expected output still needs curation.

Good trace-derived examples include:

  • a wrong category from a classifier;
  • a missing or malformed extraction field;
  • a tool call with a bad input;
  • a workflow that reached the right answer with too many steps.
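
To make the starting status concrete, the sketch below turns one span into a candidate row. Everything about the shapes is assumed: the span keys (trace_id, span_id, input, output), the row keys, and the function name are hypothetical. The choice of status is also an assumption for this example: needs_expected when nobody has proposed an expected output yet, needs_review when a proposal exists but has not been checked.

```python
def row_from_span(span, proposed_expected=None):
    """Build a candidate row from a span dict (hypothetical shape).

    The observed output is always kept so reviewers can see why the row exists.
    """
    has_proposal = proposed_expected is not None
    return {
        "input": span["input"],
        "observed_output": span["output"],
        "expected_output": proposed_expected,
        "curation_status": "needs_review" if has_proposal else "needs_expected",
        "source": {"trace_id": span["trace_id"], "span_id": span["span_id"]},
    }


# A classifier span that produced a wrong category.
span = {
    "trace_id": "trace-001",
    "span_id": "span-007",
    "input": {"text": "I was charged twice this month."},
    "output": {"intent": "other"},
}

uncurated = row_from_span(span)
proposed = row_from_span(span, proposed_expected={"intent": "billing"})
```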

Example Datasets

The repository includes small import-ready examples:

Example                           Use
test_data/ai_eval/classification  Support intent classification with JSON input and JSON expected output.
test_data/ai_eval/extraction      Order confirmation extraction with schema-validated JSON expected output.

Each folder contains a dataset-settings.json, a rows.jsonl, and a short README.
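
For inspecting one of these folders locally, a loader along the following lines may be enough. It assumes only what the file names suggest: dataset-settings.json holds one JSON document and rows.jsonl holds one JSON value per line. The function name is hypothetical, and the sketch makes no assumption about the keys inside either file.

```python
import json
from pathlib import Path


def load_example(folder):
    """Read dataset-settings.json and rows.jsonl from an example folder."""
    folder = Path(folder)
    settings = json.loads(
        (folder / "dataset-settings.json").read_text(encoding="utf-8")
    )
    rows = []
    with (folder / "rows.jsonl").open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between rows
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the line so a bad row is easy to find.
                raise ValueError(f"rows.jsonl line {line_number}: {exc.msg}") from exc
    return settings, rows
```

A call such as load_example("test_data/ai_eval/classification") returns the parsed settings and a list of rows.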
