
Production Readiness

CloudGrid production readiness depends on verified release artifacts, hardened deployed configuration, and deployment-specific benchmark evidence.

CloudGrid has implemented the main product surfaces for local and deployed-mode evaluation. Treat this page as the operator readiness map before exposing a shared CloudGrid environment.

Implemented Surfaces

The current implementation readiness file and repository artifacts show these user-visible surfaces as implemented:

  • OTLP trace, log, and metric ingest over HTTP and gRPC;
  • metric query, metric explorer, and dashboard widget surfaces;
  • live trace subscriptions through GraphQL, the BFF, storage-read live sessions, and storage-write post-persist notifications;
  • company, project, membership, invitation, and SMTP invitation email control-plane flows;
  • project retention policy CRUD in contracts, control-plane, BFF GraphQL, and project settings UI;
  • project alert rule, silence, and history CRUD in contracts, control-plane, BFF GraphQL, and alert management UI;
  • local Docker Compose infrastructure for NATS and SurrealDB;
  • Helm chart and release workflow definitions with static release-artifact validation;
  • root verification scripts and GitHub Actions verification for pull requests and pushes to main.

Production Completion Packages

Do not present CloudGrid as a complete public production distribution until these packages have visible repository artifacts:

| Area | Current status |
| --- | --- |
| Release artifacts | Release workflow and Dockerfiles are present; signed images, image provenance, release manifest, SBOM output, and vulnerability reports are produced when the release workflow runs. |
| Kubernetes | Helm chart and profile overlays are present; operators still need environment-specific values, secrets, ingress/TLS, and published image digests. |
| Retention execution | Retention policy CRUD, storage-maintenance batch execution, disabled-by-default scheduling, and the SurrealDB deletion adapter are implemented; enable the scheduler per environment and run the opt-in SurrealDB retention suite before relying on deletion in production. |
| Alert execution | Alert rule/silence/history CRUD, evaluator runtime, project discovery, email/webhook adapter runtime, adapter catalog validation, and dashboard alert widgets are implemented; deployments still need concrete SMTP/webhook environment values. |
| Production scale | The performance and scaling spec defines targets and variables; opt-in local and production-like benchmark scripts are present, but each deployment still needs its own recorded benchmark run before being declared production-ready. |
| Auth hardening | Deployed-mode BFF HTTP, WebSocket, app-shell, collector, storage-read, storage-write, and control-plane authorization boundaries have acceptance coverage; operators still need configured SSO providers and secrets. |
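
The Kubernetes package calls for published image digests. As a rough sketch of what "digest-pinned" can mean in practice, the check below flags image references that do not end in a `sha256` digest. The function name, registry host, and image references are hypothetical; only the `@sha256:<64 hex characters>` reference form is standard OCI syntax.

```python
import re

# A digest-pinned OCI reference ends with "@sha256:" plus 64 lowercase hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_images(images: dict[str, str]) -> list[str]:
    """Return the names of services whose image reference is not digest-pinned."""
    return sorted(name for name, ref in images.items() if not DIGEST_RE.search(ref))


# Hypothetical image references, for illustration only.
images = {
    "cloudgrid-bff": "registry.example.com/cloudgrid-bff@sha256:" + "a" * 64,
    "cloudgrid-otlp-collector": "registry.example.com/cloudgrid-otlp-collector:1.2.3",
}

print(unpinned_images(images))
```

A tag such as `:1.2.3` can be re-pointed at a different image after release, which is why this sketch treats tag-only references as unpinned.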

Deployment Boundary

Diagram (deployment boundary): Browser UI; cloudgrid-bff (public ingress); OTLP emitters; cloudgrid-otlp-collector (public ingress); NATS request/reply (private); NATS JetStream (private); cloudgrid-storage-read (private); cloudgrid-control-plane (private); cloudgrid-storage-write (private); SurrealDB (private project DBs).

Only the BFF and OTLP collector are public ingress candidates. NATS and SurrealDB stay private. SurrealDB credentials belong only in storage-read, storage-write, control-plane, and storage-maintenance service environments.
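
The credential boundary above can be checked mechanically before a rollout. The following is a minimal sketch, assuming each service's environment is available as a plain mapping and that SurrealDB credential variables share a common prefix. The `SURREALDB_` prefix, the variable names, and the `credential_leaks` helper are hypothetical and not taken from CloudGrid itself.

```python
# Services that this page allows to hold SurrealDB credentials.
ALLOWED_SERVICES = {
    "storage-read",
    "storage-write",
    "control-plane",
    "storage-maintenance",
}

# Assumption: credential variables share this prefix; the real names may differ.
CREDENTIAL_PREFIX = "SURREALDB_"


def credential_leaks(service_envs: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each disallowed service to the SurrealDB credential variables it carries."""
    leaks: dict[str, list[str]] = {}
    for service, env in service_envs.items():
        if service in ALLOWED_SERVICES:
            continue
        found = sorted(key for key in env if key.startswith(CREDENTIAL_PREFIX))
        if found:
            leaks[service] = found
    return leaks


# Hypothetical environments, for illustration only.
service_envs = {
    "bff": {"SURREALDB_PASSWORD": "example", "CLOUDGRID_AUTH_MODE": "sso"},
    "storage-read": {"SURREALDB_PASSWORD": "example"},
}

print(credential_leaks(service_envs))
```

An empty result means no service outside the four listed ones carries a variable with the assumed prefix; it does not prove the credentials are absent under other names.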

Production Boundary Checklist

  • BFF and OTLP collector are the only public ingress candidates.
  • Use CLOUDGRID_DEPLOYMENT_MODE=deployed and CLOUDGRID_AUTH_MODE=sso.
  • Configure a real SSO provider and a strong CLOUDGRID_SESSION_SECRET.
  • Configure a stable CLOUDGRID_PROVIDER_SECRET_ENCRYPTION_KEY before allowing managed AI provider API keys in deployed mode.
  • Install production Kubernetes deployments with the versioned Helm chart and digest-pinned service images.
  • Verify release-manifest.json, release-values.yaml, checksums, signatures, SBOMs, scan reports, image signatures, and image digests before promotion.
  • Configure SMTP invitation delivery for deployed SSO onboarding, or explicitly disable delivery and notify recipients manually.
  • Keep project API keys in a secret manager and send them only as bearer credentials from emitters.
  • Keep local mode off untrusted networks.
  • Keep NATS and SurrealDB private; use external managed or operator-owned dependencies for production.
  • Run self-observability as a normal CloudGrid project with a normal ingest credential.
  • Run production benchmark probes with CLOUDGRID_BENCH_DEPLOYMENT_PROFILE=production-like, CLOUDGRID_BENCH_ENVIRONMENT_ID, and CLOUDGRID_BENCH_IMAGE_TAG against the exact deployment.
  • Run the relevant root verification commands before deployment; see Commands.
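
Several checklist items are environment settings, so they lend themselves to a preflight check. The sketch below validates the variables named in the checklist against a mapping. The function names and the 32-character minimum for the session secret are assumptions (the checklist only says "strong"), and the check treats the provider encryption key as always required, although the checklist needs it only before managed AI provider API keys are allowed.

```python
# Assumption: the checklist only says "strong"; choose a threshold that fits your policy.
MIN_SECRET_LENGTH = 32


def preflight(env: dict[str, str]) -> list[str]:
    """Return a list of problems found in the deployed-mode settings."""
    problems = []
    if env.get("CLOUDGRID_DEPLOYMENT_MODE") != "deployed":
        problems.append("CLOUDGRID_DEPLOYMENT_MODE must be 'deployed'")
    if env.get("CLOUDGRID_AUTH_MODE") != "sso":
        problems.append("CLOUDGRID_AUTH_MODE must be 'sso'")
    if len(env.get("CLOUDGRID_SESSION_SECRET", "")) < MIN_SECRET_LENGTH:
        problems.append("CLOUDGRID_SESSION_SECRET is missing or too short")
    # Needed before managed AI provider API keys are allowed; checked unconditionally here.
    if not env.get("CLOUDGRID_PROVIDER_SECRET_ENCRYPTION_KEY"):
        problems.append("CLOUDGRID_PROVIDER_SECRET_ENCRYPTION_KEY is not set")
    return problems


def benchmark_preflight(env: dict[str, str]) -> list[str]:
    """Return a list of problems found in the benchmark probe settings."""
    problems = []
    if env.get("CLOUDGRID_BENCH_DEPLOYMENT_PROFILE") != "production-like":
        problems.append("CLOUDGRID_BENCH_DEPLOYMENT_PROFILE must be 'production-like'")
    for name in ("CLOUDGRID_BENCH_ENVIRONMENT_ID", "CLOUDGRID_BENCH_IMAGE_TAG"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    return problems


# Hypothetical values, for illustration only.
sample_env = {
    "CLOUDGRID_DEPLOYMENT_MODE": "deployed",
    "CLOUDGRID_AUTH_MODE": "sso",
    "CLOUDGRID_SESSION_SECRET": "x" * 48,
    "CLOUDGRID_PROVIDER_SECRET_ENCRYPTION_KEY": "example-key",
}

for problem in preflight(sample_env) + benchmark_preflight(sample_env):
    print(problem)
```

In a real deployment the mapping would typically come from `dict(os.environ)` inside the target service's environment rather than from a literal.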

Scaling Shape

Diagram (scaling shape): BFF replicas; NATS; collector replicas; storage-write workers; storage-read replicas; control-plane replicas; SurrealDB project databases.

The intended scale path is horizontal: add replicas at the service boundaries. Production-scale storage-write is meant to use pull-consumer semantics once that support is implemented and configured. Do not introduce alternate queues, public realtime protocols, frontend direct storage access, REST telemetry reads, or BFF telemetry aggregation.

Next Step

Review Enterprise Helm install, Release artifact verification, and Sizing and scaling, then use Retention operations and Alerting operations to understand which administrative surfaces are configured versus executed.
