Production Deployment


1) Persistence + Durability

  • Run NATS with JetStream persistence (PVCs). Prefer a 3-node NATS cluster for HA.
  • Tune JetStream sync_interval (fsync cadence) for crash durability; the production overlay sets 1s by default.
  • Run Redis with persistence + backups (managed Redis, Redis Sentinel, or Redis Cluster).
  • Verify pointer retention policies (result/context/lock TTLs) match your compliance needs.

2) High Availability

  • Gateway: Stateless — run multiple replicas behind a Service + HPA. Scale freely.
  • Safety kernel: Stateless — can be replicated if your gRPC evaluation load requires it.
  • Scheduler: Uses a distributed Redis lock (cordum:reconciler:default) for the reconciler. Multiple replicas are safe — only one acquires the lock per tick. Scale with care.
  • Workflow engine: Stateful — scale cautiously and coordinate instances.
  • Dashboard: Static SPA served by nginx — scale freely.

The production overlay includes:

  • PodDisruptionBudgets (maxUnavailable: 1) for gateway, scheduler, workflow-engine, and safety-kernel
  • HPAs for gateway (2–10 replicas, CPU 70%, memory 80%) and scheduler (2–10 replicas, same targets)

3) Security Baseline

  • TLS for all ingress traffic.
  • TLS/mTLS for NATS and Redis (or use managed services with encryption).
  • Set CORDUM_ENV=production (or CORDUM_PRODUCTION=true) to enforce TLS on HTTP/gRPC and safety kernel clients.
  • Configure API keys (CORDUM_API_KEYS or CORDUM_API_KEY); production mode fails without keys.
  • Configure user authentication (CORDUM_USER_AUTH_ENABLED=true) for dashboard access with proper user credentials.
  • Set a strong admin password (CORDUM_ADMIN_PASSWORD) for initial admin user creation.
  • Configure policy signature verification (SAFETY_POLICY_PUBLIC_KEY + signature); production mode fails without a public key.
  • Keep gRPC reflection disabled (enable only for debugging with CORDUM_GRPC_REFLECTION=1).
  • HSTS headers are set automatically in production mode (Strict-Transport-Security: max-age=63072000; includeSubDomains).
  • NetworkPolicies to restrict lateral traffic (gateway <-> redis/nats/safety).
  • Secrets stored in a proper secret manager (KMS, Vault, etc.).
  • Rotate API keys and enterprise license material regularly.
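The env-var side of this baseline can be sketched as a starting point (placeholder values throughout; real keys and passwords belong in a secret manager, not in shell history):

```shell
# Minimal production environment sketch -- every value here is a placeholder.
export CORDUM_ENV=production                   # enables TLS enforcement, HSTS, key checks
export CORDUM_API_KEYS="key-one,key-two"       # placeholders; generate strong random keys
export CORDUM_USER_AUTH_ENABLED=true           # dashboard user/password auth
export CORDUM_ADMIN_PASSWORD="change-me-now"   # placeholder; inject from a secret store
export SAFETY_POLICY_PUBLIC_KEY="$POLICY_PUBKEY_PEM"   # PEM contents from your secret store
```

Production mode refuses to start without the API keys and policy public key, so wiring these in first avoids a crash loop on rollout.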

TLS Secrets (K8s)

The production K8s overlay expects these TLS secrets:

Secret                    Purpose
cordum-nats-server-tls    NATS server cert + key + CA
cordum-redis-server-tls   Redis server cert + key + CA
cordum-client-tls         Client cert + key + CA (for service-to-service mTLS)
cordum-ingress-tls        Ingress TLS termination

Generate with cert-manager or manually:

# Example: self-signed CA for dev/staging
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
-keyout ca.key -out ca.crt -days 365 -nodes -subj "/CN=Cordum CA"

# Server cert (e.g., for NATS)
openssl req -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
-keyout server.key -out server.csr -nodes \
-subj "/CN=cordum-nats" \
-addext "subjectAltName=DNS:cordum-nats,DNS:*.cordum-nats.cordum.svc"
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -out server.crt -days 365 -copy_extensions copy

# Create K8s secret
kubectl create secret generic cordum-nats-server-tls \
--from-file=tls.crt=server.crt \
--from-file=tls.key=server.key \
--from-file=ca.crt=ca.crt \
-n cordum

TLS Certificate Hot-Reload

Cordum services that serve TLS use a CertReloader that watches certificate and key files on disk and atomically swaps them without restart or downtime. Existing connections continue with the previously loaded certificate; new connections use the updated one.

Service          Env Vars                                          Label
Gateway (gRPC)   GRPC_TLS_CERT, GRPC_TLS_KEY                       gateway-grpc
Gateway (HTTP)   GATEWAY_HTTP_TLS_CERT, GATEWAY_HTTP_TLS_KEY       gateway-http
Safety Kernel    SAFETY_KERNEL_TLS_CERT, SAFETY_KERNEL_TLS_KEY     safety-kernel
Context Engine   CONTEXT_ENGINE_TLS_CERT, CONTEXT_ENGINE_TLS_KEY   context-engine

How it works:

  • On startup, the service loads the cert/key pair from the paths in the env vars.
  • A background goroutine polls the files every 30 seconds, comparing modification times.
  • When a change is detected, the new cert/key pair is loaded and stored via an atomic pointer swap.
  • If the new files fail to parse (corrupted, incomplete write), the old certificate is kept and an error is logged. The service does not crash.

Operational notes:

  • Write the new cert and key atomically (e.g., write to a temp file then mv) to avoid the reloader seeing a half-written file.
  • In Kubernetes, cert-manager volume mounts update atomically via symlinks — the reloader picks up changes on the next 30-second poll.
  • The label in the log messages (e.g., "label": "gateway-grpc") identifies which service reloaded.
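The atomic-write advice can be demonstrated with a throwaway directory standing in for the cert mount:

```shell
# Simulate an atomic certificate rotation: write the new file under a temp
# name in the same directory, then mv (rename) it into place. rename(2) is
# atomic on a single filesystem, so the 30-second poller can only ever see
# the complete old file or the complete new file -- never a partial write.
certdir=$(mktemp -d)                            # stand-in for /etc/cordum/tls
printf 'OLD CERT\n' > "$certdir/tls.crt"        # currently served certificate
printf 'NEW CERT\n' > "$certdir/tls.crt.tmp"    # freshly issued certificate
mv "$certdir/tls.crt.tmp" "$certdir/tls.crt"    # atomic swap
cat "$certdir/tls.crt"
```

Writing directly to `tls.crt` with `cp` or a redirect risks the reloader reading a truncated file mid-copy; the temp-then-rename pattern closes that window.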

JWT Authentication

The gateway supports JWT bearer tokens for authenticating API requests. Two signing algorithms are supported: HS256 (HMAC-SHA256) and RS256 (RSA-SHA256). Both can be configured simultaneously — the gateway selects the algorithm based on the alg header in each token.

Configuration:

Env Var                      Description
CORDUM_JWT_HMAC_SECRET       HMAC secret for HS256. Prefix with base64: for explicit base64 decoding; raw string bytes used otherwise.
CORDUM_JWT_PUBLIC_KEY        Inline RSA public key (PEM or DER) for RS256
CORDUM_JWT_PUBLIC_KEY_PATH   Path to RSA public key file for RS256 (enables file-based key rotation)
CORDUM_JWT_ISSUER            Expected iss claim. Required in production — requests fail without it.
CORDUM_JWT_AUDIENCE          Expected aud claim. Required in production — requests fail without it.
CORDUM_JWT_CLOCK_SKEW        Allowed clock drift (e.g., 30s). Hard cap: 5 minutes.
CORDUM_JWT_DEFAULT_ROLE      Role assigned when token has no role claim (default: viewer)
CORDUM_JWT_REQUIRED          Set to true to reject all requests without a valid JWT

Production enforcement:

When CORDUM_ENV=production (or CORDUM_PRODUCTION=true):

  • CORDUM_JWT_ISSUER must be set — tokens without a matching iss are rejected.
  • CORDUM_JWT_AUDIENCE must be set — tokens without a matching aud are rejected.
  • In non-production mode, missing issuer/audience logs a warning but allows the request.
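For testing these checks, an HS256 token with matching iss and aud claims can be minted from the shell. The issuer, audience, secret, and claim values below are examples, not the gateway's defaults:

```shell
# Mint a test HS256 JWT: base64url(header).base64url(claims).base64url(HMAC).
b64url() { openssl base64 -A | tr '+/' '-_' | tr -d '='; }

secret='my-secret-key'   # raw-bytes form of CORDUM_JWT_HMAC_SECRET (example)
header=$(printf '%s' '{"alg":"HS256","typ":"JWT"}' | b64url)
claims=$(printf '%s' '{"iss":"https://auth.example.com","aud":"cordum","sub":"ops","exp":4102444800}' | b64url)
sig=$(printf '%s.%s' "$header" "$claims" | openssl dgst -sha256 -hmac "$secret" -binary | b64url)
token="$header.$claims.$sig"
echo "$token"
```

Send the result as `Authorization: Bearer $token`; in production mode the gateway rejects it unless iss and aud match `CORDUM_JWT_ISSUER` and `CORDUM_JWT_AUDIENCE`.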

Key rotation (RS256):

When CORDUM_JWT_PUBLIC_KEY_PATH is set, the gateway watches the key file for changes, polling every 30 seconds. On detecting a new modification time, it re-reads the file and atomically swaps the validator. Existing in-flight validations complete with the old key; new requests use the updated key.

For HMAC-only configurations (no key file), the watch loop is a no-op — restart the gateway or send SIGHUP to pick up a new secret.

HMAC secret format:

  • Plain string: CORDUM_JWT_HMAC_SECRET=my-secret-key — uses raw bytes.
  • Base64-encoded: CORDUM_JWT_HMAC_SECRET=base64:dGhpcyBpcyBhIHNlY3JldA== — decodes after the prefix.
  • If the value looks like base64 but is missing the base64: prefix, a warning is logged to help operators migrate from older configurations.
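The prefix handling can be reproduced with openssl; the encoded value below is the same one used in the example above:

```shell
# Encode a raw secret and form the prefixed env var value.
raw='this is a secret'
encoded=$(printf '%s' "$raw" | openssl base64 -A)
echo "CORDUM_JWT_HMAC_SECRET=base64:$encoded"
# The gateway strips the base64: prefix and decodes, recovering the raw bytes:
printf '%s' "$encoded" | openssl base64 -d -A
```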

4) Observability

Metrics Endpoints

Service           Port    Path       Env Override
API Gateway       :9092   /metrics   GATEWAY_METRICS_ADDR
Scheduler         :9090   /metrics   SCHEDULER_METRICS_ADDR
Workflow Engine   :9093   /health    (none)

In production, metrics endpoints bind to loopback unless explicitly set with GATEWAY_METRICS_PUBLIC=1 or SCHEDULER_METRICS_PUBLIC=1.

Health Checks

Service           Endpoint                       Protocol
Gateway           /health                        HTTP
Gateway gRPC      /grpc.health.v1.Health/Check   gRPC
Scheduler         metrics port                   HTTP
Workflow Engine   :9093/health                   HTTP

Key Prometheus Metrics

Scheduler metrics (namespace: cordum):

Metric                                      Type        Description
cordum_jobs_received_total                  Counter     Jobs received by topic
cordum_jobs_dispatched_total                Counter     Jobs dispatched by topic
cordum_jobs_completed_total                 Counter     Jobs completed by topic and status
cordum_safety_denied_total                  Counter     Jobs denied by safety kernel
cordum_safety_unavailable_total             Counter     Jobs deferred (safety kernel unreachable)
cordum_output_policy_quarantined_total      Counter     Output policy quarantines
cordum_output_denials_total                 Counter     Output policy denials/quarantines
cordum_output_redactions_total              Counter     Output policy redactions
cordum_scheduler_dispatch_latency_seconds   Histogram   Receive-to-dispatch latency
cordum_output_check_latency_seconds         Histogram   Output policy check latency by phase
cordum_scheduler_stale_jobs                 Gauge       Stale jobs by state (reconciler)
cordum_scheduler_active_goroutines          Gauge       Active handler goroutines
cordum_scheduler_orphan_replayed_total      Counter     Orphaned PENDING jobs replayed
cordum_saga_rollbacks_total                 Counter     Saga rollbacks triggered
cordum_saga_compensation_failed_total       Counter     Compensation dispatch failures
cordum_saga_rollback_duration_seconds       Histogram   Saga rollback duration

Gateway metrics (namespace: cordum):

Metric                                 Type        Description
cordum_http_requests_total             Counter     HTTP requests by method/route/status
cordum_http_request_duration_seconds   Histogram   Request latency by method/route

Workflow metrics (namespace: cordum):

Metric                             Type        Description
cordum_workflows_started_total     Counter     Workflows started by name
cordum_workflows_completed_total   Counter     Workflows completed by name/status
cordum_workflow_duration_seconds   Histogram   Workflow duration by name

Centralized Logging

All Cordum services emit structured JSON logs via log/slog. Configure your log collector (Fluentd, Vector, Promtail) to ingest from stdout.

5) Operational Limits

  • Configure pool timeouts and max retries (config/timeouts.yaml).
  • Apply policy constraints (max runtime, max retries, artifact sizes).
  • Configure rate limits on the gateway:

    Env Var                       Default   Description
    API_RATE_LIMIT_RPS            100       Authenticated API rate (requests/second)
    API_RATE_LIMIT_BURST          200       Authenticated API burst
    API_PUBLIC_RATE_LIMIT_RPS     10        Public endpoint rate
    API_PUBLIC_RATE_LIMIT_BURST   20        Public endpoint burst

Rate limit keys have a 30-minute TTL with cleanup every 5 minutes.

  • Configure CORS allowed origins:
# Any of these env vars (checked in order)
CORDUM_ALLOWED_ORIGINS="https://dashboard.example.com"
CORDUM_CORS_ALLOW_ORIGINS="https://dashboard.example.com"
CORS_ALLOW_ORIGINS="https://dashboard.example.com"
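The rate/burst pairing above suggests a token-bucket limiter; assuming that model, the refill behavior works out as:

```shell
# With a token bucket (rate + burst), a fully drained bucket refills in
# burst/rate seconds of idle time. Using the authenticated API defaults:
rps=100
burst=200
refill=$(( burst / rps ))
echo "full burst refills in ${refill}s of idle time"
# A client holding steady at or below 100 req/s never sees 429s; a spike
# can consume up to 200 requests instantly before throttling kicks in.
```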

Control Plane Boundary Hardening

Before turning on strict topic/schema/worker boundary checks in production, stage the rollout explicitly. These controls are shared across the gateway, scheduler, topic registry, worker credential store, and dashboard Topics page.

Feature flags

Variable                    Recommended initial value   Production target      Notes
SCHEMA_ENFORCEMENT          warn                        enforce                Gateway rejects bad job payloads only in enforce; scheduler also enforces before dispatch.
WORKER_ATTESTATION          off or warn                 enforce                warn keeps older workers alive while surfacing missing/invalid credentials in logs.
WORKER_READINESS_REQUIRED   false                       true where supported   Keeps legacy heartbeat-only workers schedulable until they send readiness handshakes.
WORKER_READINESS_TTL        60s                         60s to 5m              Increase only if workers handshake infrequently; lower values tighten readiness freshness.

  1. Observe

    • Populate the canonical topic registry (/api/v1/topics or cordumctl topic create)
    • Provision worker credentials for external fleets
    • Keep SCHEMA_ENFORCEMENT=warn
    • Keep WORKER_ATTESTATION=off or warn
    • Keep WORKER_READINESS_REQUIRED=false
  2. Converge

    • Make sure every production topic appears in the dashboard Topics page
    • Verify pack topics point at the correct input/output schemas
    • Rotate workers onto issued credentials and validate attestation logs are clean
    • Confirm workers advertise ready_topics during startup/handshake
  3. Enforce

    • Move SCHEMA_ENFORCEMENT to enforce after submit-time violations reach zero
    • Move WORKER_ATTESTATION to enforce after all production workers have valid credentials
    • Enable WORKER_READINESS_REQUIRED=true only after readiness handshakes are universally present

Operator checks before enabling enforcement

  • cordumctl topic list shows every expected topic with the correct pool and schema IDs
  • Dashboard Topics page shows no unexpected degraded topics
  • cordumctl worker credential list covers all externally managed workers
  • Scheduler logs no recurring attestation failed, unknown_topic, or schema_validation_failed events
  • Staging smoke tests submit representative jobs for every critical topic

Rollback guidance

If enforcement causes production impact, roll back in this order:

  1. Set WORKER_READINESS_REQUIRED=false
  2. Downgrade WORKER_ATTESTATION from enforce to warn (or off if needed)
  3. Downgrade SCHEMA_ENFORCEMENT from enforce to warn

This preserves visibility while reopening the strictest gates first. Record the failing topic, worker ID, or schema ID before rollback so you can close the gap and retry safely.

6) Backup + Restore

Backup Strategy

The production K8s overlay includes CronJobs for hourly backups:

Redis RDB backup (cordum-redis-backup):

# Runs hourly, connects via TLS, dumps RDB to /backup/
redis-cli --tls \
--cacert /etc/cordum/tls/client/ca.crt \
--cert /etc/cordum/tls/client/tls.crt \
--key /etc/cordum/tls/client/tls.key \
-h cordum-redis-0.cordum-redis.cordum.svc -p 6379 \
--rdb /backup/redis-$(date -u +%Y%m%dT%H%M%SZ).rdb

NATS JetStream snapshot (cordum-nats-backup):

# Snapshots both streams
nats stream snapshot CORDUM_SYS /backup/nats-${ts}/CORDUM_SYS.snapshot
nats stream snapshot CORDUM_JOBS /backup/nats-${ts}/CORDUM_JOBS.snapshot

Both CronJobs use a shared 20Gi PVC (cordum-backups) and retain 2 successful + 2 failed job histories.

Restore Procedures

Redis restore from RDB:

# 1. Scale down all Cordum services
kubectl scale deployment --all --replicas=0 -n cordum

# 2. Copy RDB to Redis pod
kubectl cp /path/to/redis-backup.rdb cordum/cordum-redis-0:/data/dump.rdb

# 3. Restart Redis (it loads dump.rdb on startup)
kubectl delete pod cordum-redis-0 -n cordum
# Wait for pod to restart and become ready

# 4. For Redis Cluster: repeat for each node, then run CLUSTER MEET
# 5. Scale Cordum services back up
kubectl scale deployment cordum-api-gateway --replicas=2 -n cordum
kubectl scale deployment cordum-scheduler --replicas=2 -n cordum
kubectl scale deployment cordum-safety-kernel --replicas=2 -n cordum
kubectl scale deployment cordum-workflow-engine --replicas=1 -n cordum

NATS JetStream restore:

# 1. Stop consumers (scale down scheduler + workflow engine)
kubectl scale deployment cordum-scheduler --replicas=0 -n cordum
kubectl scale deployment cordum-workflow-engine --replicas=0 -n cordum

# 2. Restore stream from snapshot
nats stream restore CORDUM_SYS /backup/nats-${ts}/CORDUM_SYS.snapshot \
--server tls://cordum-nats:4222 \
--tlscacert /etc/cordum/tls/client/ca.crt \
--tlscert /etc/cordum/tls/client/tls.crt \
--tlskey /etc/cordum/tls/client/tls.key

nats stream restore CORDUM_JOBS /backup/nats-${ts}/CORDUM_JOBS.snapshot \
--server tls://cordum-nats:4222 \
--tlscacert /etc/cordum/tls/client/ca.crt \
--tlscert /etc/cordum/tls/client/tls.crt \
--tlskey /etc/cordum/tls/client/tls.key

# 3. Scale services back up
kubectl scale deployment cordum-scheduler --replicas=2 -n cordum
kubectl scale deployment cordum-workflow-engine --replicas=1 -n cordum

Configuration backup/restore:

# Backup: export all config from Redis
redis-cli --tls ... GET cfg:system:default > config-backup.json
redis-cli --tls ... DUMP cordum:policies > policies-backup.rdb

# Restore: reimport
redis-cli --tls ... SET cfg:system:default "$(cat config-backup.json)"
# Note: cfg:system:default is write-once (bootstrapConfig). Delete the key
# first if you need to force a reload:
redis-cli --tls ... DEL cfg:system:default

Backup Schedule Recommendations

Data                                          RPO           Method             Schedule
Redis (job state, workflows, DLQ, pointers)   1 hour        RDB snapshot       Hourly CronJob
NATS JetStream (CORDUM_SYS, CORDUM_JOBS)      1 hour        Stream snapshot    Hourly CronJob
Safety policies                               On change     Git + Redis dump   CI/CD pipeline
TLS certificates                              On rotation   Secret manager     Cert-manager auto-renewal
Configuration (system.yaml, pools.yaml)       On change     Git                Version control

Document a restore drill and run it at least quarterly.

7) Upgrade Strategy

  • Use versioned images and a staged rollout (dev -> staging -> prod).
  • Validate schema/policy changes in staging before publish.
  • Keep backward compatibility for workflow payloads.
  • The production K8s overlay pins all images to v0.9.7 — update kustomization.yaml images section for upgrades.

Rolling Upgrade Procedure

# 1. Update image tags in kustomization.yaml
# 2. Apply to staging first
kubectl apply -k deploy/k8s/production -n cordum-staging

# 3. Run smoke test
bash ./tools/scripts/platform_smoke.sh

# 4. Apply to production
kubectl apply -k deploy/k8s/production -n cordum

# 5. Monitor rollout
kubectl rollout status deployment/cordum-api-gateway -n cordum
kubectl rollout status deployment/cordum-scheduler -n cordum
kubectl rollout status deployment/cordum-safety-kernel -n cordum

Rollback

# Rollback a specific deployment
kubectl rollout undo deployment/cordum-api-gateway -n cordum

# Rollback all Cordum deployments
for dep in cordum-api-gateway cordum-scheduler cordum-safety-kernel cordum-workflow-engine; do
kubectl rollout undo deployment/$dep -n cordum
done

8) Enterprise Gating (if applicable)

  • Enforce CORDUM_LICENSE_REQUIRED=true for enterprise gateways.
  • Keep public and enterprise images in separate registries.
  • Audit all API keys and principal roles.

Licensing Configuration

Environment Variables

Variable                         Description
CORDUM_LICENSE_FILE              Path to the license JSON file on disk
CORDUM_LICENSE_TOKEN             Inline license token (alternative to file)
CORDUM_LICENSE_PUBLIC_KEY        Inline ECDSA public key (PEM) for license signature verification
CORDUM_LICENSE_PUBLIC_KEY_PATH   Path to public key file for license verification

Tiers

Tier         Description
Community    Default when no license is installed. Enforced limits on workers, concurrent jobs, rate limits, and audit retention.
Team         Unlocks higher limits, additional features, and extended audit retention.
Enterprise   Full platform capabilities including SAML/SSO, advanced RBAC, unlimited workers, and priority support.

If no license file or token is provided, the gateway starts in Community tier with enforced limits.

Expiry Lifecycle

  1. Valid — License is active and within its validity window.
  2. Warning (30 days before expiry) — The gateway logs warnings and the dashboard displays an expiry notice. All features remain available.
  3. Grace period (14 days after expiry, configurable) — The license is expired but the platform continues operating at its licensed tier with logged warnings.
  4. Degraded — After the grace period, the gateway falls back to Community tier limits. The platform never bricks — it degrades gracefully to community defaults.
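The windows above reduce to simple epoch arithmetic. The timestamps below are examples; the 30-day warning and 14-day grace figures come from the lifecycle list:

```shell
# Classify a license into its lifecycle phase from epoch timestamps.
day=86400
expiry=1700000000                       # example expiry timestamp (epoch seconds)
warn_start=$(( expiry - 30 * day ))     # dashboard/log warnings begin here
grace_end=$(( expiry + 14 * day ))      # after this, fall back to Community limits
now=$(( expiry + 5 * day ))             # example clock: 5 days past expiry
if [ "$now" -lt "$warn_start" ]; then
  phase=valid
elif [ "$now" -lt "$expiry" ]; then
  phase=warning
elif [ "$now" -lt "$grace_end" ]; then
  phase=grace
else
  phase=degraded
fi
echo "license phase: $phase"
```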

CLI Commands

cordumctl license install ./license.json # Install a license from file
cordumctl license info # Display license plan, entitlements, and expiry
cordumctl license reload # Hot-reload license on running gateway (no restart)

Hot-Reload

Reload a license on a running gateway without restart:

curl -X POST https://gateway:8081/api/v1/license/reload \
-H "Authorization: Bearer $API_KEY"

The gateway re-reads the license from environment or disk, revalidates the signature, and applies the new entitlements immediately.

Telemetry Configuration

Mode

CORDUM_TELEMETRY_MODE   Behavior
off                     No telemetry data is collected or stored.
local_only (default)    Telemetry is collected and stored locally for the dashboard. No data is sent externally.
anonymous               Aggregate, anonymized usage data is sent to Cordum. Requires explicit opt-in.

The default is local_only — remote reporting is never enabled without explicit operator opt-in.

On first visit, the dashboard displays a consent banner explaining telemetry modes. The operator must choose a mode before telemetry data is transmitted externally.

Inspect Payload

Review exactly what telemetry data would be sent:

curl -H "Authorization: Bearer $API_KEY" \
https://gateway:8081/api/v1/telemetry/inspect

Data Principles

  • No PII collected — telemetry contains only aggregate counts, hashed install ID, and feature flags.
  • No job content, prompts, outputs, user identifiers, or tenant data is ever included.
  • The hashed install ID is a one-way SHA-256 hash of the instance ID — it cannot be reversed to identify the installation.
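The one-way property is easy to verify locally (the input format of the instance ID is an assumption here, for illustration):

```shell
# SHA-256 of an instance ID: fixed-length digest, deterministic, and
# not reversible to the original identifier.
instance_id='example-instance-1234'   # hypothetical instance ID
hash=$(printf '%s' "$instance_id" | openssl dgst -sha256 | awk '{print $NF}')
echo "$hash"
```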

9) Kubernetes Baseline

  • Requests/limits for every pod (already in deploy/k8s/base.yaml).
  • PodDisruptionBudgets for replicated services.
  • Non-root security contexts for Cordum services (already configured: runAsNonRoot: true, readOnlyRootFilesystem: true, seccompProfile: RuntimeDefault).
  • Readiness/liveness probes on every workload.

10) Production K8s Overlay

kubectl apply -k deploy/k8s/production

The overlay swaps in stateful NATS/Redis, enables TLS/mTLS, applies network policies, adds an ingress with TLS, and installs Prometheus ServiceMonitors + rules (requires the Prometheus Operator CRDs).

Redis clustering uses REDIS_CLUSTER_ADDRESSES as seed nodes; update the list if you change the Redis replica count. JetStream replication is controlled by NATS_JS_REPLICAS (set to 3 in the production overlay).

The overlay includes a cordum-redis-cluster-init Job that bootstraps the Redis cluster once the pods are ready. Re-run it if you replace the cluster.



12) Scaling Guide

Service Scaling Characteristics

Service           Stateless?                 Scaling Model         Recommended Replicas
API Gateway       Yes                        Horizontal (HPA)      2–10
Safety Kernel     Yes                        Horizontal            2–5
Scheduler         Semi (Redis lock)          Horizontal with lock  2–5
Workflow Engine   No (in-memory run state)   Cautious horizontal   1–2
Dashboard         Yes (static SPA)           Horizontal            1–3
NATS              No (JetStream state)       StatefulSet cluster   3 (fixed)
Redis             No (data store)            Cluster (6 nodes)     6 (3 primary + 3 replica)

Horizontal Scaling

Gateway: Fully stateless. Add replicas freely. Each instance handles HTTP + gRPC + WebSocket connections independently. The production overlay HPA scales on CPU (70%) and memory (80%).

Safety Kernel: Fully stateless — policies are loaded into memory from Redis/config and cached. Each instance evaluates independently. Scale when cordum_output_eval_duration_seconds p99 exceeds 5ms.

Scheduler: Multiple replicas are safe. The reconciler uses a distributed Redis lock (cordum:reconciler:default, TTL = 2x poll interval) so only one instance runs timeout checks at a time. Job handling is concurrent across all replicas via NATS queue groups. Scale when dispatch latency rises.

Workflow Engine: Holds in-memory run state during DAG execution. Scaling requires coordination — runs are not automatically redistributed. Use 1 replica for simplicity, 2 for HA with the understanding that crash recovery resumes from Redis-persisted state.

Vertical Scaling

Redis memory sizing:

  • Base: ~100 bytes per job record, ~500 bytes per workflow run, ~1KB per DLQ entry
  • Estimate: 1 million concurrent jobs = ~100MB job state + context/result pointers
  • Rule of thumb: 2x your estimated working set for headroom
  • Monitor with: redis-cli INFO memory
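Plugging the per-record figures into a quick estimate (illustrative counts only):

```shell
# Rough Redis working-set estimate from the per-record figures above.
jobs=1000000    # concurrent job records (~100 bytes each)
runs=50000      # workflow runs (~500 bytes each)
dlq=10000       # DLQ entries (~1KB each)
bytes=$(( jobs * 100 + runs * 500 + dlq * 1024 ))
mb=$(( bytes / 1024 / 1024 ))
echo "working set ~${mb} MiB; provision ~$(( mb * 2 )) MiB with 2x headroom"
```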

NATS JetStream storage:

  • Each JetStream stream has a configurable max size and message TTL
  • Production overlay: 20Gi PVC per NATS node
  • Monitor fill rate: nats stream info CORDUM_JOBS

Capacity Planning

Metric                   Per Gateway Instance                    Per Scheduler Instance
HTTP requests/sec        ~5,000 (depends on payload size)        N/A
Job dispatches/sec       N/A                                     ~1,000 (depends on safety check latency)
WebSocket connections    ~10,000 (limited by file descriptors)   N/A
Safety evaluations/sec   N/A (proxied to kernel)                 ~2,000 per kernel instance

These are approximate — benchmark with your actual workload using tools/scripts/platform_smoke.sh as a starting point.
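A back-of-envelope replica count follows from ceiling division over these figures (example target load; benchmark before trusting it):

```shell
# How many gateway replicas for a target request rate, using the rough
# ~5,000 req/s per-instance figure from the capacity table.
target_rps=18000
per_instance=5000
replicas=$(( (target_rps + per_instance - 1) / per_instance ))   # ceiling division
echo "gateway replicas needed: $replicas"
```

Keep the result within the HPA bounds (2–10 for the production overlay) and leave headroom for node failures.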



14) Security Hardening Checklist

Area                      Check                                 How to Verify
TLS everywhere            All inter-service traffic encrypted   kubectl exec ... -- env | grep TLS
API keys                  Strong, unique keys per tenant        Check CORDUM_API_KEYS length (>= 32 hex chars)
Admin password            Strong password set                   CORDUM_ADMIN_PASSWORD is non-empty, >= 12 chars
Policy signatures         Signature verification enabled        SAFETY_POLICY_PUBLIC_KEY is set
gRPC reflection           Disabled in production                CORDUM_GRPC_REFLECTION is unset or 0
HSTS                      Auto-enabled in production mode       Response includes Strict-Transport-Security header
CORS                      Explicit allowlist (not *)            CORDUM_ALLOWED_ORIGINS lists specific domains
Rate limiting             Configured per environment            API_RATE_LIMIT_RPS / API_RATE_LIMIT_BURST set
Network policies          Lateral traffic restricted            kubectl get networkpolicy -n cordum
Dashboard API key embed   Disabled                              CORDUM_DASHBOARD_EMBED_API_KEY is unset
Secrets management        External secret store                 No plaintext secrets in manifests
Security contexts         Non-root, read-only filesystem        Check pod spec securityContext
Key rotation              Schedule documented                   API keys and TLS certs have rotation schedule
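Two of the rows above reduce to scriptable checks (example values only; substitute real secrets when auditing an environment):

```shell
# API key format check: at least 32 lowercase hex characters.
api_key='0123456789abcdef0123456789abcdef'    # example value only
if printf '%s' "$api_key" | grep -Eq '^[0-9a-f]{32,}$'; then key_ok=yes; else key_ok=no; fi

# Admin password length check: at least 12 characters.
admin_pw='correct-horse-battery-staple'       # example value only
if [ "${#admin_pw}" -ge 12 ]; then pw_ok=yes; else pw_ok=no; fi

echo "api_key=$key_ok admin_password=$pw_ok"
```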


Environment Variables Reference

Complete list of production-relevant environment variables:

Variable                           Service              Description
CORDUM_ENV                         All                  Set to production to enable production mode
CORDUM_PRODUCTION                  All                  Alternative: set to true for production mode
CORDUM_API_KEY                     Gateway              Single API key for authentication
CORDUM_API_KEYS                    Gateway              Comma-separated list of API keys
CORDUM_USER_AUTH_ENABLED           Gateway              Enable user/password authentication
CORDUM_ADMIN_PASSWORD              Gateway              Admin user password (required if user auth enabled)
CORDUM_ALLOWED_ORIGINS             Gateway              CORS allowed origins (comma-separated)
CORDUM_TLS_MIN_VERSION             All                  Minimum TLS version (1.2 or 1.3; defaults to 1.3 in production)
CORDUM_GRPC_REFLECTION             Gateway              Enable gRPC reflection (1 to enable)
CORDUM_DASHBOARD_EMBED_API_KEY     Gateway              Embed API key in dashboard JS (DO NOT use in production)
CORDUM_ALLOW_HEADER_PRINCIPAL      Gateway              Allow X-Principal header override (disabled in production)
GATEWAY_HTTP_ADDR                  Gateway              HTTP listen address (default :8081)
GATEWAY_GRPC_ADDR                  Gateway              gRPC listen address (default :50051)
GATEWAY_METRICS_ADDR               Gateway              Metrics listen address (default :9092)
GATEWAY_METRICS_PUBLIC             Gateway              Allow public metrics bind (1 to allow)
GATEWAY_HTTP_TLS_CERT              Gateway              Path to HTTP TLS certificate
GATEWAY_HTTP_TLS_KEY               Gateway              Path to HTTP TLS key
SAFETY_KERNEL_ADDR                 Gateway, Scheduler   Safety kernel gRPC address
SAFETY_KERNEL_TLS_REQUIRED         Gateway, Scheduler   Require TLS for safety kernel connection
SAFETY_KERNEL_TLS_CA               Gateway, Scheduler   CA cert for safety kernel TLS
SAFETY_POLICY_PUBLIC_KEY           Safety Kernel        ECDSA public key for policy signature verification
SAFETY_POLICY_SIGNATURE_REQUIRED   Safety Kernel        Require policy signatures
SCHEDULER_METRICS_ADDR             Scheduler            Metrics listen address (default :9090)
SCHEDULER_METRICS_PUBLIC           Scheduler            Allow public metrics bind (1 to allow)
API_RATE_LIMIT_RPS                 Gateway              Authenticated API rate limit (requests/sec)
API_RATE_LIMIT_BURST               Gateway              Authenticated API burst limit
API_PUBLIC_RATE_LIMIT_RPS          Gateway              Public endpoint rate limit
API_PUBLIC_RATE_LIMIT_BURST        Gateway              Public endpoint burst limit
NATS_TLS_CA                        All                  NATS TLS CA certificate path
NATS_TLS_CERT                      All                  NATS TLS client certificate path
NATS_TLS_KEY                       All                  NATS TLS client key path
NATS_JS_REPLICAS                   Scheduler            JetStream replication factor
REDIS_TLS_CA                       All                  Redis TLS CA certificate path
REDIS_TLS_CERT                     All                  Redis TLS client certificate path
REDIS_TLS_KEY                      All                  Redis TLS client key path
REDIS_CLUSTER_ADDRESSES            All                  Redis cluster seed node addresses
CORDUM_LICENSE_REQUIRED            Gateway              Enforce enterprise license check
CORDUM_LICENSE_FILE                Gateway              Path to license JSON file
CORDUM_LICENSE_TOKEN               Gateway              Inline license token
CORDUM_LICENSE_PUBLIC_KEY          Gateway              Inline ECDSA public key for license verification
CORDUM_LICENSE_PUBLIC_KEY_PATH     Gateway              Path to public key file for license verification
CORDUM_TELEMETRY_MODE              Gateway              Telemetry mode: off, local_only (default), anonymous

Production Readiness Guide

This guide covers the minimum hardening steps before running Cordum in production, plus operational runbooks for disaster recovery, incident response, and scaling.

The sample K8s manifests under deploy/k8s are a starter only. For a production-oriented baseline, use the kustomize overlay under deploy/k8s/production/ (stateful NATS/Redis, TLS/mTLS, network policies, monitoring, backups, and HA defaults).

For Helm-based deployments, start with cordum-helm/ and apply the same hardening steps below.

For detailed Kubernetes manifest walkthroughs, see k8s-deployment.md.


1) Persistence + Durability

  • Run NATS with JetStream persistence (PVCs). Prefer a 3-node NATS cluster for HA.
  • Tune JetStream sync_interval (fsync cadence) for crash durability; the production overlay sets 1s by default.
  • Run Redis with persistence + backups (managed Redis, Redis Sentinel, or Redis Cluster).
  • Verify pointer retention policies (result/context/lock TTLs) match your compliance needs.

2) High Availability

  • Gateway: Stateless — run multiple replicas behind a Service + HPA. Scale freely.
  • Safety kernel: Stateless — can be replicated if your gRPC evaluation load requires it.
  • Scheduler: Uses a distributed Redis lock (cordum:reconciler:default) for the reconciler. Multiple replicas are safe — only one acquires the lock per tick. Scale with care.
  • Workflow engine: Stateful — scale cautiously and coordinate instances.
  • Dashboard: Static SPA served by nginx — scale freely.

The production overlay includes:

  • PodDisruptionBudgets (maxUnavailable: 1) for gateway, scheduler, workflow-engine, and safety-kernel
  • HPAs for gateway (2–10 replicas, CPU 70%, memory 80%) and scheduler (2–10 replicas, same targets)

3) Security Baseline

  • TLS for all ingress traffic.
  • TLS/mTLS for NATS and Redis (or use managed services with encryption).
  • Set CORDUM_ENV=production (or CORDUM_PRODUCTION=true) to enforce TLS on HTTP/gRPC and safety kernel clients.
  • Configure API keys (CORDUM_API_KEYS or CORDUM_API_KEY); production mode fails without keys.
  • Configure user authentication (CORDUM_USER_AUTH_ENABLED=true) for dashboard access with proper user credentials.
  • Set a strong admin password (CORDUM_ADMIN_PASSWORD) for initial admin user creation.
  • Configure policy signature verification (SAFETY_POLICY_PUBLIC_KEY + signature); production mode fails without a public key.
  • Keep gRPC reflection disabled (enable only for debugging with CORDUM_GRPC_REFLECTION=1).
  • HSTS headers are set automatically in production mode (Strict-Transport-Security: max-age=63072000; includeSubDomains).
  • NetworkPolicies to restrict lateral traffic (gateway <-> redis/nats/safety).
  • Secrets stored in a proper secret manager (KMS, Vault, etc.).
  • Rotate API keys and enterprise license material regularly.

TLS Secrets (K8s)

The production K8s overlay expects these TLS secrets:

SecretPurpose
cordum-nats-server-tlsNATS server cert + key + CA
cordum-redis-server-tlsRedis server cert + key + CA
cordum-client-tlsClient cert + key + CA (for service-to-service mTLS)
cordum-ingress-tlsIngress TLS termination

Generate with cert-manager or manually:

# Example: self-signed CA for dev/staging
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
-keyout ca.key -out ca.crt -days 365 -nodes -subj "/CN=Cordum CA"

# Server cert (e.g., for NATS)
openssl req -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
-keyout server.key -out server.csr -nodes \
-subj "/CN=cordum-nats" \
-addext "subjectAltName=DNS:cordum-nats,DNS:*.cordum-nats.cordum.svc"
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -out server.crt -days 365 -copy_extensions copy

# Create K8s secret
kubectl create secret generic cordum-nats-server-tls \
--from-file=tls.crt=server.crt \
--from-file=tls.key=server.key \
--from-file=ca.crt=ca.crt \
-n cordum

TLS Certificate Hot-Reload

Cordum services that serve TLS use a CertReloader that watches certificate and key files on disk and atomically swaps them without restart or downtime. Existing connections continue with the previously loaded certificate; new connections use the updated one.

ServiceEnv VarsLabel
Gateway (gRPC)GRPC_TLS_CERT, GRPC_TLS_KEYgateway-grpc
Gateway (HTTP)GATEWAY_HTTP_TLS_CERT, GATEWAY_HTTP_TLS_KEYgateway-http
Safety KernelSAFETY_KERNEL_TLS_CERT, SAFETY_KERNEL_TLS_KEYsafety-kernel
Context EngineCONTEXT_ENGINE_TLS_CERT, CONTEXT_ENGINE_TLS_KEYcontext-engine

How it works:

  • On startup, the service loads the cert/key pair from the paths in the env vars.
  • A background goroutine polls the files every 30 seconds, comparing modification times.
  • When a change is detected, the new cert/key pair is loaded and stored via an atomic pointer swap.
  • If the new files fail to parse (corrupted, incomplete write), the old certificate is kept and an error is logged. The service does not crash.

Operational notes:

  • Write the new cert and key atomically (e.g., write to a temp file then mv) to avoid the reloader seeing a half-written file.
  • In Kubernetes, cert-manager volume mounts update atomically via symlinks — the reloader picks up changes on the next 30-second poll.
  • The label in the log messages (e.g., "label": "gateway-grpc") identifies which service reloaded.
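The temp-file-then-mv pattern from the first note can be sketched as follows (the helper name and paths are illustrative, not a Cordum tool):

```shell
# Atomically replace a served cert file so the 30-second poller never
# observes a half-written file. mv within one filesystem is a rename(2),
# which is atomic; writing into the destination in place is not.
write_cert_atomically() {
  dest=$1 src=$2
  tmp=$(mktemp "$(dirname "$dest")/.tls.XXXXXX")  # same dir => same filesystem
  cat "$src" > "$tmp"
  chmod 0600 "$tmp"
  mv "$tmp" "$dest"  # readers see the old file or the new one, never a partial
}
```

Rotate the key and the cert back-to-back so the reloader picks up a matching pair on its next poll.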

JWT Authentication

The gateway supports JWT bearer tokens for authenticating API requests. Two signing algorithms are supported: HS256 (HMAC-SHA256) and RS256 (RSA-SHA256). Both can be configured simultaneously — the gateway selects the algorithm based on the alg header in each token.

Configuration:

| Env Var | Description |
| --- | --- |
| CORDUM_JWT_HMAC_SECRET | HMAC secret for HS256. Prefix with base64: for explicit base64 decoding; raw string bytes used otherwise. |
| CORDUM_JWT_PUBLIC_KEY | Inline RSA public key (PEM or DER) for RS256 |
| CORDUM_JWT_PUBLIC_KEY_PATH | Path to RSA public key file for RS256 (enables file-based key rotation) |
| CORDUM_JWT_ISSUER | Expected iss claim. Required in production; requests fail without it. |
| CORDUM_JWT_AUDIENCE | Expected aud claim. Required in production; requests fail without it. |
| CORDUM_JWT_CLOCK_SKEW | Allowed clock drift (e.g., 30s). Hard cap: 5 minutes. |
| CORDUM_JWT_DEFAULT_ROLE | Role assigned when token has no role claim (default: viewer) |
| CORDUM_JWT_REQUIRED | Set to true to reject all requests without a valid JWT |

Production enforcement:

When CORDUM_ENV=production (or CORDUM_PRODUCTION=true):

  • CORDUM_JWT_ISSUER must be set — tokens without a matching iss are rejected.
  • CORDUM_JWT_AUDIENCE must be set — tokens without a matching aud are rejected.
  • In non-production mode, missing issuer/audience logs a warning but allows the request.

Key rotation (RS256):

When CORDUM_JWT_PUBLIC_KEY_PATH is set, the gateway watches the key file for changes, polling every 30 seconds. On detecting a new modification time, it re-reads the file and atomically swaps the validator. Existing in-flight validations complete with the old key; new requests use the updated key.

For HMAC-only configurations (no key file), the watch loop is a no-op — restart the gateway or send SIGHUP to pick up a new secret.

HMAC secret format:

  • Plain string: CORDUM_JWT_HMAC_SECRET=my-secret-key — uses raw bytes.
  • Base64-encoded: CORDUM_JWT_HMAC_SECRET=base64:dGhpcyBpcyBhIHNlY3JldA== — decodes after the prefix.
  • If the value looks like base64 but is missing the base64: prefix, a warning is logged to help operators migrate from older configurations.
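The prefix rule can be mimicked locally in shell (an illustration of the documented behavior, not the gateway's implementation):

```shell
# A "base64:" prefix means decode the remainder; anything else is raw bytes.
decode_hmac_secret() {
  case "$1" in
    base64:*) printf '%s' "${1#base64:}" | base64 -d ;;
    *)        printf '%s' "$1" ;;
  esac
}

decode_hmac_secret "base64:dGhpcyBpcyBhIHNlY3JldA=="  # prints: this is a secret
```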

4) Observability

Metrics Endpoints

| Service | Port | Path | Env Override |
| --- | --- | --- | --- |
| API Gateway | :9092 | /metrics | GATEWAY_METRICS_ADDR |
| Scheduler | :9090 | /metrics | SCHEDULER_METRICS_ADDR |
| Workflow Engine | :9093 | /health | (none) |

In production, metrics endpoints bind to loopback unless explicitly set with GATEWAY_METRICS_PUBLIC=1 or SCHEDULER_METRICS_PUBLIC=1.

Health Checks

| Service | Endpoint | Protocol |
| --- | --- | --- |
| Gateway | /health | HTTP |
| Gateway gRPC | /grpc.health.v1.Health/Check | gRPC |
| Scheduler | metrics port | HTTP |
| Workflow Engine | :9093/health | HTTP |

Key Prometheus Metrics

Scheduler metrics (namespace: cordum):

| Metric | Type | Description |
| --- | --- | --- |
| cordum_jobs_received_total | Counter | Jobs received by topic |
| cordum_jobs_dispatched_total | Counter | Jobs dispatched by topic |
| cordum_jobs_completed_total | Counter | Jobs completed by topic and status |
| cordum_safety_denied_total | Counter | Jobs denied by safety kernel |
| cordum_safety_unavailable_total | Counter | Jobs deferred (safety kernel unreachable) |
| cordum_output_policy_quarantined_total | Counter | Output policy quarantines |
| cordum_output_denials_total | Counter | Output policy denials/quarantines |
| cordum_output_redactions_total | Counter | Output policy redactions |
| cordum_scheduler_dispatch_latency_seconds | Histogram | Receive-to-dispatch latency |
| cordum_output_check_latency_seconds | Histogram | Output policy check latency by phase |
| cordum_scheduler_stale_jobs | Gauge | Stale jobs by state (reconciler) |
| cordum_scheduler_active_goroutines | Gauge | Active handler goroutines |
| cordum_scheduler_orphan_replayed_total | Counter | Orphaned PENDING jobs replayed |
| cordum_saga_rollbacks_total | Counter | Saga rollbacks triggered |
| cordum_saga_compensation_failed_total | Counter | Compensation dispatch failures |
| cordum_saga_rollback_duration_seconds | Histogram | Saga rollback duration |

Gateway metrics (namespace: cordum):

| Metric | Type | Description |
| --- | --- | --- |
| cordum_http_requests_total | Counter | HTTP requests by method/route/status |
| cordum_http_request_duration_seconds | Histogram | Request latency by method/route |

Workflow metrics (namespace: cordum):

| Metric | Type | Description |
| --- | --- | --- |
| cordum_workflows_started_total | Counter | Workflows started by name |
| cordum_workflows_completed_total | Counter | Workflows completed by name/status |
| cordum_workflow_duration_seconds | Histogram | Workflow duration by name |

Centralized Logging

All Cordum services emit structured JSON logs via log/slog. Configure your log collector (Fluentd, Vector, Promtail) to ingest from stdout.

5) Operational Limits

  • Configure pool timeouts and max retries (config/timeouts.yaml).
  • Apply policy constraints (max runtime, max retries, artifact sizes).
  • Configure rate limits on the gateway:
| Env Var | Default | Description |
| --- | --- | --- |
| API_RATE_LIMIT_RPS | 100 | Authenticated API rate (requests/second) |
| API_RATE_LIMIT_BURST | 200 | Authenticated API burst |
| API_PUBLIC_RATE_LIMIT_RPS | 10 | Public endpoint rate |
| API_PUBLIC_RATE_LIMIT_BURST | 20 | Public endpoint burst |

Rate limit keys have a 30-minute TTL with cleanup every 5 minutes.
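The rps/burst pairing behaves like a token bucket: burst sets the bucket size, rps the refill rate. A toy model of that behavior (not the gateway's code; the current time is passed in for determinism):

```shell
RPS=100 BURST=200
tokens=$BURST
last_refill=0

# allow_request NOW -> returns 0 (admit) or 1 (reject with HTTP 429)
allow_request() {
  now=$1
  tokens=$(( tokens + (now - last_refill) * RPS ))   # refill since last call
  if [ "$tokens" -gt "$BURST" ]; then tokens=$BURST; fi
  last_refill=$now
  if [ "$tokens" -gt 0 ]; then
    tokens=$(( tokens - 1 ))                          # admit, consume a token
    return 0
  fi
  return 1
}
```

In real use you would call allow_request "$(date +%s)" per client key; the 30-minute key TTL bounds how long idle buckets are retained.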

  • Configure CORS allowed origins:
# Any of these env vars (checked in order)
CORDUM_ALLOWED_ORIGINS="https://dashboard.example.com"
CORDUM_CORS_ALLOW_ORIGINS="https://dashboard.example.com"
CORS_ALLOW_ORIGINS="https://dashboard.example.com"
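Assuming the checked-in-order fallback treats an empty value like an unset one, the resolution can be sketched as:

```shell
# First non-empty variable wins, in the documented order.
resolve_cors_origins() {
  echo "${CORDUM_ALLOWED_ORIGINS:-${CORDUM_CORS_ALLOW_ORIGINS:-${CORS_ALLOW_ORIGINS:-}}}"
}
```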

Control Plane Boundary Hardening

Before turning on strict topic/schema/worker boundary checks in production, stage the rollout explicitly. These controls are shared across the gateway, scheduler, topic registry, worker credential store, and dashboard Topics page.

Feature flags

| Variable | Recommended initial value | Production target | Notes |
| --- | --- | --- | --- |
| SCHEMA_ENFORCEMENT | warn | enforce | Gateway rejects bad job payloads only in enforce; scheduler also enforces before dispatch. |
| WORKER_ATTESTATION | off or warn | enforce | warn keeps older workers alive while surfacing missing/invalid credentials in logs. |
| WORKER_READINESS_REQUIRED | false | true where supported | Keeps legacy heartbeat-only workers schedulable until they send readiness handshakes. |
| WORKER_READINESS_TTL | 60s | 60s to 5m | Increase only if workers handshake infrequently; lower values tighten readiness freshness. |
  1. Observe

    • Populate the canonical topic registry (/api/v1/topics or cordumctl topic create)
    • Provision worker credentials for external fleets
    • Keep SCHEMA_ENFORCEMENT=warn
    • Keep WORKER_ATTESTATION=off or warn
    • Keep WORKER_READINESS_REQUIRED=false
  2. Converge

    • Make sure every production topic appears in the dashboard Topics page
    • Verify pack topics point at the correct input/output schemas
    • Rotate workers onto issued credentials and validate attestation logs are clean
    • Confirm workers advertise ready_topics during startup/handshake
  3. Enforce

    • Move SCHEMA_ENFORCEMENT to enforce after submit-time violations reach zero
    • Move WORKER_ATTESTATION to enforce after all production workers have valid credentials
    • Enable WORKER_READINESS_REQUIRED=true only after readiness handshakes are universally present
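The three stages reduce to flag settings; a small helper makes the mapping explicit (a sketch of this guide's recommendations, with converge keeping observe's flags while the operational work lands):

```shell
# boundary_flags STAGE -> env assignments for that rollout stage
boundary_flags() {
  case "$1" in
    observe|converge)
      echo "SCHEMA_ENFORCEMENT=warn WORKER_ATTESTATION=warn WORKER_READINESS_REQUIRED=false" ;;
    enforce)
      echo "SCHEMA_ENFORCEMENT=enforce WORKER_ATTESTATION=enforce WORKER_READINESS_REQUIRED=true" ;;
    *)
      echo "unknown stage: $1" >&2; return 1 ;;
  esac
}
```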

Operator checks before enabling enforcement

  • cordumctl topic list shows every expected topic with the correct pool and schema IDs
  • Dashboard Topics page shows no unexpected degraded topics
  • cordumctl worker credential list covers all externally managed workers
  • Scheduler logs no recurring attestation failed, unknown_topic, or schema_validation_failed events
  • Staging smoke tests submit representative jobs for every critical topic

Rollback guidance

If enforcement causes production impact, roll back in this order:

  1. Set WORKER_READINESS_REQUIRED=false
  2. Downgrade WORKER_ATTESTATION from enforce to warn (or off if needed)
  3. Downgrade SCHEMA_ENFORCEMENT from enforce to warn

This preserves visibility while reopening the strictest gates first. Record the failing topic, worker ID, or schema ID before rollback so you can close the gap and retry safely.

6) Backup + Restore

Backup Strategy

The production K8s overlay includes CronJobs for hourly backups:

Redis RDB backup (cordum-redis-backup):

# Runs hourly, connects via TLS, dumps RDB to /backup/
redis-cli --tls \
--cacert /etc/cordum/tls/client/ca.crt \
--cert /etc/cordum/tls/client/tls.crt \
--key /etc/cordum/tls/client/tls.key \
-h cordum-redis-0.cordum-redis.cordum.svc -p 6379 \
--rdb /backup/redis-$(date -u +%Y%m%dT%H%M%SZ).rdb

NATS JetStream snapshot (cordum-nats-backup):

# Snapshots both streams; ts is a shared UTC timestamp for the pair
ts=$(date -u +%Y%m%dT%H%M%SZ)
nats stream snapshot CORDUM_SYS /backup/nats-${ts}/CORDUM_SYS.snapshot
nats stream snapshot CORDUM_JOBS /backup/nats-${ts}/CORDUM_JOBS.snapshot

Both CronJobs use a shared 20Gi PVC (cordum-backups) and retain 2 successful + 2 failed job histories.

Restore Procedures

Redis restore from RDB:

# 1. Scale down all Cordum services
kubectl scale deployment --all --replicas=0 -n cordum

# 2. Copy RDB to Redis pod
kubectl cp /path/to/redis-backup.rdb cordum/cordum-redis-0:/data/dump.rdb

# 3. Restart Redis (it loads dump.rdb on startup)
kubectl delete pod cordum-redis-0 -n cordum
# Wait for pod to restart and become ready

# 4. For Redis Cluster: repeat for each node, then run CLUSTER MEET
# 5. Scale Cordum services back up
kubectl scale deployment cordum-api-gateway --replicas=2 -n cordum
kubectl scale deployment cordum-scheduler --replicas=2 -n cordum
kubectl scale deployment cordum-safety-kernel --replicas=2 -n cordum
kubectl scale deployment cordum-workflow-engine --replicas=1 -n cordum

NATS JetStream restore:

# 1. Stop consumers (scale down scheduler + workflow engine)
kubectl scale deployment cordum-scheduler --replicas=0 -n cordum
kubectl scale deployment cordum-workflow-engine --replicas=0 -n cordum

# 2. Restore stream from snapshot
nats stream restore CORDUM_SYS /backup/nats-${ts}/CORDUM_SYS.snapshot \
--server tls://cordum-nats:4222 \
--tlscacert /etc/cordum/tls/client/ca.crt \
--tlscert /etc/cordum/tls/client/tls.crt \
--tlskey /etc/cordum/tls/client/tls.key

nats stream restore CORDUM_JOBS /backup/nats-${ts}/CORDUM_JOBS.snapshot \
--server tls://cordum-nats:4222 \
--tlscacert /etc/cordum/tls/client/ca.crt \
--tlscert /etc/cordum/tls/client/tls.crt \
--tlskey /etc/cordum/tls/client/tls.key

# 3. Scale services back up
kubectl scale deployment cordum-scheduler --replicas=2 -n cordum
kubectl scale deployment cordum-workflow-engine --replicas=1 -n cordum

Configuration backup/restore:

# Backup: export all config from Redis
redis-cli --tls ... GET cfg:system:default > config-backup.json
redis-cli --tls ... DUMP cordum:policies > policies-backup.rdb

# Restore: reimport
redis-cli --tls ... SET cfg:system:default "$(cat config-backup.json)"
# Note: cfg:system:default is write-once (bootstrapConfig). Delete the key
# first if you need to force a reload:
redis-cli --tls ... DEL cfg:system:default

Backup Schedule Recommendations

| Data | RPO | Method | Schedule |
| --- | --- | --- | --- |
| Redis (job state, workflows, DLQ, pointers) | 1 hour | RDB snapshot | Hourly CronJob |
| NATS JetStream (CORDUM_SYS, CORDUM_JOBS) | 1 hour | Stream snapshot | Hourly CronJob |
| Safety policies | On change | Git + Redis dump | CI/CD pipeline |
| TLS certificates | On rotation | Secret manager | Cert-manager auto-renewal |
| Configuration (system.yaml, pools.yaml) | On change | Git | Version control |

Document a restore drill and run it at least quarterly.

7) Upgrade Strategy

  • Use versioned images and a staged rollout (dev -> staging -> prod).
  • Validate schema/policy changes in staging before publishing them.
  • Keep backward compatibility for workflow payloads.
  • The production K8s overlay pins all images to v0.9.7 — update kustomization.yaml images section for upgrades.

Rolling Upgrade Procedure

# 1. Update image tags in kustomization.yaml
# 2. Apply to staging first
kubectl apply -k deploy/k8s/production -n cordum-staging

# 3. Run smoke test
bash ./tools/scripts/platform_smoke.sh

# 4. Apply to production
kubectl apply -k deploy/k8s/production -n cordum

# 5. Monitor rollout
kubectl rollout status deployment/cordum-api-gateway -n cordum
kubectl rollout status deployment/cordum-scheduler -n cordum
kubectl rollout status deployment/cordum-safety-kernel -n cordum

Rollback

# Rollback a specific deployment
kubectl rollout undo deployment/cordum-api-gateway -n cordum

# Rollback all Cordum deployments
for dep in cordum-api-gateway cordum-scheduler cordum-safety-kernel cordum-workflow-engine; do
  kubectl rollout undo deployment/$dep -n cordum
done

8) Enterprise Gating (if applicable)

  • Enforce CORDUM_LICENSE_REQUIRED=true for enterprise gateways.
  • Keep public and enterprise images in separate registries.
  • Audit all API keys and principal roles.

Licensing Configuration

Environment Variables

| Variable | Description |
| --- | --- |
| CORDUM_LICENSE_FILE | Path to the license JSON file on disk |
| CORDUM_LICENSE_TOKEN | Inline license token (alternative to file) |
| CORDUM_LICENSE_PUBLIC_KEY | Inline ECDSA public key (PEM) for license signature verification |
| CORDUM_LICENSE_PUBLIC_KEY_PATH | Path to public key file for license verification |

Tiers

| Tier | Description |
| --- | --- |
| Community | Default when no license is installed. Enforced limits on workers, concurrent jobs, rate limits, and audit retention. |
| Team | Unlocks higher limits, additional features, and extended audit retention. |
| Enterprise | Full platform capabilities including SAML/SSO, advanced RBAC, unlimited workers, and priority support. |

If no license file or token is provided, the gateway starts in Community tier with enforced limits.

Expiry Lifecycle

  1. Valid — License is active and within its validity window.
  2. Warning (30 days before expiry) — The gateway logs warnings and the dashboard displays an expiry notice. All features remain available.
  3. Grace period (14 days after expiry, configurable) — The license is expired but the platform continues operating at its licensed tier with logged warnings.
  4. Degraded — After the grace period, the gateway falls back to Community tier limits. The platform never bricks — it degrades gracefully to community defaults.
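The lifecycle reduces to a comparison against the expiry timestamp; a sketch of the four states (the helper name and argument convention are illustrative):

```shell
# license_state SECONDS_UNTIL_EXPIRY [GRACE_SECONDS]
# A negative first argument means the license is already expired.
license_state() {
  until_expiry=$1
  grace=${2:-$(( 14 * 86400 ))}   # default 14-day grace period
  if   [ "$until_expiry" -gt $(( 30 * 86400 )) ]; then echo valid
  elif [ "$until_expiry" -gt 0 ];                 then echo warning
  elif [ "$until_expiry" -gt $(( -grace )) ];     then echo grace
  else                                                 echo degraded
  fi
}
```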

CLI Commands

cordumctl license install ./license.json # Install a license from file
cordumctl license info # Display license plan, entitlements, and expiry
cordumctl license reload # Hot-reload license on running gateway (no restart)

Hot-Reload

Reload a license on a running gateway without restart:

curl -X POST https://gateway:8081/api/v1/license/reload \
-H "Authorization: Bearer $API_KEY"

The gateway re-reads the license from environment or disk, revalidates the signature, and applies the new entitlements immediately.

Telemetry Configuration

Mode

| CORDUM_TELEMETRY_MODE | Behavior |
| --- | --- |
| off | No telemetry data is collected or stored. |
| local_only (default) | Telemetry is collected and stored locally for the dashboard. No data is sent externally. |
| anonymous | Aggregate, anonymized usage data is sent to Cordum. Requires explicit opt-in. |

The default is local_only — remote reporting is never enabled without explicit operator opt-in.

On first visit, the dashboard displays a consent banner explaining telemetry modes. The operator must choose a mode before telemetry data is transmitted externally.

Inspect Payload

Review exactly what telemetry data would be sent:

curl -H "Authorization: Bearer $API_KEY" \
https://gateway:8081/api/v1/telemetry/inspect

Data Principles

  • No PII collected — telemetry contains only aggregate counts, hashed install ID, and feature flags.
  • No job content, prompts, outputs, user identifiers, or tenant data is ever included.
  • The hashed install ID is a one-way SHA-256 hash of the instance ID — it cannot be reversed to identify the installation.

9) Kubernetes Hardening

  • Requests/limits for every pod (already in deploy/k8s/base.yaml).
  • PodDisruptionBudgets for replicated services.
  • Non-root security contexts for Cordum services (already configured: runAsNonRoot: true, readOnlyRootFilesystem: true, seccompProfile: RuntimeDefault).
  • Readiness/liveness probes on every workload.

10) Production K8s Overlay

kubectl apply -k deploy/k8s/production

The overlay swaps in stateful NATS/Redis, enables TLS/mTLS, applies network policies, adds an ingress with TLS, and installs Prometheus ServiceMonitors + rules (requires the Prometheus Operator CRDs).

Redis clustering uses REDIS_CLUSTER_ADDRESSES as seed nodes; update the list if you change the Redis replica count. JetStream replication is controlled by NATS_JS_REPLICAS (set to 3 in the production overlay).

The overlay includes a cordum-redis-cluster-init Job that bootstraps the Redis cluster once the pods are ready. Re-run it if you replace the cluster.


11) Incident Runbooks

Runbook: Safety Kernel Unreachable

Symptoms: cordum_safety_unavailable_total rising, jobs stuck in PENDING.

Impact: The scheduler cannot evaluate safety policies. Jobs requiring safety checks are deferred with exponential backoff (up to safetyThrottleDelay = 5s). No jobs are auto-approved — the system fails closed.

Steps:

  1. Check safety kernel pod status:

    kubectl get pods -l app=cordum-safety-kernel -n cordum
    kubectl logs -l app=cordum-safety-kernel -n cordum --tail=50
  2. Verify gRPC connectivity:

    # From gateway pod
    grpcurl -cacert /etc/cordum/tls/client/ca.crt \
    cordum-safety-kernel:50051 grpc.health.v1.Health/Check
  3. Check TLS configuration:

    # Verify env vars are set
    kubectl exec deploy/cordum-api-gateway -n cordum -- \
    env | grep SAFETY_KERNEL
    # Expected: SAFETY_KERNEL_ADDR, SAFETY_KERNEL_TLS_CA, etc.
  4. If the kernel is crashlooping, check for policy parse errors:

    kubectl logs -l app=cordum-safety-kernel -n cordum | grep -i "error\|panic"
  5. Escalation: If the safety kernel cannot be restored quickly, all new jobs will queue indefinitely. There is no bypass — this is by design (safety-first). Scale the safety kernel if it's a capacity issue:

    kubectl scale deployment cordum-safety-kernel --replicas=3 -n cordum

Runbook: Redis Connection Lost

Symptoms: HTTP 500 errors across all endpoints, scheduler cannot persist state, DLQ unavailable.

Impact: Critical — Redis stores all job state, workflow state, DLQ, configuration, and pointer data. All services degrade.

Steps:

  1. Check Redis pod status:

    kubectl get pods -l app=redis -n cordum
    kubectl logs -l app=redis -n cordum --tail=50
  2. Test connectivity:

    redis-cli --tls \
    --cacert /etc/cordum/tls/client/ca.crt \
    --cert /etc/cordum/tls/client/tls.crt \
    --key /etc/cordum/tls/client/tls.key \
    -h cordum-redis-0.cordum-redis.cordum.svc PING
  3. For Redis Cluster, check cluster health:

    redis-cli --tls ... CLUSTER INFO
    # Key fields:
    # cluster_state:ok — cluster is healthy
    # cluster_slots_ok:16384 — all hash slots assigned
    # cluster_known_nodes:6 — all nodes visible
  4. Check individual node roles and replication:

    redis-cli --tls ... CLUSTER NODES
    # Each line shows: <id> <ip>:<port> <flags> <master-id> <ping-sent> <pong-recv> <epoch> <link-state> <slot-range>
    # Look for: 3 lines with "master", 3 with "slave", no "fail" or "fail?" flags
  5. Check for stuck slot migrations:

    redis-cli --tls ... CLUSTER NODES | grep -E "migrating|importing"
    # Should return empty — any output means a rebalance is stuck
  6. If nodes are down, check PVC status:

    kubectl get pvc -l app=redis -n cordum
  7. In-flight job impact: Jobs in DISPATCHED or RUNNING state cannot transition. The reconciler will mark them as timed out once Redis is restored. Workers holding these jobs will see stale state.

  8. Data loss assessment: Check the last successful backup timestamp:

    kubectl get jobs -l job-name=cordum-redis-backup -n cordum --sort-by=.status.startTime

    The backup CronJob backs up all primaries (not just node-0). Each backup produces per-shard files: redis-{timestamp}-node{N}.rdb.

  9. Recovery: If a single Redis node is lost, the cluster auto-heals via automatic failover within 5s (cluster-node-timeout). For multi-node or full cluster loss, see the detailed DR runbooks in Redis Operations Guide.

Runbook: NATS Partition or Failure

Symptoms: Jobs not being dispatched, heartbeats missing, WebSocket stream silent.

Impact: NATS is the message bus for all job lifecycle events, heartbeats, and audit trail. A NATS failure halts job dispatch and worker communication.

Steps:

  1. Check NATS cluster status:

    kubectl get pods -l app=cordum-nats -n cordum
    nats server list --server tls://cordum-nats:4222 ...
  2. Check JetStream health:

    nats stream list --server tls://cordum-nats:4222 ...
    nats stream info CORDUM_SYS --server tls://cordum-nats:4222 ...
    nats stream info CORDUM_JOBS --server tls://cordum-nats:4222 ...
  3. Message delivery guarantees: NATS JetStream provides at-least-once delivery for job requests. Messages published during a partition may be lost if not yet replicated. The reconciler (cordum:reconciler:default) will detect stuck jobs and re-queue them.

  4. Manual replay: For orphaned PENDING jobs, the scheduler's pending replayer periodically scans and re-dispatches them. Monitor cordum_scheduler_orphan_replayed_total to confirm replay is working.

  5. Recovery: If a NATS node is permanently lost:

    # Delete the failed pod (StatefulSet will recreate)
    kubectl delete pod cordum-nats-2 -n cordum
    # Wait for it to rejoin the cluster
    nats server list ...

Runbook: DLQ Overflow

Symptoms: cordum_dlq_depth rising steadily, DLQ page in dashboard showing hundreds of entries.

Impact: Dead-lettered jobs are not being processed. May indicate a systemic issue (bad policy, broken worker pool, misconfigured topic routing).

Steps:

  1. Check DLQ depth and recent entries:

    curl -H "X-API-Key: $CORDUM_API_KEY" -H "X-Tenant-ID: default" \
    https://cordum.example.com/api/v1/dlq?limit=10
  2. Triage: Group entries by error reason:

    • timeout: dispatched exceeded — Workers not picking up jobs. Check pool health and topic routing.
    • timeout: running exceeded — Workers hanging. Check worker logs and increase timeout if legitimate.
    • safety: denied — Policy is blocking jobs. Review safety policy rules.
    • output_quarantined — Output policy flagged results. Review in the dashboard Quarantine tab.
    • max retries exceeded — Persistent failures. Check worker error logs.
  3. Bulk retry (safe entries):

    # Retry a specific entry
    curl -X POST -H "X-API-Key: $CORDUM_API_KEY" -H "X-Tenant-ID: default" \
    https://cordum.example.com/api/v1/dlq/JOB_ID/retry

    # Delete entries that should not be retried
    curl -X DELETE -H "X-API-Key: $CORDUM_API_KEY" -H "X-Tenant-ID: default" \
    https://cordum.example.com/api/v1/dlq/JOB_ID
  4. Alerting threshold: Set an alert when DLQ depth exceeds your SLO:

    - alert: CordumDLQDepthHigh
      expr: cordum_dlq_entries > 100
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: DLQ depth exceeds 100 entries

Runbook: High Latency

Symptoms: cordum_scheduler_dispatch_latency_seconds p99 > 1s, cordum_http_request_duration_seconds p99 > 500ms, dashboard feels slow.

Impact: Job throughput degrades, workflow completion times increase, user experience suffers.

Steps:

  1. Identify the bottleneck:

    # Gateway latency by route
    histogram_quantile(0.99, rate(cordum_http_request_duration_seconds_bucket[5m]))

    # Scheduler dispatch latency
    histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))

    # Output policy check latency
    histogram_quantile(0.99, rate(cordum_output_check_latency_seconds_bucket[5m]))

    # Safety kernel evaluation (should be < 5ms p99)
    histogram_quantile(0.99, rate(cordum_output_eval_duration_seconds_bucket[5m]))
  2. Redis latency: Check Redis slowlog:

    redis-cli --tls ... SLOWLOG GET 10
    redis-cli --tls ... INFO stats | grep instantaneous_ops_per_sec
  3. NATS throughput: Check message rates:

    nats server report jetstream --server tls://cordum-nats:4222 ...
  4. Scaling triggers:

    • Gateway CPU > 70% sustained: HPA should auto-scale (check HPA status)
    • Scheduler dispatch latency > 500ms: Consider adding scheduler replicas
    • Safety kernel latency > 5ms p99: Scale safety kernel replicas
    • Redis ops/sec > 80% of max: Vertical scale or add read replicas
  5. Quick wins:

    • Increase gateway/scheduler replicas manually if HPA is too slow
    • Check for hot topics (single topic consuming all scheduler capacity)
    • Verify job lock contention: cordum_scheduler_job_lock_wait_seconds

Runbook: Worker Pool Unresponsive

Symptoms: Jobs stuck in DISPATCHED state, heartbeats missing for a pool, cordum_scheduler_stale_jobs{state="dispatched"} rising.

Steps:

  1. Check recent heartbeats for the pool:

    # Via the dashboard Agent Fleet page, or:
    curl -H "X-API-Key: $CORDUM_API_KEY" -H "X-Tenant-ID: default" \
    https://cordum.example.com/api/v1/workers
  2. Verify workers are connected to NATS and subscribed to the correct topic.

  3. The reconciler automatically times out DISPATCHED jobs after dispatchTimeout (configured in config/timeouts.yaml) and moves them to FAILED/DLQ.

  4. If the pool is permanently gone, DLQ the stuck jobs and reconfigure topic routing.


12) Scaling Guide

Service Scaling Characteristics

| Service | Stateless? | Scaling Model | Recommended Replicas |
| --- | --- | --- | --- |
| API Gateway | Yes | Horizontal (HPA) | 2–10 |
| Safety Kernel | Yes | Horizontal | 2–5 |
| Scheduler | Semi (Redis lock) | Horizontal with lock | 2–5 |
| Workflow Engine | No (in-memory run state) | Cautious horizontal | 1–2 |
| Dashboard | Yes (static SPA) | Horizontal | 1–3 |
| NATS | No (JetStream state) | StatefulSet cluster | 3 (fixed) |
| Redis | No (data store) | Cluster (6 nodes) | 6 (3 primary + 3 replica) |

Horizontal Scaling

Gateway: Fully stateless. Add replicas freely. Each instance handles HTTP + gRPC + WebSocket connections independently. The production overlay HPA scales on CPU (70%) and memory (80%).

Safety Kernel: Fully stateless — policies are loaded into memory from Redis/config and cached. Each instance evaluates independently. Scale when cordum_output_eval_duration_seconds p99 exceeds 5ms.

Scheduler: Multiple replicas are safe. The reconciler uses a distributed Redis lock (cordum:reconciler:default, TTL = 2x poll interval) so only one instance runs timeout checks at a time. Job handling is concurrent across all replicas via NATS queue groups. Scale when dispatch latency rises.

Workflow Engine: Holds in-memory run state during DAG execution. Scaling requires coordination — runs are not automatically redistributed. Use 1 replica for simplicity, 2 for HA with the understanding that crash recovery resumes from Redis-persisted state.

Vertical Scaling

Redis memory sizing:

  • Base: ~100 bytes per job record, ~500 bytes per workflow run, ~1KB per DLQ entry
  • Estimate: 1 million concurrent jobs = ~100MB job state + context/result pointers
  • Rule of thumb: 2x your estimated working set for headroom
  • Monitor with: redis-cli INFO memory
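Those per-record figures turn into a quick back-of-envelope calculation (the record sizes are the estimates above; real usage varies with pointer payloads):

```shell
# estimate_redis_bytes JOBS WORKFLOW_RUNS DLQ_ENTRIES -> working-set bytes
estimate_redis_bytes() {
  echo $(( $1 * 100 + $2 * 500 + $3 * 1024 ))
}

working_set=$(estimate_redis_bytes 1000000 10000 5000)
provision=$(( working_set * 2 ))   # 2x headroom rule of thumb
echo "working set ~$(( working_set / 1048576 )) MiB, provision ~$(( provision / 1048576 )) MiB"
```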

NATS JetStream storage:

  • Each JetStream stream has a configurable max size and message TTL
  • Production overlay: 20Gi PVC per NATS node
  • Monitor fill rate: nats stream info CORDUM_JOBS

Capacity Planning

| Metric | Per Gateway Instance | Per Scheduler Instance |
| --- | --- | --- |
| HTTP requests/sec | ~5,000 (depends on payload size) | N/A |
| Job dispatches/sec | N/A | ~1,000 (depends on safety check latency) |
| WebSocket connections | ~10,000 (limited by file descriptors) | N/A |
| Safety evaluations/sec | N/A (proxied to kernel) | ~2,000 per kernel instance |

These are approximate — benchmark with your actual workload using tools/scripts/platform_smoke.sh as a starting point.
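Once you have benchmarked per-instance numbers, replica counts are a ceiling division (a sketch using the rough figures from the capacity table):

```shell
# replicas_for_peak PEAK_LOAD PER_INSTANCE_CAPACITY -> minimum replica count
replicas_for_peak() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

replicas_for_peak 12000 5000   # gateway at 12k req/s -> 3
```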


13) Monitoring Alerts

Add these to your PrometheusRule resource alongside the platform alerts in deploy/k8s/production/monitoring.yaml:

groups:
  - name: cordum.operations
    rules:
      # Safety kernel evaluation latency
      - alert: CordumSafetyKernelSlow
        expr: |
          histogram_quantile(0.99,
            rate(cordum_output_eval_duration_seconds_bucket[5m])
          ) > 0.005
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Safety kernel p99 latency exceeds 5ms
          description: >
            Output policy evaluation is slower than the 5ms p99 target.
            Consider scaling safety kernel replicas.

      # DLQ depth
      - alert: CordumDLQDepthHigh
        expr: cordum_dlq_entries > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: DLQ depth exceeds 100 entries
          description: >
            Dead-letter queue has {{ $value }} entries.
            Triage and retry or delete entries.

      # Job failure rate
      - alert: CordumJobFailureRateHigh
        expr: |
          rate(cordum_jobs_completed_total{status="failed"}[5m])
            / rate(cordum_jobs_completed_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Job failure rate exceeds 10%
          description: >
            More than 10% of completed jobs are failing.
            Check worker logs and DLQ for root cause.

      # Worker heartbeat gaps
      - alert: CordumWorkerHeartbeatGap
        expr: |
          time() - cordum_last_heartbeat_timestamp > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Worker heartbeat gap exceeds 60 seconds
          description: >
            A worker pool has not sent a heartbeat in over 60 seconds.
            Check worker connectivity and NATS health.

      # Dispatch latency
      - alert: CordumDispatchLatencyHigh
        expr: |
          histogram_quantile(0.99,
            rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Scheduler dispatch p99 latency exceeds 1 second
          description: >
            Jobs are taking more than 1 second to dispatch.
            Check scheduler capacity and Redis latency.

      # Saga compensation failures
      - alert: CordumSagaCompensationFailing
        expr: rate(cordum_saga_compensation_failed_total[10m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Saga compensation dispatches are failing
          description: >
            Compensation jobs are failing to dispatch.
            Check NATS connectivity and safety kernel availability.

      # Stale jobs (reconciler not clearing them)
      - alert: CordumStaleJobsHigh
        expr: cordum_scheduler_stale_jobs > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: More than 50 stale jobs detected
          description: >
            The reconciler has detected {{ $value }} stale jobs in state {{ $labels.state }}.
            Check reconciler logs and Redis connectivity.

      # Output policy quarantine spike
      - alert: CordumOutputQuarantineSpike
        expr: rate(cordum_output_policy_quarantined_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Output quarantine rate exceeds 1/second
          description: >
            Unusual number of outputs being quarantined.
            Review output policy rules and recent job outputs.

14) Security Hardening Checklist

| Area | Check | How to Verify |
| --- | --- | --- |
| TLS everywhere | All inter-service traffic encrypted | kubectl exec ... -- env \| grep TLS |
| API keys | Strong, unique keys per tenant | Check CORDUM_API_KEYS length (>= 32 hex chars) |
| Admin password | Strong password set | CORDUM_ADMIN_PASSWORD is non-empty, >= 12 chars |
| Policy signatures | Signature verification enabled | SAFETY_POLICY_PUBLIC_KEY is set |
| gRPC reflection | Disabled in production | CORDUM_GRPC_REFLECTION is unset or 0 |
| HSTS | Auto-enabled in production mode | Response includes Strict-Transport-Security header |
| CORS | Explicit allowlist (not *) | CORDUM_ALLOWED_ORIGINS lists specific domains |
| Rate limiting | Configured per environment | API_RATE_LIMIT_RPS / API_RATE_LIMIT_BURST set |
| Network policies | Lateral traffic restricted | kubectl get networkpolicy -n cordum |
| Dashboard API key embed | Disabled | CORDUM_DASHBOARD_EMBED_API_KEY is unset |
| Secrets management | External secret store | No plaintext secrets in manifests |
| Security contexts | Non-root, read-only filesystem | Check pod spec securityContext |
| Key rotation | Schedule documented | API keys and TLS certs have rotation schedule |

15) Runbook Checklist

Pre-go-live verification:

  - Smoke test the platform (`bash ./tools/scripts/platform_smoke.sh`)
  - Verify the DLQ + retry flow
  - Verify the policy evaluate/simulate/explain endpoints
  - Confirm the audit trail (run timeline + approval metadata)
  - Run a backup + restore drill
  - Verify all Prometheus alerts fire correctly (use `amtool` or the Alertmanager UI)
  - Load test at expected peak (use `hey` or `k6` against the gateway)
  - Verify graceful shutdown (drain connections, complete in-flight jobs)
  - Confirm the log aggregation pipeline captures all services
  - Verify that network policies block unexpected traffic
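For the load-test item, a guarded sketch using `hey` (one of the tools mentioned above); the target URL, duration, and concurrency are placeholders to adjust for your environment:

```shell
# Run a short load test against the gateway if hey is installed.
# URL, -z (duration) and -c (concurrency) values are illustrative.
load_test() {
  if command -v hey >/dev/null 2>&1; then
    hey -z 30s -c 50 "$1"
  else
    echo "hey not found; skipping load test for $1"
  fi
}

load_test "https://gateway.example.com/"
```

Watch the gateway HPA and latency dashboards while the test runs to confirm scaling behaves as expected.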

Environment Variables Reference

Complete list of production-relevant environment variables:

| Variable | Service | Description |
| --- | --- | --- |
| `CORDUM_ENV` | All | Set to `production` to enable production mode |
| `CORDUM_PRODUCTION` | All | Alternative: set to `true` for production mode |
| `CORDUM_API_KEY` | Gateway | Single API key for authentication |
| `CORDUM_API_KEYS` | Gateway | Comma-separated list of API keys |
| `CORDUM_USER_AUTH_ENABLED` | Gateway | Enable user/password authentication |
| `CORDUM_ADMIN_PASSWORD` | Gateway | Admin user password (required if user auth enabled) |
| `CORDUM_ALLOWED_ORIGINS` | Gateway | CORS allowed origins (comma-separated) |
| `CORDUM_TLS_MIN_VERSION` | All | Minimum TLS version (`1.2` or `1.3`; defaults to `1.3` in production) |
| `CORDUM_GRPC_REFLECTION` | Gateway | Enable gRPC reflection (`1` to enable) |
| `CORDUM_DASHBOARD_EMBED_API_KEY` | Gateway | Embed API key in dashboard JS (DO NOT use in production) |
| `CORDUM_ALLOW_HEADER_PRINCIPAL` | Gateway | Allow `X-Principal` header override (disabled in production) |
| `GATEWAY_HTTP_ADDR` | Gateway | HTTP listen address (default `:8081`) |
| `GATEWAY_GRPC_ADDR` | Gateway | gRPC listen address (default `:50051`) |
| `GATEWAY_METRICS_ADDR` | Gateway | Metrics listen address (default `:9092`) |
| `GATEWAY_METRICS_PUBLIC` | Gateway | Allow public metrics bind (`1` to allow) |
| `GATEWAY_HTTP_TLS_CERT` | Gateway | Path to HTTP TLS certificate |
| `GATEWAY_HTTP_TLS_KEY` | Gateway | Path to HTTP TLS key |
| `SAFETY_KERNEL_ADDR` | Gateway, Scheduler | Safety kernel gRPC address |
| `SAFETY_KERNEL_TLS_REQUIRED` | Gateway, Scheduler | Require TLS for safety kernel connection |
| `SAFETY_KERNEL_TLS_CA` | Gateway, Scheduler | CA cert for safety kernel TLS |
| `SAFETY_POLICY_PUBLIC_KEY` | Safety Kernel | ECDSA public key for policy signature verification |
| `SAFETY_POLICY_SIGNATURE_REQUIRED` | Safety Kernel | Require policy signatures |
| `SCHEDULER_METRICS_ADDR` | Scheduler | Metrics listen address (default `:9090`) |
| `SCHEDULER_METRICS_PUBLIC` | Scheduler | Allow public metrics bind (`1` to allow) |
| `API_RATE_LIMIT_RPS` | Gateway | Authenticated API rate limit (requests/sec) |
| `API_RATE_LIMIT_BURST` | Gateway | Authenticated API burst limit |
| `API_PUBLIC_RATE_LIMIT_RPS` | Gateway | Public endpoint rate limit |
| `API_PUBLIC_RATE_LIMIT_BURST` | Gateway | Public endpoint burst limit |
| `NATS_TLS_CA` | All | NATS TLS CA certificate path |
| `NATS_TLS_CERT` | All | NATS TLS client certificate path |
| `NATS_TLS_KEY` | All | NATS TLS client key path |
| `NATS_JS_REPLICAS` | Scheduler | JetStream replication factor |
| `REDIS_TLS_CA` | All | Redis TLS CA certificate path |
| `REDIS_TLS_CERT` | All | Redis TLS client certificate path |
| `REDIS_TLS_KEY` | All | Redis TLS client key path |
| `REDIS_CLUSTER_ADDRESSES` | All | Redis cluster seed node addresses |
| `CORDUM_LICENSE_REQUIRED` | Gateway | Enforce enterprise license check |
| `CORDUM_LICENSE_FILE` | Gateway | Path to license JSON file |
| `CORDUM_LICENSE_TOKEN` | Gateway | Inline license token |
| `CORDUM_LICENSE_PUBLIC_KEY` | Gateway | Inline ECDSA public key for license verification |
| `CORDUM_LICENSE_PUBLIC_KEY_PATH` | Gateway | Path to public key file for license verification |
| `CORDUM_TELEMETRY_MODE` | Gateway | Telemetry mode: `off`, `local_only` (default), `anonymous` |
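Tying the table together, a minimal production environment sketch. All values are illustrative placeholders (real keys belong in your secret store, not in manifests), and the exact value format for boolean-style variables is assumed here:

```shell
# Minimal production environment sketch -- placeholder values only.
export CORDUM_ENV=production                               # enables production mode
export CORDUM_API_KEYS="replace-with-key-1,replace-with-key-2"
export CORDUM_ALLOWED_ORIGINS="https://dashboard.example.com"
export CORDUM_TLS_MIN_VERSION=1.3
export SAFETY_KERNEL_TLS_REQUIRED=true                     # boolean format assumed
export API_RATE_LIMIT_RPS=100
export API_RATE_LIMIT_BURST=200

echo "mode=$CORDUM_ENV tls_min=$CORDUM_TLS_MIN_VERSION"
```

Remember that production mode refuses to start without API keys and a policy public key, so `CORDUM_API_KEYS` and `SAFETY_POLICY_PUBLIC_KEY` must be populated from your secret store before rollout.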