Related Docs
- production.md — Production readiness guide with DR, runbooks, and scaling
- production-gate.md — Automated production gate script
- configuration.md — Config files and environment variables
- configuration-reference.md — Full config schema reference
- DOCKER.md — Docker Compose development setup
- helm.md — Helm chart deployment
Kubernetes Deployment Guide
Detailed guide for deploying Cordum on Kubernetes — from dev-mode single-apply to production-hardened overlays with TLS, clustering, monitoring, and backups.
For Docker Compose local development, see DOCKER.md. For the production readiness checklist, see production.md. For Helm chart deployment, see helm.md.
Directory Structure
deploy/k8s/
├── base.yaml                 # All-in-one dev manifest
├── ingress.yaml              # Dev ingress (no TLS)
├── local-patches.yaml        # Local dev overrides
├── local/
│   └── kustomization.yaml    # Local kustomize overlay
├── production/
│   ├── kustomization.yaml    # Production overlay entry point
│   ├── nats.yaml             # NATS 3-node StatefulSet + TLS cluster
│   ├── redis.yaml            # Redis 6-node cluster + TLS + exporter
│   ├── ingress.yaml          # Ingress with TLS termination
│   ├── ha.yaml               # PDBs + HPAs
│   ├── monitoring.yaml       # ServiceMonitors + PrometheusRules
│   ├── networkpolicy.yaml    # Ingress/egress network policies
│   ├── backup.yaml           # CronJob backups (Redis RDB + NATS snapshots)
│   └── patches/
│       ├── delete-nats-deployment.yaml   # Remove dev NATS Deployment
│       ├── delete-redis-deployment.yaml  # Remove dev Redis Deployment
│       ├── tls-env.yaml                  # Inject TLS env vars + volume mounts
│       ├── service-labels.yaml           # Add app labels to Services
│       └── pod-anti-affinity.yaml        # Spread pods across nodes
└── README.md
1. Base Manifests (Dev/Staging)
deploy/k8s/base.yaml is a single-file manifest containing all resources for a minimal deployment. Apply it directly for dev or staging:
kubectl apply -f deploy/k8s/base.yaml
Resource Inventory
| Resource | Kind | Ports | Probe |
|---|---|---|---|
| nats | Deployment (1 replica) | 4222 (client) | TCP :4222 |
| redis | Deployment (1 replica) | 6379 | TCP :6379 |
| cordum-context-engine | Deployment (1 replica) | 50070 (gRPC) | gRPC :50070 |
| cordum-safety-kernel | Deployment (1 replica) | 50051 (gRPC) | gRPC :50051 |
| cordum-scheduler | Deployment (1 replica) | 9090 (metrics) | HTTP /metrics :9090 |
| cordum-api-gateway | Deployment (1 replica) | 8080 (gRPC), 8081 (HTTP), 9092 (metrics) | HTTP /health :8081 |
| cordum-workflow-engine | Deployment (1 replica) | 9093 (HTTP) | HTTP /health :9093 |
| cordum-dashboard | Deployment (1 replica) | 8080 (HTTP) | HTTP / :8080 |
ConfigMaps
| ConfigMap | Mount Path | Description |
|---|---|---|
| cordum-pools | /etc/cordum/pools.yaml | Topic-to-pool routing |
| cordum-timeouts | /etc/cordum/timeouts.yaml | Dispatch/running timeouts, reconciler interval |
| cordum-nats-config | /etc/nats/nats.conf | NATS server config (JetStream sync) |
| cordum-safety | /etc/cordum/safety.yaml | Safety kernel policy |
Secrets
| Secret | Keys | Used By |
|---|---|---|
| cordum-api-key | API_KEY | Gateway, Dashboard |
| cordum-admin-creds | CORDUM_ADMIN_USERNAME, CORDUM_ADMIN_PASSWORD, CORDUM_ADMIN_EMAIL | Gateway (optional) |
Important: Set API_KEY to a strong value before applying:
kubectl create secret generic cordum-api-key \
--namespace cordum \
--from-literal=API_KEY="$(openssl rand -hex 32)"
Security Defaults
All Cordum service pods run with hardened security contexts:
securityContext:
  runAsNonRoot: true
  runAsUser: 65532
  runAsGroup: 65532
  fsGroup: 65532
  seccompProfile:
    type: RuntimeDefault
containers:
  - securityContext:
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
ServiceAccount
All pods use a dedicated cordum ServiceAccount with automountServiceAccountToken: false. No Cordum service needs Kubernetes API access, so the API token is not mounted into pods. This follows the principle of least privilege — compromised containers cannot access the K8s API or read cluster secrets.
If a future service needs K8s API access (e.g., for leader election), create a separate ServiceAccount with a scoped Role/RoleBinding.
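For reference, such a scoped ServiceAccount might look like the following sketch. The names and the Lease-based leader-election permissions are illustrative assumptions, not part of the shipped manifests:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cordum-leader-elector   # hypothetical name
  namespace: cordum
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cordum-leader-elector
  namespace: cordum
rules:
  # Leader election via the coordination.k8s.io Lease API only
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cordum-leader-elector
  namespace: cordum
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cordum-leader-elector
subjects:
  - kind: ServiceAccount
    name: cordum-leader-elector
    namespace: cordum
```

Scoping the Role to a single resource type keeps the blast radius small even if that pod is compromised.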
Resource Quotas
The base manifest includes a ResourceQuota and LimitRange for the cordum namespace:
| Quota | Value | Notes |
|---|---|---|
| requests.cpu | 8 | Total CPU requests across all pods |
| limits.cpu | 16 | Total CPU limits |
| requests.memory | 8Gi | Total memory requests |
| limits.memory | 16Gi | Total memory limits |
| pods | 50 | Max pod count (accommodates HPA max replicas + headroom) |
| services | 20 | Max Service count |
| persistentvolumeclaims | 10 | Max PVC count |
The LimitRange assigns default resource requests (100m CPU, 128Mi memory) and limits (500m CPU, 256Mi memory) to containers that don't specify them.
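A LimitRange implementing those defaults would look roughly like this (field values taken from the text above; the object name is illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: cordum-limits   # illustrative name
  namespace: cordum
spec:
  limits:
    - type: Container
      defaultRequest:   # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:          # applied when a container sets no limits
        cpu: 500m
        memory: 256Mi
```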
Adjusting for larger deployments: Increase the ResourceQuota values in base.yaml or apply a kustomize patch in your production overlay:
# In production/kustomization.yaml patches:
- target:
    kind: ResourceQuota
    name: cordum-quota
  patch: |
    - op: replace
      path: /spec/hard/pods
      value: "100"
    - op: replace
      path: /spec/hard/limits.cpu
      value: "32"
Resource Requests/Limits
| Service | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| NATS | 100m | 500m | 128Mi | 512Mi |
| Redis | 100m | 500m | 256Mi | 512Mi |
| Context Engine | 100m | 500m | 128Mi | 512Mi |
| Safety Kernel | 100m | 500m | 128Mi | 512Mi |
| Scheduler | 150m | 750m | 256Mi | 768Mi |
| API Gateway | 200m | 1000m | 256Mi | 1Gi |
| Workflow Engine | 150m | 750m | 256Mi | 768Mi |
| Dashboard | 100m | 500m | 128Mi | 512Mi |
Dev Ingress
For local dev with an Ingress controller:
kubectl apply -f deploy/k8s/ingress.yaml
Routes:
- /api/v1/* and /health → cordum-api-gateway:8081
- / → cordum-dashboard:8080
2. Production Overlay
The production overlay replaces dev-mode NATS/Redis with clustered StatefulSets, adds TLS everywhere, enables HA (PDBs + HPAs), monitoring, network policies, and automated backups.
Apply
kubectl apply -k deploy/k8s/production
What the Overlay Does
The production/kustomization.yaml composes:
Resources added:
- nats.yaml — 3-node NATS cluster (StatefulSet) with mTLS and JetStream persistence
- redis.yaml — 6-node Redis cluster (StatefulSet) with TLS, exporter sidecar, and init job
- ingress.yaml — TLS-terminated Ingress
- ha.yaml — PodDisruptionBudgets + HorizontalPodAutoscalers
- monitoring.yaml — ServiceMonitors + PrometheusRules (requires Prometheus Operator)
- networkpolicy.yaml — Ingress and egress network policies
- backup.yaml — Hourly CronJob backups for Redis and NATS
Patches applied:
- delete-nats-deployment.yaml — Removes dev single-node NATS Deployment (replaced by StatefulSet)
- delete-redis-deployment.yaml — Removes dev single-node Redis Deployment (replaced by StatefulSet)
- tls-env.yaml — Injects TLS env vars and client cert volume mounts into all services
- service-labels.yaml — Adds app labels to Services (required for ServiceMonitor selectors)
- pod-anti-affinity.yaml — Spreads all pods across nodes via preferredDuringSchedulingIgnoredDuringExecution
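As an illustration of how the delete patches work, a strategic-merge delete patch for the dev NATS Deployment can be as small as this (a sketch; the actual patch file may differ):

```yaml
# patches/delete-nats-deployment.yaml (sketch)
$patch: delete
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nats
```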
Image tags:
images:
  - name: cordum-context-engine
    newTag: v0.9.7
  - name: cordum-safety-kernel
    newTag: v0.9.7
  - name: cordum-scheduler
    newTag: v0.9.7
  - name: cordum-api-gateway
    newTag: v0.9.7
  - name: cordum-workflow-engine
    newTag: v0.9.7
  - name: cordum-dashboard
    newTag: v0.9.7
Replica overrides:
replicas:
  - name: cordum-api-gateway
    count: 2
  - name: cordum-safety-kernel
    count: 2
  - name: cordum-scheduler
    count: 2
3. Secrets Management
Required Secrets
| Secret | Keys | Purpose |
|---|---|---|
| cordum-api-key | API_KEY | API authentication |
| cordum-admin-creds | CORDUM_ADMIN_USERNAME, CORDUM_ADMIN_PASSWORD, CORDUM_ADMIN_EMAIL | User auth (optional) |
| cordum-nats-server-tls | tls.crt, tls.key, ca.crt | NATS server TLS |
| cordum-redis-server-tls | tls.crt, tls.key, ca.crt | Redis server TLS |
| cordum-client-tls | tls.crt, tls.key, ca.crt | Client certs for services connecting to NATS/Redis |
| cordum-ingress-tls | tls.crt, tls.key | Ingress TLS termination |
Creating TLS Secrets
Generate a CA and service certificates (example using openssl):
# Generate CA
openssl genrsa -out ca.key 4096
openssl req -new -x509 -key ca.key -sha256 -days 3650 \
-subj "/CN=Cordum Internal CA" -out ca.crt
# Generate NATS server cert
openssl genrsa -out nats.key 2048
openssl req -new -key nats.key -subj "/CN=nats" \
-addext "subjectAltName=DNS:nats,DNS:cordum-nats,DNS:cordum-nats-0.cordum-nats.cordum.svc,DNS:cordum-nats-1.cordum-nats.cordum.svc,DNS:cordum-nats-2.cordum-nats.cordum.svc,DNS:*.cordum-nats.cordum.svc" \
-out nats.csr
openssl x509 -req -in nats.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
-days 365 -sha256 -copy_extensions copyall -out nats.crt
# Generate Redis server cert
openssl genrsa -out redis.key 2048
openssl req -new -key redis.key -subj "/CN=redis" \
-addext "subjectAltName=DNS:redis,DNS:cordum-redis,DNS:*.cordum-redis.cordum.svc" \
-out redis.csr
openssl x509 -req -in redis.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
-days 365 -sha256 -copy_extensions copyall -out redis.crt
# Generate client cert (used by all Cordum services)
openssl genrsa -out client.key 2048
openssl req -new -key client.key -subj "/CN=cordum-client" -out client.csr
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
-days 365 -sha256 -out client.crt
# Create Kubernetes secrets
kubectl create secret generic cordum-nats-server-tls --namespace cordum \
--from-file=tls.crt=nats.crt --from-file=tls.key=nats.key --from-file=ca.crt=ca.crt
kubectl create secret generic cordum-redis-server-tls --namespace cordum \
--from-file=tls.crt=redis.crt --from-file=tls.key=redis.key --from-file=ca.crt=ca.crt
kubectl create secret generic cordum-client-tls --namespace cordum \
--from-file=tls.crt=client.crt --from-file=tls.key=client.key --from-file=ca.crt=ca.crt
For cert-manager automation, create Certificate resources referencing a ClusterIssuer:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cordum-nats-server-tls
  namespace: cordum
spec:
  secretName: cordum-nats-server-tls
  issuerRef:
    name: cordum-ca-issuer
    kind: ClusterIssuer
  commonName: nats
  dnsNames:
    - nats
    - cordum-nats
    - "*.cordum-nats.cordum.svc"
Certificate Rotation
1. Regenerate certs (or let cert-manager auto-renew)
2. Update the Secret: kubectl create secret generic ... --dry-run=client -o yaml | kubectl apply -f -
3. Restart affected pods: kubectl rollout restart deployment -n cordum
4. NATS Clustering
The production overlay deploys a 3-node NATS cluster as a StatefulSet.
Configuration
- Replicas: 3 (cordum-nats-0, cordum-nats-1, cordum-nats-2)
- Headless Service: cordum-nats (DNS-based peer discovery)
- JetStream: Enabled with 20Gi persistent volume per node
- TLS: Full mTLS between clients and cluster peers
- Cluster routes: Hardcoded in ConfigMap for deterministic discovery
- Monitoring: Port 8222 exposed for /healthz probes and metrics
- Anti-affinity: Pods spread across nodes via preferredDuringSchedulingIgnoredDuringExecution
NATS ConfigMap (production)
port: 4222
http: 8222

jetstream {
  store_dir: /data/jetstream
  sync_interval: "1s"
}

tls {
  cert_file: /etc/nats/tls/tls.crt
  key_file: /etc/nats/tls/tls.key
  ca_file: /etc/nats/tls/ca.crt
  verify: true
}

cluster {
  name: cordum
  port: 6222
  routes = [
    nats://cordum-nats-0.cordum-nats.cordum.svc:6222
    nats://cordum-nats-1.cordum-nats.cordum.svc:6222
    nats://cordum-nats-2.cordum-nats.cordum.svc:6222
  ]
  tls { ... }
}
JetStream Durability
- sync_interval: "1s" — fsync every second (trade-off: lower = more durable, slower)
- Streams are replicated across all 3 nodes (NATS_JS_REPLICAS=3 in the TLS patch)
- PVCs: 20Gi ReadWriteOnce per node
Tuning
| Parameter | Default | Notes |
|---|---|---|
| sync_interval | 1s | Lower for stricter durability, higher for throughput |
| NATS_JS_REPLICAS | 3 | Must not exceed cluster size |
| Storage per node | 20Gi | Increase for high-volume deployments |
5. Redis Clustering
The production overlay deploys a 6-node Redis cluster (3 masters + 3 replicas).
Configuration
- Replicas: 6 (StatefulSet cordum-redis)
- Headless Service: cordum-redis (DNS-based discovery)
- TLS: Full TLS with tls-auth-clients yes
- Persistence: AOF enabled (appendonly yes), 20Gi PVC per node
- Cluster mode: cluster-enabled yes, cluster-node-timeout 5000
- Exporter sidecar: oliver006/redis_exporter:v1.58.0 on port 9121
Cluster Init Job
After all 6 Redis pods are running, the cordum-redis-cluster-init Job must complete once:
# Check pod readiness
kubectl get pods -n cordum -l app=redis
# The init job runs automatically — check its status
kubectl get job cordum-redis-cluster-init -n cordum
kubectl logs job/cordum-redis-cluster-init -n cordum
The init job:
1. Waits for all 6 nodes to respond to PING
2. Runs redis-cli --cluster create ... --cluster-replicas 1 --cluster-yes
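Expanded, the create command covers all six pod DNS names. The sketch below just prints the command rather than running it, and omits the TLS and auth flags the real job would need:

```shell
# Build the node list from the StatefulSet naming pattern and print the
# cluster-create command the init job would run.
ADDRS=""
for i in 0 1 2 3 4 5; do
  ADDRS="$ADDRS cordum-redis-$i.cordum-redis.cordum.svc:6379"
done
echo "redis-cli --cluster create$ADDRS --cluster-replicas 1 --cluster-yes"
```

With --cluster-replicas 1, redis-cli assigns the first three nodes as masters and the remaining three as their replicas.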
Pre-Flight Checks
Before running or re-running the init job, verify:
# All 6 pods must be Running and Ready
kubectl get pods -l app=redis -n cordum
# Expected: 6/6 pods in Running state with READY 2/2 (redis + exporter)
# TLS secret must exist
kubectl get secret cordum-client-tls -n cordum
kubectl get secret cordum-redis-server-tls -n cordum
# Password secret must be non-empty
kubectl get secret cordum-redis-secret -n cordum -o jsonpath='{.data.REDIS_PASSWORD}' | base64 -d | wc -c
# Expected: non-zero length
Troubleshooting Init Failures
| Symptom | Cause | Fix |
|---|---|---|
| Job stuck in Pending | Missing TLS secret | Create cordum-client-tls secret |
| Could not connect to Redis | Pods not ready | Wait for all 6 pods, check startup probes |
| ERR Invalid node address | DNS not resolving | Verify headless service cordum-redis exists |
| ERR Nodes don't agree about configuration | Partial previous init | Delete all pods (kubectl delete pods -l app=redis -n cordum), wait for restart, re-run |
Re-running: Delete the job and re-create it if you need to re-initialize:
kubectl delete job cordum-redis-cluster-init -n cordum
kubectl apply -k deploy/k8s/production
For a complete key inventory, DR runbooks, and base-to-production migration, see Redis Operations Guide.
Client Connection
All Cordum services connect to the Redis cluster via:
REDIS_CLUSTER_ADDRESSES=cordum-redis-0.cordum-redis.cordum.svc:6379,...,cordum-redis-5.cordum-redis.cordum.svc:6379
REDIS_URL=rediss://redis:6379
REDIS_TLS_CA=/etc/cordum/tls/client/ca.crt
REDIS_TLS_CERT=/etc/cordum/tls/client/tls.crt
REDIS_TLS_KEY=/etc/cordum/tls/client/tls.key
6. Network Policies
The production overlay defines strict ingress and egress rules per service.
Ingress Rules
| Target | Allowed Sources | Ports |
|---|---|---|
| NATS | All Cordum services + NATS peers | 4222 (client), 6222 (cluster), 8222 (monitor) |
| Redis | All Cordum services + Redis peers | 6379 (client), 16379 (cluster bus) |
Egress Rules
| Source | Allowed Destinations | Ports |
|---|---|---|
| API Gateway | NATS, Redis, Safety Kernel, DNS | 4222, 6379, 50051, 53 |
| Scheduler | NATS, Redis, Safety Kernel, DNS | 4222, 6379, 50051, 53 |
| Workflow Engine | NATS, Redis, DNS | 4222, 6379, 53 |
| Safety Kernel | NATS, Redis, DNS | 4222, 6379, 53 |
| Dashboard | API Gateway, DNS | 8081, 53 |
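As one concrete example of how a row in the egress table translates into a manifest, the Dashboard rule might be sketched like this (the pod label names are assumptions, not copied from the shipped networkpolicy.yaml):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cordum-dashboard-egress   # illustrative name
  namespace: cordum
spec:
  podSelector:
    matchLabels:
      app: cordum-dashboard       # assumed pod label
  policyTypes: ["Egress"]
  egress:
    # Dashboard → API Gateway HTTP
    - to:
        - podSelector:
            matchLabels:
              app: cordum-api-gateway   # assumed pod label
      ports:
        - protocol: TCP
          port: 8081
    # DNS lookups
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```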
Traffic Flow Diagram
Internet ──► Ingress ─┬─► API Gateway (:8080 gRPC, :8081 HTTP, :9092 metrics)
                      └─► Dashboard (:8080) ──► API Gateway

API Gateway             ──► NATS (:4222), Redis (:6379), Safety Kernel (:50051)
Scheduler (:9090)       ──► NATS, Redis, Safety Kernel
Safety Kernel           ──► NATS, Redis
Workflow Engine (:9093) ──► NATS, Redis
Context Engine (:50070) ──► Redis

(NATS peers additionally use :6222 for cluster routes and :8222 for monitoring; Redis peers use :16379 for the cluster bus.)
7. Ingress Configuration
Dev Ingress (no TLS)
kubectl apply -f deploy/k8s/ingress.yaml
Routes /api/v1/* and /health to the gateway, / to the dashboard. No TLS.
Production Ingress (TLS)
The production overlay includes production/ingress.yaml with:
annotations:
  nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
  nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.2 TLSv1.3"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "3600" # WebSocket support
  nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  tls:
    - hosts:
        - cordum.example.com
      secretName: cordum-ingress-tls
Before applying, update the hostname from cordum.example.com to your actual domain.
Create the Ingress TLS secret:
kubectl create secret tls cordum-ingress-tls --namespace cordum \
--cert=cordum-ingress.crt --key=cordum-ingress.key
Annotations for Other Ingress Controllers
Traefik:
annotations:
  traefik.ingress.kubernetes.io/router.tls: "true"
  traefik.ingress.kubernetes.io/router.entrypoints: websecure
Istio (VirtualService): Use an Istio Gateway + VirtualService instead of the Ingress resource.
8. Monitoring
The production overlay deploys Prometheus Operator custom resources (ServiceMonitors and PrometheusRules).
Prerequisite: Install the Prometheus Operator or kube-prometheus-stack Helm chart.
ServiceMonitors
| Target | Port | Path | Interval |
|---|---|---|---|
| cordum-api-gateway | metrics (9092) | /metrics | 30s |
| cordum-scheduler | metrics (9090) | /metrics | 30s |
| cordum-nats-monitor | monitor (8222) | /metrics | 30s |
| cordum-redis (exporter) | metrics (9121) | /metrics | 30s |
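A row in this table corresponds to a ServiceMonitor shaped roughly like the following sketch (selector labels are assumed to match what the service-labels.yaml patch adds):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cordum-api-gateway
  namespace: cordum
spec:
  selector:
    matchLabels:
      app: cordum-api-gateway   # assumed label from service-labels.yaml
  endpoints:
    - port: metrics             # named Service port (9092)
      path: /metrics
      interval: 30s
```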
Alert Rules (PrometheusRule)
| Alert | Expression | Severity | For |
|---|---|---|---|
| CordumGatewayDown | sum(up{service="cordum-api-gateway"}) == 0 | critical | 5m |
| CordumSchedulerDown | sum(up{service="cordum-scheduler"}) == 0 | critical | 5m |
| CordumNATSDown | sum(up{service="cordum-nats-monitor"}) == 0 | critical | 5m |
| CordumRedisDown | sum(up{service="cordum-redis"}) == 0 | critical | 5m |
Key Metrics
| Metric | Source | Description |
|---|---|---|
| cordum_jobs_dispatched_total | Scheduler | Jobs dispatched by pool/type |
| cordum_jobs_duration_seconds | Scheduler | Job latency histogram |
| cordum_safety_evaluations_total | Gateway | Policy evaluations by decision |
| cordum_http_requests_total | Gateway | HTTP requests by method/path/status |
9. Scaling
HorizontalPodAutoscalers
| Service | Min | Max | CPU Target | Memory Target |
|---|---|---|---|---|
| cordum-api-gateway | 2 | 10 | 70% | 80% |
| cordum-scheduler | 2 | 10 | 70% | 80% |
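In autoscaling/v2 terms, the gateway HPA above corresponds roughly to this manifest (a sketch consistent with the table, not a copy of ha.yaml):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cordum-api-gateway
  namespace: cordum
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cordum-api-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # CPU target from the table
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # Memory target from the table
```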
Scaling Recommendations
| Service | Scale When | Notes |
|---|---|---|
| API Gateway | High HTTP request volume | Stateless — scales freely |
| Safety Kernel | High policy evaluation load | Stateless — scales freely |
| Scheduler | Large job backlogs | Stateful leader election — multiple replicas use Redis locking |
| Workflow Engine | Many concurrent workflow runs | Single instance recommended unless using Redis-based coordination |
| Context Engine | High context read/write volume | Stateless — scales freely |
| Dashboard | High user traffic | Static assets — scales freely |
| NATS | N/A | Fixed 3-node cluster; increase storage, not replicas |
| Redis | N/A | Fixed 6-node cluster; increase storage or reshard |
PodDisruptionBudgets
All critical services have PDBs with maxUnavailable: 1:
- cordum-api-gateway
- cordum-scheduler
- cordum-workflow-engine
- cordum-safety-kernel
- cordum-dashboard
Infrastructure StatefulSets use minAvailable to preserve quorum and data availability:
| StatefulSet | minAvailable | Rationale |
|---|---|---|
| NATS (3 nodes) | 2 | Maintains Raft quorum during node drains — losing 2 of 3 nodes would break consensus |
| Redis (6 nodes: 3 primary + 3 replica) | 4 | Ensures at least 2 primary + 2 replica survive, maintaining data availability during rolling upgrades |
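For example, the NATS quorum guarantee above maps onto a PDB like this sketch (the label selector is an assumption):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cordum-nats
  namespace: cordum
spec:
  minAvailable: 2        # keep Raft quorum (2 of 3) during node drains
  selector:
    matchLabels:
      app: nats          # assumed StatefulSet pod label
```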
See Horizontal Scaling Guide for NATS delivery semantics and multi-replica considerations.
10. Backups
The production overlay includes hourly CronJob backups for both Redis and NATS.
Redis Backup
- Schedule: 0 * * * * (hourly)
- Method: redis-cli --rdb creates an RDB snapshot from cordum-redis-0
- Storage: cordum-backups PVC (20Gi)
- Retention: Last 2 successful + 2 failed jobs kept
- File format: redis-YYYYMMDDTHHMMSSZ.rdb
NATS JetStream Backup
- Schedule: 0 * * * * (hourly)
- Method: nats stream snapshot for CORDUM_SYS and CORDUM_JOBS streams
- Storage: Same cordum-backups PVC
- File format: nats-YYYYMMDDTHHMMSSZ/{stream}.snapshot
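The schedule and retention settings above map onto CronJob fields like these (a skeleton sketch; the name, image, and backup command are illustrative and elided):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cordum-redis-backup       # illustrative name
  namespace: cordum
spec:
  schedule: "0 * * * *"           # hourly
  successfulJobsHistoryLimit: 2   # keep last 2 successful jobs
  failedJobsHistoryLimit: 2       # keep last 2 failed jobs
  concurrencyPolicy: Forbid       # never overlap backup runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: redis:7      # hypothetical; actual backup command elided
```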
Restore Procedures
Redis:
# Stop the cluster, copy RDB to data dir, restart
kubectl cp backup/redis-20260213T120000Z.rdb cordum/cordum-redis-0:/data/dump.rdb
kubectl delete pod cordum-redis-0 -n cordum # Pod restarts and loads RDB
NATS:
# Restore a stream snapshot
kubectl exec -n cordum cordum-nats-0 -- nats stream restore CORDUM_JOBS /backup/nats-.../CORDUM_JOBS.snapshot
11. Upgrade Procedures
Rolling Updates
All Cordum Deployments use the default RollingUpdate strategy. To upgrade:
1. Update image tags in production/kustomization.yaml
2. Re-apply: kubectl apply -k deploy/k8s/production
3. Watch rollout: kubectl rollout status deployment/cordum-api-gateway -n cordum
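The tag bump in step 1 can be scripted. A sketch against a scratch file, where v0.9.8 is a hypothetical next version (in practice, run the sed against production/kustomization.yaml):

```shell
# Create a scratch copy that mimics the images stanza, then bump every tag.
cat > /tmp/kustomization.yaml <<'EOF'
images:
  - name: cordum-api-gateway
    newTag: v0.9.7
  - name: cordum-scheduler
    newTag: v0.9.7
EOF
sed -i 's/newTag: v0.9.7/newTag: v0.9.8/' /tmp/kustomization.yaml
grep -c 'newTag: v0.9.8' /tmp/kustomization.yaml   # prints 2
```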
Pre-Upgrade Checklist
- Back up Redis and NATS (or verify CronJob backups are recent)
- Review changelog for breaking changes
- Check that PDBs will allow the rolling update given current replica counts
- If config schema changed, update ConfigMaps before rolling deployments
Rollback
kubectl rollout undo deployment/cordum-api-gateway -n cordum
kubectl rollout undo deployment/cordum-scheduler -n cordum
kubectl rollout undo deployment/cordum-safety-kernel -n cordum
kubectl rollout undo deployment/cordum-workflow-engine -n cordum
kubectl rollout undo deployment/cordum-context-engine -n cordum
kubectl rollout undo deployment/cordum-dashboard -n cordum
12. Troubleshooting
Common Issues
| Symptom | Likely Cause | Fix |
|---|---|---|
| ImagePullBackOff | Missing image or wrong tag | Check kustomization.yaml image tags and registry access |
| CrashLoopBackOff on gateway | Missing API_KEY secret | Create cordum-api-key secret |
| CrashLoopBackOff on scheduler | Can't reach NATS/Redis | Check service DNS, network policies, TLS certs |
| Redis cluster init job stuck | Pods not ready yet | Wait for all 6 Redis pods, then delete/recreate the job |
| NATS connection refused | TLS mismatch | Verify SANs on certs match service DNS names |
| OOMKilled | Memory limit too low | Increase resources.limits.memory in the Deployment |
| ServiceMonitor not scraped | Missing service labels | Verify service-labels.yaml patch was applied |
| Ingress 502 errors | Gateway not ready | Check readiness probe, verify gateway pod is running |
Useful Commands
```shell
# Check all pods
kubectl get pods -n cordum -o wide

# Check events (recent errors)
kubectl get events -n cordum --sort-by=.lastTimestamp | tail -20

# Logs for a specific service
kubectl logs -n cordum deployment/cordum-api-gateway --tail=100

# Check Redis cluster status
kubectl exec -n cordum cordum-redis-0 -- redis-cli --tls \
  --cacert /etc/redis/tls/ca.crt --cert /etc/redis/tls/tls.crt \
  --key /etc/redis/tls/tls.key cluster info

# Check NATS JetStream streams
kubectl exec -n cordum cordum-nats-0 -- nats --tlscacert /etc/nats/tls/ca.crt \
  --tlscert /etc/nats/tls/tls.crt --tlskey /etc/nats/tls/tls.key stream ls

# Check network policies
kubectl get networkpolicy -n cordum

# Force config reload (delete cached config in Redis)
kubectl exec -n cordum deployment/cordum-api-gateway -- \
  redis-cli -h redis -p 6379 DEL cfg:system:default
```
13. Helm Charts
The `cordum-helm/` directory provides an alternative deployment method using Helm v3.
Chart Structure
cordum-helm/
├── Chart.yaml # Chart metadata (v0.9.7)
├── values.yaml # Default values
├── README.md # Chart documentation
└── templates/
├── _helpers.tpl # Template helpers
├── configmap.yaml # Pools, timeouts, safety config
├── configmap-nats.yaml # NATS server configuration
├── deployment-control-plane.yaml # All control plane services
├── deployment-dashboard.yaml # Dashboard deployment
├── ingress.yaml # Ingress resource
├── secret.yaml # API key + admin password
├── service.yaml # Service definitions
└── serviceaccount.yaml # ServiceAccount
Key values.yaml Overrides
| Key | Default | Description |
|---|---|---|
| `global.image.repository` | `ghcr.io/cordum-io/cordum/control-plane` | Base image for control plane services |
| `global.image.tag` | `v0.9.7` | Image tag for all services |
| `secrets.apiKey` | `""` | API key (required) |
| `secrets.adminPassword` | `""` | Admin password for user auth |
| `nats.enabled` | `true` | Deploy bundled NATS |
| `nats.persistence.enabled` | `false` | Enable JetStream persistence |
| `redis.enabled` | `true` | Deploy bundled Redis |
| `redis.persistence.enabled` | `false` | Enable Redis persistence |
| `external.natsUrl` | `""` | Use external NATS (disables bundled) |
| `external.redisUrl` | `""` | Use external Redis (disables bundled) |
| `external.safetyKernelAddr` | `""` | Use external safety kernel |
| `gateway.replicaCount` | `1` | Gateway replicas |
| `gateway.env.apiRateLimitRps` | `50` | API rate limit (requests/second) |
| `gateway.env.userAuthEnabled` | `false` | Enable user/password auth |
| `scheduler.replicaCount` | `1` | Scheduler replicas |
| `ingress.enabled` | `false` | Create Ingress resource |
| `ingress.className` | `""` | Ingress class (nginx, traefik) |
| `ingress.tls` | `[]` | TLS configuration |
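Before installing, the chart can be rendered locally to confirm that overrides from the table above take effect. A sketch (`helm template` needs no cluster access; the dummy API key is only there to satisfy the required value):

```shell
# Render manifests locally and check which image tags would be deployed
helm template cordum ./cordum-helm \
  --set secrets.apiKey=dummy \
  --set global.image.tag=v0.9.7 | grep "image:"
```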
Install
```shell
# Basic install
helm install cordum ./cordum-helm \
  --namespace cordum --create-namespace \
  --set secrets.apiKey="$(openssl rand -hex 32)"

# With persistence and ingress
helm install cordum ./cordum-helm \
  --namespace cordum --create-namespace \
  --set secrets.apiKey="$(openssl rand -hex 32)" \
  --set nats.persistence.enabled=true \
  --set redis.persistence.enabled=true \
  --set ingress.enabled=true \
  --set ingress.className=nginx
```
Production Values Override
Create a values-production.yaml file:
```yaml
# values-production.yaml
global:
  image:
    tag: "v0.9.7"

secrets:
  apiKey: ""  # Set via --set or external secret

# Use external managed services
nats:
  enabled: false
redis:
  enabled: false
external:
  natsUrl: "nats://nats-cluster.infra:4222"
  redisUrl: "rediss://redis-cluster.infra:6379"

# Scale control plane
gateway:
  replicaCount: 2
  env:
    apiRateLimitRps: 200
    apiRateLimitBurst: 400
    userAuthEnabled: true
  resources:
    limits:
      cpu: 2000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 256Mi

scheduler:
  replicaCount: 2
  resources:
    limits:
      cpu: 2000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 256Mi

safetyKernel:
  replicaCount: 2
  resources:
    limits:
      cpu: 1000m
      memory: 512Mi

workflowEngine:
  replicaCount: 1
  resources:
    limits:
      cpu: 2000m
      memory: 1Gi

# Enable ingress with TLS
ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  tls:
    - secretName: cordum-tls
      hosts:
        - cordum.example.com
        - api.cordum.example.com
  api:
    host: api.cordum.example.com
  dashboard:
    host: cordum.example.com
```
Deploy with overrides:
```shell
helm install cordum ./cordum-helm \
  --namespace cordum --create-namespace \
  -f values-production.yaml \
  --set secrets.apiKey="$(openssl rand -hex 32)" \
  --set secrets.adminPassword="$(openssl rand -base64 24)"
```
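After installing, you can confirm which overrides the release actually carries. A sketch; note that `helm get values` prints user-supplied values verbatim, including any `--set` secrets, so treat the output as sensitive:

```shell
# Show the user-supplied values for the release
helm get values cordum -n cordum

# Show the full computed values (chart defaults merged with overrides)
helm get values cordum -n cordum --all
```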
Upgrade
```shell
helm upgrade cordum ./cordum-helm \
  --namespace cordum \
  -f values-production.yaml \
  --set global.image.tag="v0.2.0"
```
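If an upgrade misbehaves, Helm keeps release history, so the whole release can be rolled back in one step rather than undoing each Deployment individually:

```shell
# List release revisions
helm history cordum -n cordum

# Roll back to the previous revision (or pass an explicit revision number)
helm rollback cordum -n cordum
```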
Helm vs Kustomize
| Feature | Helm (cordum-helm/) | Kustomize (deploy/k8s/production/) |
|---|---|---|
| Bundled NATS/Redis | Toggle with `nats.enabled` / `redis.enabled` | Separate StatefulSets in overlay |
| External services | `external.natsUrl` / `external.redisUrl` | Manual env patch |
| TLS/mTLS | Not built-in (use external) | Full TLS overlay with patches |
| Redis Cluster | Not supported (single instance) | 6-node cluster with init job |
| NATS Cluster | Not supported (single instance) | 3-node StatefulSet |
| Network Policies | Not included | Full ingress + egress policies |
| Monitoring | Not included | ServiceMonitors + PrometheusRules |
| HA (PDB/HPA) | Manual replica count | PDBs + HPAs included |
| Best for | Quick starts, managed infrastructure | Full production with self-hosted infra |
For production with self-hosted NATS/Redis, the Kustomize overlay in `deploy/k8s/production/` is recommended. Use Helm when relying on external managed services (Amazon ElastiCache, Amazon MQ, etc.).
Related Docs
- production.md — Production readiness guide with DR, runbooks, and scaling
- production-gate.md — Automated production gate script
- configuration.md — Config files and environment variables
- configuration-reference.md — Full config schema reference
- DOCKER.md — Docker Compose development setup
- helm.md — Helm chart deployment