
Kubernetes Deployment Guide

Detailed guide for deploying Cordum on Kubernetes — from dev-mode single-apply to production-hardened overlays with TLS, clustering, monitoring, and backups.

For Docker Compose local development, see DOCKER.md. For the production readiness checklist, see production.md. For Helm chart deployment, see helm.md.


Directory Structure

deploy/k8s/
├── base.yaml                 # All-in-one dev manifest
├── ingress.yaml              # Dev ingress (no TLS)
├── local-patches.yaml        # Local dev overrides
├── local/
│   └── kustomization.yaml    # Local kustomize overlay
├── production/
│   ├── kustomization.yaml    # Production overlay entry point
│   ├── nats.yaml             # NATS 3-node StatefulSet + TLS cluster
│   ├── redis.yaml            # Redis 6-node cluster + TLS + exporter
│   ├── ingress.yaml          # Ingress with TLS termination
│   ├── ha.yaml               # PDBs + HPAs
│   ├── monitoring.yaml       # ServiceMonitors + PrometheusRules
│   ├── networkpolicy.yaml    # Ingress/egress network policies
│   ├── backup.yaml           # CronJob backups (Redis RDB + NATS snapshots)
│   └── patches/
│       ├── delete-nats-deployment.yaml   # Remove dev NATS Deployment
│       ├── delete-redis-deployment.yaml  # Remove dev Redis Deployment
│       ├── tls-env.yaml                  # Inject TLS env vars + volume mounts
│       ├── service-labels.yaml           # Add app labels to Services
│       └── pod-anti-affinity.yaml        # Spread pods across nodes
└── README.md

1. Base Manifests (Dev/Staging)

deploy/k8s/base.yaml is a single-file manifest containing all resources for a minimal deployment. Apply it directly for dev or staging:

kubectl apply -f deploy/k8s/base.yaml

Resource Inventory

| Resource | Kind | Ports | Probe |
|---|---|---|---|
| nats | Deployment (1 replica) | 4222 (client) | TCP :4222 |
| redis | Deployment (1 replica) | 6379 | TCP :6379 |
| cordum-context-engine | Deployment (1 replica) | 50070 (gRPC) | gRPC :50070 |
| cordum-safety-kernel | Deployment (1 replica) | 50051 (gRPC) | gRPC :50051 |
| cordum-scheduler | Deployment (1 replica) | 9090 (metrics) | HTTP /metrics :9090 |
| cordum-api-gateway | Deployment (1 replica) | 8080 (gRPC), 8081 (HTTP), 9092 (metrics) | HTTP /health :8081 |
| cordum-workflow-engine | Deployment (1 replica) | 9093 (HTTP) | HTTP /health :9093 |
| cordum-dashboard | Deployment (1 replica) | 8080 (HTTP) | HTTP / :8080 |

ConfigMaps

| ConfigMap | Mount Path | Description |
|---|---|---|
| cordum-pools | /etc/cordum/pools.yaml | Topic-to-pool routing |
| cordum-timeouts | /etc/cordum/timeouts.yaml | Dispatch/running timeouts, reconciler interval |
| cordum-nats-config | /etc/nats/nats.conf | NATS server config (JetStream sync) |
| cordum-safety | /etc/cordum/safety.yaml | Safety kernel policy |

Secrets

| Secret | Keys | Used By |
|---|---|---|
| cordum-api-key | API_KEY | Gateway, Dashboard |
| cordum-admin-creds | CORDUM_ADMIN_USERNAME, CORDUM_ADMIN_PASSWORD, CORDUM_ADMIN_EMAIL | Gateway (optional) |

Important: Set API_KEY to a strong value before applying:

kubectl create secret generic cordum-api-key \
--namespace cordum \
--from-literal=API_KEY="$(openssl rand -hex 32)"

Security Defaults

All Cordum service pods run with hardened security contexts:

securityContext:
  runAsNonRoot: true
  runAsUser: 65532
  runAsGroup: 65532
  fsGroup: 65532
  seccompProfile:
    type: RuntimeDefault
containers:
  - securityContext:
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false

ServiceAccount

All pods use a dedicated cordum ServiceAccount with automountServiceAccountToken: false. No Cordum service needs Kubernetes API access, so the API token is not mounted into pods. This follows the principle of least privilege — compromised containers cannot access the K8s API or read cluster secrets.

If a future service needs K8s API access (e.g., for leader election), create a separate ServiceAccount with a scoped Role/RoleBinding.

Resource Quotas

The base manifest includes a ResourceQuota and LimitRange for the cordum namespace:

| Quota | Value | Notes |
|---|---|---|
| requests.cpu | 8 | Total CPU requests across all pods |
| limits.cpu | 16 | Total CPU limits |
| requests.memory | 8Gi | Total memory requests |
| limits.memory | 16Gi | Total memory limits |
| pods | 50 | Max pod count (accommodates HPA max replicas + headroom) |
| services | 20 | Max Service count |
| persistentvolumeclaims | 10 | Max PVC count |

The LimitRange assigns default resource requests (100m CPU, 128Mi memory) and limits (500m CPU, 256Mi memory) to containers that don't specify them.
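The defaults above correspond to a LimitRange along these lines (a sketch reconstructed from the stated values; the object name is an assumption — base.yaml is authoritative):

```yaml
# Sketch of the LimitRange described above. The metadata.name is assumed.
apiVersion: v1
kind: LimitRange
metadata:
  name: cordum-limits
  namespace: cordum
spec:
  limits:
    - type: Container
      defaultRequest:   # applied when a container declares no requests
        cpu: 100m
        memory: 128Mi
      default:          # applied when a container declares no limits
        cpu: 500m
        memory: 256Mi
```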

Adjusting for larger deployments: Increase the ResourceQuota values in base.yaml or apply a kustomize patch in your production overlay:

# In production/kustomization.yaml patches:
- target:
kind: ResourceQuota
name: cordum-quota
patch: |
- op: replace
path: /spec/hard/pods
value: "100"
- op: replace
path: /spec/hard/limits.cpu
value: "32"

Resource Requests/Limits

| Service | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| NATS | 100m | 500m | 128Mi | 512Mi |
| Redis | 100m | 500m | 256Mi | 512Mi |
| Context Engine | 100m | 500m | 128Mi | 512Mi |
| Safety Kernel | 100m | 500m | 128Mi | 512Mi |
| Scheduler | 150m | 750m | 256Mi | 768Mi |
| API Gateway | 200m | 1000m | 256Mi | 1Gi |
| Workflow Engine | 150m | 750m | 256Mi | 768Mi |
| Dashboard | 100m | 500m | 128Mi | 512Mi |

Dev Ingress

For local dev with an Ingress controller:

kubectl apply -f deploy/k8s/ingress.yaml

Routes:

  • /api/v1/* and /health → cordum-api-gateway:8081
  • / → cordum-dashboard:8080
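The routing above amounts to an Ingress shaped roughly like this (a sketch only — deploy/k8s/ingress.yaml is authoritative, and the ingressClassName and pathType values here are assumptions):

```yaml
# Sketch of the dev ingress routes listed above; not the verbatim manifest.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: cordum
  namespace: cordum
spec:
  ingressClassName: nginx   # assumption
  rules:
    - http:
        paths:
          - path: /api/v1
            pathType: Prefix
            backend:
              service: { name: cordum-api-gateway, port: { number: 8081 } }
          - path: /health
            pathType: Prefix
            backend:
              service: { name: cordum-api-gateway, port: { number: 8081 } }
          - path: /
            pathType: Prefix
            backend:
              service: { name: cordum-dashboard, port: { number: 8080 } }
```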

2. Production Overlay

The production overlay replaces dev-mode NATS/Redis with clustered StatefulSets, adds TLS everywhere, enables HA (PDBs + HPAs), monitoring, network policies, and automated backups.

Apply

kubectl apply -k deploy/k8s/production

What the Overlay Does

The production/kustomization.yaml composes:

Resources added:

  • nats.yaml — 3-node NATS cluster (StatefulSet) with mTLS and JetStream persistence
  • redis.yaml — 6-node Redis cluster (StatefulSet) with TLS, exporter sidecar, and init job
  • ingress.yaml — TLS-terminated Ingress
  • ha.yaml — PodDisruptionBudgets + HorizontalPodAutoscalers
  • monitoring.yaml — ServiceMonitors + PrometheusRules (requires Prometheus Operator)
  • networkpolicy.yaml — Ingress and egress network policies
  • backup.yaml — Hourly CronJob backups for Redis and NATS

Patches applied:

  • delete-nats-deployment.yaml — Removes dev single-node NATS Deployment (replaced by StatefulSet)
  • delete-redis-deployment.yaml — Removes dev single-node Redis Deployment (replaced by StatefulSet)
  • tls-env.yaml — Injects TLS env vars and client cert volume mounts into all services
  • service-labels.yaml — Adds app labels to Services (required for ServiceMonitor selectors)
  • pod-anti-affinity.yaml — Spreads all pods across nodes via preferredDuringSchedulingIgnoredDuringExecution

Image tags:

images:
  - name: cordum-context-engine
    newTag: v0.9.7
  - name: cordum-safety-kernel
    newTag: v0.9.7
  - name: cordum-scheduler
    newTag: v0.9.7
  - name: cordum-api-gateway
    newTag: v0.9.7
  - name: cordum-workflow-engine
    newTag: v0.9.7
  - name: cordum-dashboard
    newTag: v0.9.7

Replica overrides:

replicas:
  - name: cordum-api-gateway
    count: 2
  - name: cordum-safety-kernel
    count: 2
  - name: cordum-scheduler
    count: 2

3. Secrets Management

Required Secrets

| Secret | Keys | Purpose |
|---|---|---|
| cordum-api-key | API_KEY | API authentication |
| cordum-admin-creds | CORDUM_ADMIN_USERNAME, CORDUM_ADMIN_PASSWORD, CORDUM_ADMIN_EMAIL | User auth (optional) |
| cordum-nats-server-tls | tls.crt, tls.key, ca.crt | NATS server TLS |
| cordum-redis-server-tls | tls.crt, tls.key, ca.crt | Redis server TLS |
| cordum-client-tls | tls.crt, tls.key, ca.crt | Client certs for services connecting to NATS/Redis |
| cordum-ingress-tls | tls.crt, tls.key | Ingress TLS termination |

Creating TLS Secrets

Generate a CA and service certificates (example using openssl):

# Generate CA
openssl genrsa -out ca.key 4096
openssl req -new -x509 -key ca.key -sha256 -days 3650 \
-subj "/CN=Cordum Internal CA" -out ca.crt

# Generate NATS server cert
openssl genrsa -out nats.key 2048
openssl req -new -key nats.key -subj "/CN=nats" \
-addext "subjectAltName=DNS:nats,DNS:cordum-nats,DNS:cordum-nats-0.cordum-nats.cordum.svc,DNS:cordum-nats-1.cordum-nats.cordum.svc,DNS:cordum-nats-2.cordum-nats.cordum.svc,DNS:*.cordum-nats.cordum.svc" \
-out nats.csr
openssl x509 -req -in nats.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
-days 365 -sha256 -copy_extensions copyall -out nats.crt

# Generate Redis server cert
openssl genrsa -out redis.key 2048
openssl req -new -key redis.key -subj "/CN=redis" \
-addext "subjectAltName=DNS:redis,DNS:cordum-redis,DNS:*.cordum-redis.cordum.svc" \
-out redis.csr
openssl x509 -req -in redis.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
-days 365 -sha256 -copy_extensions copyall -out redis.crt

# Generate client cert (used by all Cordum services)
openssl genrsa -out client.key 2048
openssl req -new -key client.key -subj "/CN=cordum-client" -out client.csr
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
-days 365 -sha256 -out client.crt

# Create Kubernetes secrets
kubectl create secret generic cordum-nats-server-tls --namespace cordum \
--from-file=tls.crt=nats.crt --from-file=tls.key=nats.key --from-file=ca.crt=ca.crt

kubectl create secret generic cordum-redis-server-tls --namespace cordum \
--from-file=tls.crt=redis.crt --from-file=tls.key=redis.key --from-file=ca.crt=ca.crt

kubectl create secret generic cordum-client-tls --namespace cordum \
--from-file=tls.crt=client.crt --from-file=tls.key=client.key --from-file=ca.crt=ca.crt

For cert-manager automation, create Certificate resources referencing a ClusterIssuer:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cordum-nats-server-tls
  namespace: cordum
spec:
  secretName: cordum-nats-server-tls
  issuerRef:
    name: cordum-ca-issuer
    kind: ClusterIssuer
  commonName: nats
  dnsNames:
    - nats
    - cordum-nats
    - "*.cordum-nats.cordum.svc"
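The cordum-ca-issuer referenced above can be backed by the CA generated earlier. A minimal sketch, assuming the ca.crt/ca.key pair is stored in a secret named cordum-ca-keypair (the secret name is an assumption, and for a ClusterIssuer the secret must live in cert-manager's cluster resource namespace, cert-manager by default):

```yaml
# Sketch of a CA-backed ClusterIssuer for cordum-ca-issuer.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: cordum-ca-issuer
spec:
  ca:
    secretName: cordum-ca-keypair   # assumed name; contains tls.crt/tls.key of the CA
```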

Certificate Rotation

  1. Regenerate certs (or let cert-manager auto-renew)
  2. Update the Secret: kubectl create secret generic ... --dry-run=client -o yaml | kubectl apply -f -
  3. Restart affected pods: kubectl rollout restart deployment -n cordum

4. NATS Clustering

The production overlay deploys a 3-node NATS cluster as a StatefulSet.

Configuration

  • Replicas: 3 (cordum-nats-0, cordum-nats-1, cordum-nats-2)
  • Headless Service: cordum-nats (DNS-based peer discovery)
  • JetStream: Enabled with 20Gi persistent volume per node
  • TLS: Full mTLS between clients and cluster peers
  • Cluster routes: Hardcoded in ConfigMap for deterministic discovery
  • Monitoring: Port 8222 exposed for /healthz probes and metrics
  • Anti-affinity: Pods spread across nodes via preferredDuringSchedulingIgnoredDuringExecution

NATS ConfigMap (production)

port: 4222
http: 8222

jetstream {
  store_dir: /data/jetstream
  sync_interval: "1s"
}

tls {
  cert_file: /etc/nats/tls/tls.crt
  key_file: /etc/nats/tls/tls.key
  ca_file: /etc/nats/tls/ca.crt
  verify: true
}

cluster {
  name: cordum
  port: 6222
  routes = [
    nats://cordum-nats-0.cordum-nats.cordum.svc:6222
    nats://cordum-nats-1.cordum-nats.cordum.svc:6222
    nats://cordum-nats-2.cordum-nats.cordum.svc:6222
  ]
  tls { ... }
}

JetStream Durability

  • sync_interval: "1s" — fsync every second (trade-off: lower = more durable, slower)
  • Streams are replicated across all 3 nodes (NATS_JS_REPLICAS=3 in the TLS patch)
  • PVCs: 20Gi ReadWriteOnce per node

Tuning

| Parameter | Default | Notes |
|---|---|---|
| sync_interval | 1s | Lower for stricter durability, higher for throughput |
| NATS_JS_REPLICAS | 3 | Must not exceed cluster size |
| Storage per node | 20Gi | Increase for high-volume deployments |

5. Redis Clustering

The production overlay deploys a 6-node Redis cluster (3 masters + 3 replicas).

Configuration

  • Replicas: 6 (StatefulSet cordum-redis)
  • Headless Service: cordum-redis (DNS-based discovery)
  • TLS: Full TLS with tls-auth-clients yes
  • Persistence: AOF enabled (appendonly yes), 20Gi PVC per node
  • Cluster mode: cluster-enabled yes, cluster-node-timeout 5000
  • Exporter sidecar: oliver006/redis_exporter:v1.58.0 on port 9121

Cluster Init Job

After all 6 Redis pods are running, the cordum-redis-cluster-init Job must complete once:

# Check pod readiness
kubectl get pods -n cordum -l app=redis

# The init job runs automatically — check its status
kubectl get job cordum-redis-cluster-init -n cordum
kubectl logs job/cordum-redis-cluster-init -n cordum

The init job:

  1. Waits for all 6 nodes to respond to PING
  2. Runs redis-cli --cluster create ... --cluster-replicas 1 --cluster-yes
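The second step amounts to a container command along these lines (an illustrative sketch only — the real Job in redis.yaml also wires in TLS flags and the readiness wait; node addresses follow the headless-service DNS pattern used elsewhere in this guide):

```yaml
# Sketch of the cluster-init Job's container command (not verbatim).
command: ["redis-cli"]
args:
  - --cluster
  - create
  - cordum-redis-0.cordum-redis.cordum.svc:6379
  - cordum-redis-1.cordum-redis.cordum.svc:6379
  - cordum-redis-2.cordum-redis.cordum.svc:6379
  - cordum-redis-3.cordum-redis.cordum.svc:6379
  - cordum-redis-4.cordum-redis.cordum.svc:6379
  - cordum-redis-5.cordum-redis.cordum.svc:6379
  - --cluster-replicas
  - "1"              # one replica per master: 3 masters + 3 replicas
  - --cluster-yes    # skip the interactive confirmation
```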

Pre-Flight Checks

Before running or re-running the init job, verify:

# All 6 pods must be Running and Ready
kubectl get pods -l app=redis -n cordum
# Expected: 6/6 pods in Running state with READY 2/2 (redis + exporter)

# TLS secret must exist
kubectl get secret cordum-client-tls -n cordum
kubectl get secret cordum-redis-server-tls -n cordum

# Password secret must be non-empty
kubectl get secret cordum-redis-secret -n cordum -o jsonpath='{.data.REDIS_PASSWORD}' | base64 -d | wc -c
# Expected: non-zero length

Troubleshooting Init Failures

| Symptom | Cause | Fix |
|---|---|---|
| Job stuck in Pending | Missing TLS secret | Create cordum-client-tls secret |
| Could not connect to Redis | Pods not ready | Wait for all 6 pods, check startup probes |
| ERR Invalid node address | DNS not resolving | Verify headless service cordum-redis exists |
| ERR Nodes don't agree about configuration | Partial previous init | Delete all pods (kubectl delete pods -l app=redis -n cordum), wait for restart, re-run |

Re-running: Delete the job and re-create it if you need to re-initialize:

kubectl delete job cordum-redis-cluster-init -n cordum
kubectl apply -k deploy/k8s/production

For a complete key inventory, DR runbooks, and base-to-production migration, see Redis Operations Guide.

Client Connection

All Cordum services connect to the Redis cluster via:

REDIS_CLUSTER_ADDRESSES=cordum-redis-0.cordum-redis.cordum.svc:6379,...,cordum-redis-5.cordum-redis.cordum.svc:6379
REDIS_URL=rediss://redis:6379
REDIS_TLS_CA=/etc/cordum/tls/client/ca.crt
REDIS_TLS_CERT=/etc/cordum/tls/client/tls.crt
REDIS_TLS_KEY=/etc/cordum/tls/client/tls.key

6. Network Policies

The production overlay defines strict ingress and egress rules per service.

Ingress Rules

| Target | Allowed Sources | Ports |
|---|---|---|
| NATS | All Cordum services + NATS peers | 4222 (client), 6222 (cluster), 8222 (monitor) |
| Redis | All Cordum services + Redis peers | 6379 (client), 16379 (cluster bus) |

Egress Rules

| Source | Allowed Destinations | Ports |
|---|---|---|
| API Gateway | NATS, Redis, Safety Kernel, DNS | 4222, 6379, 50051, 53 |
| Scheduler | NATS, Redis, Safety Kernel, DNS | 4222, 6379, 50051, 53 |
| Workflow Engine | NATS, Redis, DNS | 4222, 6379, 53 |
| Safety Kernel | NATS, Redis, DNS | 4222, 6379, 53 |
| Dashboard | API Gateway, DNS | 8081, 53 |
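As a concrete illustration, the Dashboard row above could be expressed as the following policy (a sketch — the pod label selectors are assumptions; networkpolicy.yaml is authoritative):

```yaml
# Sketch of the dashboard egress rule from the table above.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cordum-dashboard-egress
  namespace: cordum
spec:
  podSelector:
    matchLabels:
      app: cordum-dashboard      # label value is an assumption
  policyTypes: [Egress]
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: cordum-api-gateway
      ports:
        - port: 8081
          protocol: TCP
    - ports:                     # DNS to cluster resolvers
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
```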

Traffic Flow Diagram

Internet ──► Ingress ──► API Gateway (:8081 HTTP, :8080 gRPC, :9092 metrics) and Dashboard (:8080)
Dashboard ──► API Gateway (:8081)
API Gateway ──► NATS (:4222), Redis (:6379), Safety Kernel (:50051)
Safety Kernel ──► NATS, Redis
Scheduler (:9090) ──► NATS, Redis
Workflow Engine (:9093) ──► NATS, Redis
Context Engine (:50070) ──► Redis

NATS peers additionally talk on :6222 (cluster) and Redis peers on :16379 (cluster bus).

7. Ingress Configuration

Dev Ingress (no TLS)

kubectl apply -f deploy/k8s/ingress.yaml

Routes /api/v1/* and /health to the gateway, / to the dashboard. No TLS.

Production Ingress (TLS)

The production overlay includes production/ingress.yaml with:

annotations:
  nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
  nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.2 TLSv1.3"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "3600" # WebSocket support
  nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  tls:
    - hosts:
        - cordum.example.com
      secretName: cordum-ingress-tls

Before applying, update the hostname from cordum.example.com to your actual domain.

Create the Ingress TLS secret:

kubectl create secret tls cordum-ingress-tls --namespace cordum \
--cert=cordum-ingress.crt --key=cordum-ingress.key

Annotations for Other Ingress Controllers

Traefik:

annotations:
  traefik.ingress.kubernetes.io/router.tls: "true"
  traefik.ingress.kubernetes.io/router.entrypoints: websecure

Istio (VirtualService): Use an Istio Gateway + VirtualService instead of the Ingress resource.


8. Monitoring

The production overlay deploys Prometheus Operator custom resources (ServiceMonitors and PrometheusRules).

Prerequisite: Install the Prometheus Operator or kube-prometheus-stack Helm chart.

ServiceMonitors

| Target | Port | Path | Interval |
|---|---|---|---|
| cordum-api-gateway | metrics (9092) | /metrics | 30s |
| cordum-scheduler | metrics (9090) | /metrics | 30s |
| cordum-nats-monitor | monitor (8222) | /metrics | 30s |
| cordum-redis (exporter) | metrics (9121) | /metrics | 30s |
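Each row corresponds to a ServiceMonitor along these lines (a sketch for the gateway row — the label selector is an assumption; monitoring.yaml is authoritative):

```yaml
# Sketch of the gateway ServiceMonitor from the table above.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cordum-api-gateway
  namespace: cordum
spec:
  selector:
    matchLabels:
      app: cordum-api-gateway   # requires the service-labels.yaml patch
  endpoints:
    - port: metrics             # named Service port (9092)
      path: /metrics
      interval: 30s
```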

Alert Rules (PrometheusRule)

| Alert | Expression | Severity | For |
|---|---|---|---|
| CordumGatewayDown | sum(up{service="cordum-api-gateway"}) == 0 | critical | 5m |
| CordumSchedulerDown | sum(up{service="cordum-scheduler"}) == 0 | critical | 5m |
| CordumNATSDown | sum(up{service="cordum-nats-monitor"}) == 0 | critical | 5m |
| CordumRedisDown | sum(up{service="cordum-redis"}) == 0 | critical | 5m |
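In PrometheusRule form, the first alert looks roughly like this (a sketch — the rule group name is an assumption):

```yaml
# Sketch of one alert from the table above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cordum-alerts
  namespace: cordum
spec:
  groups:
    - name: cordum.availability   # group name assumed
      rules:
        - alert: CordumGatewayDown
          expr: sum(up{service="cordum-api-gateway"}) == 0
          for: 5m
          labels:
            severity: critical
```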

Key Metrics

| Metric | Source | Description |
|---|---|---|
| cordum_jobs_dispatched_total | Scheduler | Jobs dispatched by pool/type |
| cordum_jobs_duration_seconds | Scheduler | Job latency histogram |
| cordum_safety_evaluations_total | Gateway | Policy evaluations by decision |
| cordum_http_requests_total | Gateway | HTTP requests by method/path/status |

9. Scaling

HorizontalPodAutoscalers

| Service | Min | Max | CPU Target | Memory Target |
|---|---|---|---|---|
| cordum-api-gateway | 2 | 10 | 70% | 80% |
| cordum-scheduler | 2 | 10 | 70% | 80% |
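The gateway row translates to an autoscaling/v2 object roughly like the following (a sketch; ha.yaml is authoritative):

```yaml
# Sketch of the gateway HPA matching the table above.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cordum-api-gateway
  namespace: cordum
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cordum-api-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
    - type: Resource
      resource:
        name: memory
        target: { type: Utilization, averageUtilization: 80 }
```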

Scaling Recommendations

| Service | Scale When | Notes |
|---|---|---|
| API Gateway | High HTTP request volume | Stateless — scales freely |
| Safety Kernel | High policy evaluation load | Stateless — scales freely |
| Scheduler | Large job backlogs | Stateful leader election — multiple replicas use Redis locking |
| Workflow Engine | Many concurrent workflow runs | Single instance recommended unless using Redis-based coordination |
| Context Engine | High context read/write volume | Stateless — scales freely |
| Dashboard | High user traffic | Static assets — scales freely |
| NATS | N/A | Fixed 3-node cluster; increase storage, not replicas |
| Redis | N/A | Fixed 6-node cluster; increase storage or reshard |

PodDisruptionBudgets

All critical services have PDBs with maxUnavailable: 1:

  • cordum-api-gateway
  • cordum-scheduler
  • cordum-workflow-engine
  • cordum-safety-kernel
  • cordum-dashboard

Infrastructure StatefulSets use minAvailable to preserve quorum and data availability:

| StatefulSet | minAvailable | Rationale |
|---|---|---|
| NATS (3 nodes) | 2 | Maintains Raft quorum during node drains — losing 2 of 3 nodes would break consensus |
| Redis (6 nodes: 3 primary + 3 replica) | 4 | Ensures at least 2 primary + 2 replica survive, maintaining data availability during rolling upgrades |
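The NATS row corresponds to a PDB shaped like this (a sketch — the selector label is an assumption; ha.yaml is authoritative):

```yaml
# Sketch of the NATS PodDisruptionBudget from the table above.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cordum-nats
  namespace: cordum
spec:
  minAvailable: 2        # preserves Raft quorum on a 3-node cluster
  selector:
    matchLabels:
      app: nats          # label value is an assumption
```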

See Horizontal Scaling Guide for NATS delivery semantics and multi-replica considerations.


10. Backups

The production overlay includes hourly CronJob backups for both Redis and NATS.

Redis Backup

  • Schedule: 0 * * * * (hourly)
  • Method: redis-cli --rdb creates an RDB snapshot from cordum-redis-0
  • Storage: cordum-backups PVC (20Gi)
  • Retention: Last 2 successful + 2 failed jobs kept
  • File format: redis-YYYYMMDDTHHMMSSZ.rdb
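The timestamp in that filename can be reproduced with date; a sketch assuming the CronJob derives it from UTC time (the actual backup script is not shown in this guide):

```shell
# Reproduce the redis-YYYYMMDDTHHMMSSZ.rdb naming scheme locally.
# Assumption: the CronJob formats the current UTC time like this.
ts="$(date -u +%Y%m%dT%H%M%SZ)"
echo "redis-${ts}.rdb"
```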

NATS JetStream Backup

  • Schedule: 0 * * * * (hourly)
  • Method: nats stream snapshot for CORDUM_SYS and CORDUM_JOBS streams
  • Storage: Same cordum-backups PVC
  • File format: nats-YYYYMMDDTHHMMSSZ/{stream}.snapshot

Restore Procedures

Redis:

# Stop the cluster, copy RDB to data dir, restart
kubectl cp backup/redis-20260213T120000Z.rdb cordum/cordum-redis-0:/data/dump.rdb
kubectl delete pod cordum-redis-0 -n cordum # Pod restarts and loads RDB

NATS:

# Restore a stream snapshot
kubectl exec -n cordum cordum-nats-0 -- nats stream restore CORDUM_JOBS /backup/nats-.../CORDUM_JOBS.snapshot

11. Upgrade Procedures

Rolling Updates

All Cordum Deployments use the default RollingUpdate strategy. To upgrade:

  1. Update image tags in production/kustomization.yaml
  2. Re-apply: kubectl apply -k deploy/k8s/production
  3. Watch rollout: kubectl rollout status deployment/cordum-api-gateway -n cordum

Pre-Upgrade Checklist

  1. Back up Redis and NATS (or verify CronJob backups are recent)
  2. Review changelog for breaking changes
  3. Check that PDBs will allow the rolling update given current replica counts
  4. If config schema changed, update ConfigMaps before rolling deployments

Rollback

kubectl rollout undo deployment/cordum-api-gateway -n cordum
kubectl rollout undo deployment/cordum-scheduler -n cordum
kubectl rollout undo deployment/cordum-safety-kernel -n cordum
kubectl rollout undo deployment/cordum-workflow-engine -n cordum
kubectl rollout undo deployment/cordum-context-engine -n cordum
kubectl rollout undo deployment/cordum-dashboard -n cordum

12. Troubleshooting

Common Issues

| Symptom | Likely Cause | Fix |
|---|---|---|
| ImagePullBackOff | Missing image or wrong tag | Check kustomization.yaml image tags and registry access |
| CrashLoopBackOff on gateway | Missing API_KEY secret | Create cordum-api-key secret |
| CrashLoopBackOff on scheduler | Can't reach NATS/Redis | Check service DNS, network policies, TLS certs |
| Redis cluster init job stuck | Pods not ready yet | Wait for all 6 Redis pods, then delete/recreate the job |
| NATS connection refused | TLS mismatch | Verify SANs on certs match service DNS names |
| OOMKilled | Memory limit too low | Increase resources.limits.memory in the Deployment |
| ServiceMonitor not scraped | Missing service labels | Verify service-labels.yaml patch was applied |
| Ingress 502 errors | Gateway not ready | Check readiness probe, verify gateway pod is running |

Useful Commands

# Check all pods
kubectl get pods -n cordum -o wide

# Check events (recent errors)
kubectl get events -n cordum --sort-by=.lastTimestamp | tail -20

# Logs for a specific service
kubectl logs -n cordum deployment/cordum-api-gateway --tail=100

# Check Redis cluster status
kubectl exec -n cordum cordum-redis-0 -- redis-cli --tls \
--cacert /etc/redis/tls/ca.crt --cert /etc/redis/tls/tls.crt \
--key /etc/redis/tls/tls.key cluster info

# Check NATS JetStream streams
kubectl exec -n cordum cordum-nats-0 -- nats --tlscacert /etc/nats/tls/ca.crt \
--tlscert /etc/nats/tls/tls.crt --tlskey /etc/nats/tls/tls.key stream ls

# Check network policies
kubectl get networkpolicy -n cordum

# Force config reload (delete cached config in Redis)
kubectl exec -n cordum deployment/cordum-api-gateway -- \
redis-cli -h redis -p 6379 DEL cfg:system:default

13. Helm Charts

The cordum-helm/ directory provides an alternative deployment method using Helm v3.

Chart Structure

cordum-helm/
├── Chart.yaml                        # Chart metadata (v0.9.7)
├── values.yaml                       # Default values
├── README.md                         # Chart documentation
└── templates/
    ├── _helpers.tpl                  # Template helpers
    ├── configmap.yaml                # Pools, timeouts, safety config
    ├── configmap-nats.yaml           # NATS server configuration
    ├── deployment-control-plane.yaml # All control plane services
    ├── deployment-dashboard.yaml     # Dashboard deployment
    ├── ingress.yaml                  # Ingress resource
    ├── secret.yaml                   # API key + admin password
    ├── service.yaml                  # Service definitions
    └── serviceaccount.yaml           # ServiceAccount

Key values.yaml Overrides

| Key | Default | Description |
|---|---|---|
| global.image.repository | ghcr.io/cordum-io/cordum/control-plane | Base image for control plane services |
| global.image.tag | v0.9.7 | Image tag for all services |
| secrets.apiKey | "" | API key (required) |
| secrets.adminPassword | "" | Admin password for user auth |
| nats.enabled | true | Deploy bundled NATS |
| nats.persistence.enabled | false | Enable JetStream persistence |
| redis.enabled | true | Deploy bundled Redis |
| redis.persistence.enabled | false | Enable Redis persistence |
| external.natsUrl | "" | Use external NATS (disables bundled) |
| external.redisUrl | "" | Use external Redis (disables bundled) |
| external.safetyKernelAddr | "" | Use external safety kernel |
| gateway.replicaCount | 1 | Gateway replicas |
| gateway.env.apiRateLimitRps | 50 | API rate limit |
| gateway.env.userAuthEnabled | false | Enable user/password auth |
| scheduler.replicaCount | 1 | Scheduler replicas |
| ingress.enabled | false | Create Ingress resource |
| ingress.className | "" | Ingress class (nginx, traefik) |
| ingress.tls | [] | TLS configuration |

Install

# Basic install
helm install cordum ./cordum-helm \
--namespace cordum --create-namespace \
--set secrets.apiKey="$(openssl rand -hex 32)"

# With persistence and ingress
helm install cordum ./cordum-helm \
--namespace cordum --create-namespace \
--set secrets.apiKey="$(openssl rand -hex 32)" \
--set nats.persistence.enabled=true \
--set redis.persistence.enabled=true \
--set ingress.enabled=true \
--set ingress.className=nginx

Production Values Override

Create a values-production.yaml file:

# values-production.yaml
global:
  image:
    tag: "v0.9.7"

secrets:
  apiKey: "" # Set via --set or external secret

# Use external managed services
nats:
  enabled: false
redis:
  enabled: false
external:
  natsUrl: "nats://nats-cluster.infra:4222"
  redisUrl: "rediss://redis-cluster.infra:6379"

# Scale control plane
gateway:
  replicaCount: 2
  env:
    apiRateLimitRps: 200
    apiRateLimitBurst: 400
    userAuthEnabled: true
  resources:
    limits:
      cpu: 2000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 256Mi

scheduler:
  replicaCount: 2
  resources:
    limits:
      cpu: 2000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 256Mi

safetyKernel:
  replicaCount: 2
  resources:
    limits:
      cpu: 1000m
      memory: 512Mi

workflowEngine:
  replicaCount: 1
  resources:
    limits:
      cpu: 2000m
      memory: 1Gi

# Enable ingress with TLS
ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  tls:
    - secretName: cordum-tls
      hosts:
        - cordum.example.com
        - api.cordum.example.com
  api:
    host: api.cordum.example.com
  dashboard:
    host: cordum.example.com

Deploy with overrides:

helm install cordum ./cordum-helm \
--namespace cordum --create-namespace \
-f values-production.yaml \
--set secrets.apiKey="$(openssl rand -hex 32)" \
--set secrets.adminPassword="$(openssl rand -base64 24)"

Upgrade

helm upgrade cordum ./cordum-helm \
--namespace cordum \
-f values-production.yaml \
--set global.image.tag="v0.2.0"

Helm vs Kustomize

| Feature | Helm (cordum-helm/) | Kustomize (deploy/k8s/production/) |
|---|---|---|
| Bundled NATS/Redis | Toggle with nats.enabled / redis.enabled | Separate StatefulSets in overlay |
| External services | external.natsUrl / external.redisUrl | Manual env patch |
| TLS/mTLS | Not built-in (use external) | Full TLS overlay with patches |
| Redis Cluster | Not supported (single instance) | 6-node cluster with init job |
| NATS Cluster | Not supported (single instance) | 3-node StatefulSet |
| Network Policies | Not included | Full ingress + egress policies |
| Monitoring | Not included | ServiceMonitors + PrometheusRules |
| HA (PDB/HPA) | Manual replica count | PDBs + HPAs included |
| Best for | Quick starts, managed infrastructure | Full production with self-hosted infra |

For production with self-hosted NATS/Redis, the kustomize overlay in deploy/k8s/production/ is recommended. Use Helm when relying on external managed services (Amazon ElastiCache, Amazon MQ, etc.).