
Redis Operations Guide

This document covers Redis cluster topology, key inventory, per-service dependency mapping, disaster recovery runbooks, and migration procedures.

1. Cluster Topology

Production (deploy/k8s/production/)

  • StatefulSet: cordum-redis, 6 pods (cordum-redis-0 through cordum-redis-5)
  • Layout: 3 primaries + 3 replicas (--cluster-replicas 1)
  • Hash slots: 16384 total, distributed evenly across 3 primaries (~5461 each)
  • Node timeout: 5000ms — a node unresponsive for 5s is marked as potentially failed
  • Persistence: AOF enabled (appendonly yes), 20Gi PVC per node
  • TLS: Full mutual TLS (tls-auth-clients yes), plain port disabled (port 0)
  • Exporter: oliver006/redis_exporter:v1.58.0 sidecar on port 9121
  • Anti-affinity: Preferred pod anti-affinity on kubernetes.io/hostname (best-effort spread)
  • PDB: minAvailable: 4 during voluntary disruptions such as node drains (see ha.yaml)
    • Note: The PDB covers all 6 pods. With minAvailable: 4, at most 2 pods can be evicted simultaneously, preserving at least 2 primaries + 2 replicas.
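The PDB described above would look roughly like this. This is a sketch, not the authoritative manifest (that lives in ha.yaml in the production overlay); the label selector is inferred from the `-l app=redis` selectors used by the kubectl commands later in this guide:

```yaml
# Sketch only — see deploy/k8s/production/ha.yaml for the real manifest.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cordum-redis
  namespace: cordum
spec:
  minAvailable: 4          # 6 pods total, so at most 2 evictable at once
  selector:
    matchLabels:
      app: redis           # assumed; matches the -l app=redis selector used elsewhere
```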

Base (deploy/k8s/base.yaml)

  • Deployment: Single-replica redis (not a StatefulSet, no cluster mode)
  • No TLS: Plain redis-server --appendonly yes --requirepass $REDIS_PASSWORD
  • Service: ClusterIP redis:6379
  • No exporter, no cluster init job, no PDB

DNS Names

| Node | FQDN |
|---|---|
| cordum-redis-0 | cordum-redis-0.cordum-redis.cordum.svc.cluster.local |
| cordum-redis-1 | cordum-redis-1.cordum-redis.cordum.svc.cluster.local |
| cordum-redis-2 | cordum-redis-2.cordum-redis.cordum.svc.cluster.local |
| cordum-redis-3 | cordum-redis-3.cordum-redis.cordum.svc.cluster.local |
| cordum-redis-4 | cordum-redis-4.cordum-redis.cordum.svc.cluster.local |
| cordum-redis-5 | cordum-redis-5.cordum-redis.cordum.svc.cluster.local |

2. Key Prefix Inventory

All keys follow the convention <namespace>:<type>:<id>. Each key maps to one of the 16384 hash slots via CRC16(key) mod 16384; if the key contains a hash tag ({...}), only the tag is hashed.
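The slot mapping can be reproduced offline when debugging slot coverage. This sketch implements the standard CRC16-XMODEM variant Redis Cluster uses (the helper names are ours, not from the codebase):

```python
def crc16(data: bytes) -> int:
    """CRC16-XMODEM (polynomial 0x1021), the checksum Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    """Slot for a key: hash only the {hash tag} when a non-empty one is present."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # tag must be non-empty
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```

Note that the key patterns below do not appear to use hash tags, so the several keys belonging to one job (job:state:<jobID>, job:meta:<jobID>, ...) may land on different primaries, and multi-key operations across them are not slot-guaranteed.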

Job Keys (job:*)

| Key Pattern | Type | TTL | Purpose |
|---|---|---|---|
| job:state:<jobID> | String | 7d | Job state (PENDING, SCHEDULED, RUNNING, etc.) |
| job:meta:<jobID> | Hash | 7d | Job metadata (tenant, principal, topic, labels) |
| job:req:<jobID> | String | 7d | Serialized job request (protobuf JSON) |
| job:result_ptr:<jobID> | String | 7d | Pointer to result data in memory fabric |
| job:events:<jobID> | List | 7d | Job state transition events |
| job:decisions:<jobID> | List | 7d | Historical safety decisions (JSON) |
| job:<jobID>:output_decision | String | 7d | Output policy evaluation result |
| job:idempotency:<tenant>:<key> | String | 7d | Tenant-scoped idempotency dedup |
| job:recent | Sorted Set | 7d | Recently updated jobs (score = timestamp) |
| job:deadline | Sorted Set | 7d | Jobs with deadlines (score = deadline time) |
| job:index:<state> | Sorted Set | 7d | Jobs indexed by state |
| job:tenant:active:<tenant> | Set | 7d | Active job IDs per tenant |
| trace:<traceID> | Set | 7d | Job IDs grouped by execution trace |

TTL controlled by JOB_META_TTL_SECONDS (default 604800 = 7 days).

Workflow Keys (wf:*)

| Key Pattern | Type | TTL | Purpose |
|---|---|---|---|
| wf:def:<workflowID> | String | None | Workflow definition (JSON) |
| wf:index:org:<orgID> | Sorted Set | None | Workflow IDs by organization |
| wf:index:all | Sorted Set | None | All workflow IDs |
| wf:run:<runID> | String | None | Workflow run document (JSON) |
| wf:runs:<workflowID> | Sorted Set | None | Run IDs for a workflow |
| wf:runs:all | Sorted Set | None | All run IDs |
| wf:runs:status:<status> | Sorted Set | None | Run IDs by status |
| wf:runs:active:<orgID> | Set | None | Active run IDs per org |
| wf:run:timeline:<runID> | List | None | Timeline events (capped at 1000) |
| wf:run:idempotency:<key> | String | None | Workflow idempotency dedup |

Config Keys (cfg:*)

| Key Pattern | Type | TTL | Purpose |
|---|---|---|---|
| cfg:system:<id> | String | None | System-level config (default ID: default) |
| cfg:org:<orgID> | String | None | Organization config |
| cfg:team:<teamID> | String | None | Team config |
| cfg:workflow:<workflowID> | String | None | Workflow config |
| cfg:step:<stepID> | String | None | Step config |

Write-once: services cache on startup. Delete cfg:system:default to force reload.
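How the five scopes combine is not spelled out here. If resolution is most-specific-wins (an assumption — verify against the actual service code before relying on it), the lookup reduces to something like:

```python
def effective_config(get, *, step_id=None, workflow_id=None,
                     team_id=None, org_id=None):
    """Return the first config found, most specific scope first.

    `get` stands in for a Redis GET. The precedence order shown here
    (step > workflow > team > org > system) is an assumption, not
    something this guide specifies.
    """
    candidates = [
        f"cfg:step:{step_id}" if step_id else None,
        f"cfg:workflow:{workflow_id}" if workflow_id else None,
        f"cfg:team:{team_id}" if team_id else None,
        f"cfg:org:{org_id}" if org_id else None,
        "cfg:system:default",
    ]
    for key in filter(None, candidates):
        value = get(key)
        if value is not None:
            return value
    return None
```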

Auth Keys (auth:*, user:*, session:*)

| Key Pattern | Type | TTL | Purpose |
|---|---|---|---|
| auth:<keyID> | String | None | API key record (bcrypt hash, role, tenant) |
| auth:prefix:<prefix> | Set | None | Key IDs by ck_XXXXXXXX prefix |
| auth:tenant:<tenant> | Set | None | Key IDs per tenant |
| user:<tenant>:<username> | String | None | User record (password hash, role) |
| user:id:<userID> | String | None | User reference (tenant:username) |
| user:email:<tenant>:<email> | String | None | Email → username index |
| user:tenant:<tenant> | Set | None | User IDs per tenant |
| session:<token> | String | 1h | Session token → auth context |
| login:failed:<user>:<ip> | String | 15m | Per-IP failed login counter |
| login:failed:global:<user> | String | 15m | Global failed login counter |

Context / Result Keys (ctx:*, res:*)

| Key Pattern | Type | TTL | Purpose |
|---|---|---|---|
| ctx:<jobID> | String | 24h | Job context/memory window data |
| res:<jobID> | String | 24h | Job result data |

TTL controlled by REDIS_DATA_TTL_SECONDS (default 86400 = 24h).

DLQ Keys (dlq:*)

| Key Pattern | Type | TTL | Purpose |
|---|---|---|---|
| dlq:entry:<jobID> | String | 30d | Dead letter entry (JSON) |
| dlq:index | Sorted Set | None | DLQ entries by creation time |

TTL controlled by CORDUM_DLQ_ENTRY_TTL_DAYS (default 30).

Lock Keys (cordum:*)

| Key Pattern | Type | TTL | Purpose |
|---|---|---|---|
| cordum:scheduler:job:<jobID> | String | 60s | Distributed job processing lock |
| cordum:reconciler:default | String | 2×poll | Reconciler leader lock |
| cordum:wf:run:lock:<runID> | String | 30s | Workflow run operation lock |
| cordum:workflow-engine:reconciler:default | String | 2×poll | Workflow reconciler lock |
| cordum:scheduler:snapshot:writer | String | 10s | Snapshot writer leader lock |
| cordum:dlq:cleanup | String | 1h | DLQ cleanup leader lock |
| cordum:wf:delay:timers | Sorted Set | None | Delay timers (score = fire time) |
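These lock keys are presumably acquired with SET NX PX and released only by the token holder (the standard single-instance Redis lock pattern). A minimal sketch of that pattern against an in-memory stand-in — FakeRedis, acquire, and release are illustrative names, not the project's actual client:

```python
import time
import uuid

class FakeRedis:
    """In-memory stand-in for a Redis connection (illustration only)."""
    def __init__(self):
        self._data = {}  # key -> (value, expiry_monotonic)

    def set_nx_px(self, key, value, px):
        """SET key value NX PX px: succeed only if key is absent or expired."""
        now = time.monotonic()
        current = self._data.get(key)
        if current and current[1] > now:
            return False
        self._data[key] = (value, now + px / 1000)
        return True

    def get(self, key):
        current = self._data.get(key)
        if current and current[1] > time.monotonic():
            return current[0]
        return None

    def delete(self, key):
        self._data.pop(key, None)

def acquire(r, key, ttl_ms=60_000):
    """Try to take the lock; return the holder token or None."""
    token = str(uuid.uuid4())
    return token if r.set_nx_px(key, token, ttl_ms) else None

def release(r, key, token):
    """Delete the lock only if we still hold it, so we never release a
    lock that expired and was re-acquired by another worker."""
    if r.get(key) == token:
        r.delete(key)
        return True
    return False
```

In real Redis the compare-then-delete in release would be a single Lua script to stay atomic; the split version here is for readability.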

Rate Limit Keys

| Key Pattern | Type | TTL | Purpose |
|---|---|---|---|
| cordum:rl:<tenant>:<unixSec> | String | 2s | Sliding window rate limit counter |
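The <unixSec> component plus the 2s TTL suggests per-second counters where the previous second's counter is still readable, enabling a sliding-window estimate. A sketch of that arithmetic — the weighting scheme is our assumption, and the real limiter may differ:

```python
def allow(counters, tenant, limit, now):
    """Sliding-window rate check over per-second counters.

    `counters` is a dict standing in for Redis; in Redis this would be
    INCR on cordum:rl:<tenant>:<sec> followed by EXPIRE 2, so the
    previous second's key survives just long enough to be counted.
    """
    sec = int(now)
    key = f"cordum:rl:{tenant}:{sec}"
    counters[key] = counters.get(key, 0) + 1
    prev = counters.get(f"cordum:rl:{tenant}:{sec - 1}", 0)
    # Weight the previous second by the fraction of the window it still covers.
    estimate = counters[key] + prev * (1.0 - (now - sec))
    return estimate <= limit
```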

Circuit Breaker Keys

| Key Pattern | Type | TTL | Purpose |
|---|---|---|---|
| cordum:cb:safety:failures | String | 30s | Input safety circuit breaker counter |
| cordum:cb:safety:output:failures | String | 30s | Output safety circuit breaker counter |

Worker / System Keys

| Key Pattern | Type | TTL | Purpose |
|---|---|---|---|
| sys:workers:snapshot | String | None | JSON snapshot of all registered workers |

3. Per-Service Redis Dependency

| Service | Key Prefixes Used | Impact if Redis Down | Graceful Degradation? |
|---|---|---|---|
| api-gateway | job:*, auth:*, user:*, session:*, cfg:*, dlq:*, wf:*, cordum:rl:* | Cannot authenticate, submit jobs, list state, or serve any API | No — all state is in Redis |
| scheduler | job:*, cordum:scheduler:*, cordum:reconciler:*, cordum:cb:*, sys:workers:*, cfg:* | Cannot dispatch, lock, or transition jobs; circuit breaker falls back to local | Partial — circuit breaker has local fallback |
| safety-kernel | None directly (policy from file/URL) | No impact | Yes — fully independent of Redis |
| workflow-engine | wf:*, job:*, cordum:wf:*, cfg:* | Cannot execute workflows, fire timers, or coordinate runs | No — all workflow state is in Redis |
| context-engine | ctx:*, res:* | Cannot store or retrieve context/results | No — Redis is the backing store |
| dashboard | None (connects via api-gateway) | Loses API connectivity (shows errors) | N/A — depends on gateway |
| cordum-mcp | Via api-gateway | Same as dashboard | N/A |

4. Disaster Recovery Runbooks

Runbook A: Single Replica Failure

Scenario: One replica pod (cordum-redis-3/4/5) becomes unreachable.

Impact: Minimal. Reads from that replica are rerouted to the primary. No data loss.

Steps:

  1. Check pod status: kubectl get pods -l app=redis -n cordum
  2. If the pod is in CrashLoopBackOff, check logs: kubectl logs cordum-redis-N -n cordum
  3. Delete the pod to trigger restart: kubectl delete pod cordum-redis-N -n cordum
  4. Verify recovery: redis-cli --tls ... -h cordum-redis-N... CLUSTER INFO should report cluster_state:ok

Recovery time: Automatic, typically < 30s for pod restart.

Runbook B: Single Primary Failure

Scenario: A primary node (cordum-redis-0/1/2) becomes unreachable.

Impact: Moderate. Redis Cluster auto-promotes the replica to primary within cluster-node-timeout (5s). During failover, writes to the affected hash slots fail with CLUSTERDOWN or are redirected.

Steps:

  1. Verify auto-failover occurred:
    redis-cli --tls ... CLUSTER NODES | grep master
    # Should show 3 masters (one will be the promoted replica)
  2. When the failed node recovers, it rejoins as a replica automatically.
  3. If the node doesn't recover, delete the pod:
    kubectl delete pod cordum-redis-N -n cordum
  4. After pod restart, verify it rejoined the cluster:
    redis-cli --tls ... -h cordum-redis-N... CLUSTER NODES

Recovery time: 5-15s for automatic failover, then application reconnection.

Runbook C: Two or More Nodes Lost

Scenario: Two or more nodes are lost simultaneously (e.g., node drain of co-located pods).

Impact: If both a primary and its replica for the same slot range are lost, the cluster enters FAIL state and cannot serve those slots.

Steps:

  1. Check cluster state:
    redis-cli --tls ... CLUSTER INFO
    # cluster_state:fail means slots are not covered
  2. Check which slots are unassigned:
    redis-cli --tls ... CLUSTER SLOTS
  3. If pods are recoverable, wait for restart:
    kubectl get pods -l app=redis -n cordum -w
  4. If slots are still uncovered after pod recovery, re-add the node:
    redis-cli --tls ... CLUSTER MEET <recovered-node-ip> 6379
  5. Verify all 16384 slots are assigned:
    redis-cli --tls ... CLUSTER INFO | grep cluster_slots_ok
    # Expected: cluster_slots_ok:16384

Prevention: PDB (minAvailable: 4) prevents Kubernetes from evicting more than 2 pods during voluntary disruptions.

Runbook D: Full Cluster Loss

Scenario: All 6 Redis pods are lost (e.g., namespace deletion, persistent storage failure).

Impact: Total — all Cordum state is lost. Services return HTTP 500.

Steps:

  1. Re-create the StatefulSet if deleted:
    kubectl apply -k deploy/k8s/production
  2. Wait for all 6 pods:
    kubectl get pods -l app=redis -n cordum -w
  3. Re-run the cluster init job:
    kubectl delete job cordum-redis-cluster-init -n cordum 2>/dev/null
    kubectl apply -k deploy/k8s/production
    kubectl wait --for=condition=complete job/cordum-redis-cluster-init -n cordum --timeout=120s
  4. Restore from latest multi-shard backup:
    # List available backups
    kubectl exec -it <any-pod-with-backup-pvc> -n cordum -- ls -la /backup/redis-*.rdb

    # For each primary, restore the corresponding shard:
    # Stop the cluster, copy RDB files to /data on each primary, restart.
    # WARNING: This is a manual process — coordinate with the team.
  5. Restart all Cordum services to re-establish connections:
    kubectl rollout restart deployment -n cordum

Recovery time: 10-30 minutes depending on data volume.

Runbook E: Split-Brain

Scenario: Network partition causes the cluster to split into two groups, each believing the other has failed.

Impact: Both sides may promote replicas, leading to conflicting writes. After partition heals, data on the minority side is lost (Redis uses last-failover-wins).

Steps:

  1. Identify the partition:
    # Run on each node
    redis-cli --tls ... -h cordum-redis-N... CLUSTER NODES
    # Compare views — nodes on each side will show the other side as "fail"
  2. Wait for the network to heal. Redis will auto-resolve once connectivity is restored.
  3. After resolution, check for data consistency:
    redis-cli --tls ... DBSIZE # On each primary
  4. If data was lost on the minority side, affected keys must be re-created by the application (jobs re-submitted, etc.).

Prevention: Use pod anti-affinity to spread Redis pods across nodes/zones.

Runbook F: Stuck Slot Migration

Scenario: A slot is stuck in migrating or importing state after a failed rebalance operation.

Steps:

  1. Identify stuck slots:
    redis-cli --tls ... CLUSTER NODES | grep -E "migrating|importing"
  2. Fix the stuck slot:
    redis-cli --tls ... CLUSTER SETSLOT <slot> STABLE
    # Run on BOTH the source and destination node
  3. Verify:
    redis-cli --tls ... CLUSTER INFO | grep cluster_slots_ok
    # Expected: 16384

Runbook G: Cluster Init Job Failure

Scenario: The cordum-redis-cluster-init Job fails during initial deployment.

Common causes:

  • Not all 6 pods are ready yet (the init job polls with PING but may time out)
  • TLS certificates not mounted or incorrect
  • REDIS_PASSWORD secret is empty

Steps:

  1. Check job logs:
    kubectl logs job/cordum-redis-cluster-init -n cordum
  2. Verify all pods are ready:
    kubectl get pods -l app=redis -n cordum
    # All 6 must be Running and Ready (1/1 or 2/2 with exporter)
  3. Verify TLS secret:
    kubectl get secret cordum-client-tls -n cordum
  4. Delete and re-run:
    kubectl delete job cordum-redis-cluster-init -n cordum
    kubectl apply -k deploy/k8s/production

5. Base to Production Migration

This procedure migrates from the single-node base.yaml Redis Deployment to the 6-node production Redis Cluster.

Pre-Flight Checklist

  • TLS certificates generated (cordum-redis-server-tls, cordum-client-tls secrets)
  • cordum-redis-secret has a strong REDIS_PASSWORD
  • PVCs provisioned (6 × 20Gi for data + 20Gi for backups)
  • Application downtime window scheduled (migration requires a service restart)

Migration Steps

  1. Export data from single-node Redis (optional if starting fresh):

    kubectl exec -it redis-0 -n cordum -- redis-cli BGSAVE
    kubectl cp cordum/redis-0:/data/dump.rdb ./dump.rdb
  2. Scale down all Cordum services to prevent writes during migration:

    kubectl scale deployment --all --replicas=0 -n cordum
  3. Apply production overlay (creates StatefulSet, headless service, init job):

    kubectl apply -k deploy/k8s/production
  4. Wait for cluster init to complete:

    kubectl wait --for=condition=complete job/cordum-redis-cluster-init -n cordum --timeout=300s
  5. Verify cluster health:

    kubectl exec -it cordum-redis-0 -n cordum -- redis-cli --tls \
    --cacert /etc/redis/tls/ca.crt --cert /etc/redis/tls/tls.crt --key /etc/redis/tls/tls.key \
    CLUSTER INFO
    # Expect: cluster_state:ok, cluster_slots_ok:16384, cluster_known_nodes:6
  6. Update service environment variables to use cluster addresses:

    REDIS_CLUSTER_ADDRESSES=cordum-redis-0.cordum-redis.cordum.svc:6379,...,cordum-redis-5.cordum-redis.cordum.svc:6379
  7. Delete the old single-node Deployment:

    kubectl delete deployment redis -n cordum
    kubectl delete service redis -n cordum
  8. Scale Cordum services back up:

    kubectl scale deployment --all --replicas=1 -n cordum
    # Or apply your desired replica counts
  9. Verify:

    • Submit a test job and verify it completes
    • Check dashboard connectivity
    • Verify backup CronJob triggers on schedule
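One detail from step 6: REDIS_CLUSTER_ADDRESSES is a comma-separated list of host:port entries. A service-side parse might look like this — a sketch, and the parse_cluster_addresses helper is a hypothetical name, not from the codebase:

```python
def parse_cluster_addresses(value):
    """Split a REDIS_CLUSTER_ADDRESSES string into (host, port) tuples.

    rpartition on ":" takes the last colon, so hostnames containing
    dots (or future bracketed IPv6 literals) keep their full host part.
    """
    nodes = []
    for addr in value.split(","):
        host, _, port = addr.strip().rpartition(":")
        nodes.append((host, int(port)))
    return nodes
```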

Rollback

If the migration fails, scale down services, delete the production overlay resources, and re-apply base.yaml. The single-node Redis will start fresh (data from the cluster is not backward-compatible with single-node).

6. Monitoring & Alerts

Six Redis cluster alerts are defined in deploy/k8s/production/monitoring.yaml:

| Alert | Expression | For | Severity |
|---|---|---|---|
| CordumRedisClusterDegraded | min(redis_cluster_state) == 0 | 1m | critical |
| CordumRedisClusterNodeDown | count(up{service="cordum-redis"} == 1) < 6 | 5m | warning |
| CordumRedisClusterSlotsIncomplete | min(redis_cluster_slots_ok) < 16384 | 2m | critical |
| CordumRedisMemoryHigh | max(redis_memory_used / redis_memory_max) > 0.8 | 10m | warning |
| CordumRedisReplicationBroken | min(redis_connected_slaves{role="master"}) < 1 | 5m | warning |
| CordumRedisBackupFailed | kube_job_status_failed{job_name=~"cordum-redis-backup.*"} > 0 | 1h | warning |

All metrics are exposed by the existing redis_exporter sidecar.

7. Backups

The cordum-redis-backup CronJob runs hourly and backs up all primaries (not just node-0):

  1. Iterates all 6 nodes, identifies primaries via INFO replication
  2. Runs redis-cli --rdb against each primary: redis-{timestamp}-node{N}.rdb
  3. Tolerates individual node failures (partial backup is better than no backup)
  4. Exits non-zero only if zero primaries were backed up
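The tolerate-partial-failure logic in steps 1-4 reduces to the following sketch, with stand-in callables rather than the CronJob's actual script:

```python
def backup_all(nodes, is_primary, dump):
    """Back up every reachable primary; return how many succeeded.

    `is_primary` stands in for checking INFO replication, and `dump`
    for running redis-cli --rdb against one node. The caller exits
    non-zero only when the return value is 0, because a partial
    backup is better than no backup.
    """
    succeeded = 0
    for node in nodes:
        try:
            if is_primary(node):
                dump(node)
                succeeded += 1
        except Exception:
            continue  # tolerate individual node failures
    return succeeded
```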

Retention: 7 days (cordum-backup-retention CronJob runs at 03:30 UTC daily).

Cross-References