Cordum System Overview (current code)
This document describes the current architecture and runtime behavior of Cordum as implemented in this repository. It is intended to be code-accurate and is the primary reference for how the control plane and external workers interact.
High-level architecture
Clients/UI
|
v
API Gateway (HTTP/WS + gRPC)
| writes ctx/res/artifact pointers
v
Redis (ctx/res/artifacts, job meta, workflows, config, DLQ, schemas, locks)
|
v
NATS bus (sys.* + job.* + worker.<id>.jobs)
|
+--> Scheduler (safety gate + routing + job state)
| |
| +--> Safety Kernel (gRPC policy check)
|
+--> External Workers (user-provided)
|
+--> Workflow Engine (run orchestration)
Core components
-
API gateway (
core/controlplane/gateway,cmd/cordum-api-gateway; binarycordum-api-gateway)- HTTP/WS endpoints for jobs, workflows/runs, approvals, config, policy (bundles + publish/rollback/audit), DLQ, schemas, locks, artifacts, workers, traces, packs.
- Marketplace endpoints for pack discovery/installs (gateway seeds
cfg:system:pack_catalogswith the official catalog; override via env or config). - gRPC service (
CordumApi) for job submit/status. - Submit-time policy evaluation: both HTTP and gRPC submit paths call the Safety Kernel before persisting state or publishing to the bus. Policy deny returns 403/PermissionDenied, throttle returns 429/ResourceExhausted, and require_human creates the job in APPROVAL state without publishing. When the Safety Kernel is unavailable,
POLICY_CHECK_FAIL_MODEcontrols behavior:closed(default) rejects the job,openallows it. - Streams
BusPacketevents over/api/v1/stream(protojson). WebSocket connections use ping/pong keepalive (30s interval), credential revalidation (120s), uniqueconn_idtracking, and structured lifecycle logging with disconnect reasons. - Live
/healthendpoint checks NATS connectivity and Redis pool health, returning JSON{status, nats, redis}. - Policy governance:
/api/v1/policy/replay(what-if replay) and/api/v1/policy/analytics(rule hit analytics, override rates). - Approval context:
/api/v1/approvals/{job_id}/contextreturns enriched decision briefing with blast radius, prior approvals, rollback hints, and time-remaining. - Enforces API key + tenant headers and CORS allowlist if configured (HTTP
X-API-Key+X-Tenant-ID, gRPC metadatax-api-key, WSSec-WebSocket-Protocol: cordum-api-key, <base64url>+?tenant_id=<tenant>). - OSS auth uses an API key allowlist (
CORDUM_API_KEYS,CORDUM_API_KEY, orCORDUM_API_KEYS_PATH) with optional role/tenant metadata and a single-tenant default (TENANT_ID, defaultdefault). HTTP requests must supplyX-Tenant-ID. - Multi-tenant API keys and RBAC enforcement are provided by the enterprise auth provider (enterprise repo).
- Enterprise add-ons are delivered from the enterprise repo; this repo stays platform-only.
-
Dashboard (
dashboard/)- React UI served via Nginx; connects to
/api/v1and/api/v1/stream. - Runtime config via
/config.json(API base URL, API key, tenant, optional principal for enterprise auth). - Governance pages:
/govern/replay— policy replay (what-if analysis: replay historical jobs against candidate policies)./govern/analytics— policy rule analytics (hit counts, override rates, approval latency histograms)./approvals/:jobId— approval detail with 6 context sections: what, why blocked, blast radius, risk, prior history, rollback guidance.
- React UI served via Nginx; connects to
-
Scheduler (
core/controlplane/scheduler,cmd/cordum-scheduler; binarycordum-scheduler)- Subscribes to
sys.job.submit,sys.job.result,sys.job.cancel,sys.heartbeat. - Calls Safety Kernel before dispatch (allow/deny/approve/throttle/constraints). When the Safety Kernel is unreachable,
WithInputFailModecontrols behavior:closed(default) denies the job,openallows it. Both gateway and scheduler usePOLICY_CHECK_FAIL_MODE— the gateway evaluates policy at submit time, the scheduler at dispatch time. - Routes jobs using pool mapping + least-loaded strategy, labels, and requires-based pool eligibility.
- Persists job state in Redis and emits DLQ for non-success results.
- Reconciler marks stale
DISPATCHED/RUNNINGjobs asTIMEOUT. - Pending replayer retries
PENDINGjobs past the dispatch timeout to avoid stuck runs.
- Subscribes to
-
Safety Kernel (
core/controlplane/safetykernel,cmd/cordum-safety-kernel; binarycordum-safety-kernel)- gRPC
Check,Evaluate,Explain,Simulate; usesconfig/safety.yaml. - Deny/allow by tenant/topic, plus MCP allow/deny lists and constraints.
- Loads policy bundles from file/URL plus config-service fragments (supports bundle
enabled=false), with snapshot hashing and hot reload. - Applies effective config embedded in job env.
- Returns optional remediations; gateway can apply them to create a new job with safer topic/capability/labels.
- Optional decision cache (
SAFETY_DECISION_CACHE_TTL) keeps latency low for repeated checks. - Server-side risk tag derivation: packs can declare a
riskTagDeriveron topics. When registered, the kernel derives authoritative risk tags from job content (_content.payload_jsonlabel) instead of trusting client-supplied tags. Prevents risk tag spoofing. Built-in derivers:amount-threshold(parsesamountfrom JSON payload, used by mock-bank).
- gRPC
-
Workflow Engine (
core/workflow,cmd/cordum-workflow-engine; binarycordum-workflow-engine)- Stores workflow definitions and runs in Redis; maintains run timeline.
- Dispatches ready steps as jobs (
sys.job.submit). - Supports condition, delay, notify, for_each fan-out, retries/backoff, approvals, run cancel.
depends_onenables DAG execution: independent steps run in parallel; steps wait for all deps to succeed.- Failed/cancelled/timed-out/denied deps block downstream steps (no implicit continue-on-error).
- Denied is a first-class terminal status:
JOB_STATUS_DENIEDmaps toStepStatusDeniedand propagates toRunStatusDenied(notRunStatusFailed). Denied steps supporton_errorrecovery chains just like failed steps. The status pipeline reports denied in its own bucket, separate from failed. - Supports rerun-from-step and dry-run mode.
- Validates workflow input and step input/output schemas.
- Subscribes to
sys.job.resultto advance runs; reconciler retries stuck runs.
-
Context Engine (
core/contextwindow/engine,cmd/cordum-context-engine; binarycordum-context-engine)- gRPC service for
BuildWindowandUpdateMemory. - Maintains chat history and generic memory under
mem:<memory_id>:*.
- gRPC service for
-
MCP Server (
cmd/cordum-mcp; binarycordum-mcp)- Model Context Protocol bridge that exposes Cordum capabilities (jobs, workflows, policy) as MCP tools and resources.
-
Licensing (
core/licensing/)- Ed25519-signed license loading and validation with three tiers (Community/Team/Enterprise).
- Entitlement enforcement across services: gateway rate limits, scheduler concurrency caps, workflow step limits, safety kernel policy bundle quotas, audit retention periods.
- Graceful degradation to Community tier on license expiry.
- Config:
CORDUM_LICENSE_FILE,CORDUM_LICENSE_TOKEN,CORDUM_LICENSE_PUBLIC_KEY.
-
Telemetry (
core/telemetry/)- Structured metrics collection across all services; Prometheus exposition on each service's metrics port.
- License-tier-aware retention and export controls.
-
External workers (not in this repo)
- Subscribe to job topics or direct subjects; honor
sys.job.cancel. - Write results to Redis and publish
sys.job.result. - Use the CAP runtime in
sdk/runtimefor typed handlers + pointer hydration; use CAP worker helpers for heartbeats/progress/cancel when needed. - MCP integration lives in
cordum-packs/packs/mcp-bridge, which exposes MCP tools/resources over stdio and submits MCP tool calls as Cordum jobs.
- Subscribe to job topics or direct subjects; honor
Job lifecycle (single job)
- Client or gateway writes input JSON to Redis at
ctx:<job_id>.- Before persisting, the gateway evaluates submit-time policy via the Safety Kernel. Jobs denied by policy are rejected immediately (HTTP 403 / gRPC PermissionDenied) and never reach the bus or scheduler. Throttled jobs receive 429 / ResourceExhausted. Jobs requiring approval are created in APPROVAL state without publishing.
- Publish
BusPacket{JobRequest}tosys.job.submitwithcontext_ptr. - Scheduler:
- Sets job state
PENDING, resolves effective config, runs safety check. - Picks a subject (
worker.<id>.jobsorjob.*) and dispatches. - Pending replayer replays old
PENDINGjobs past the dispatch timeout. - If approval is required, state becomes
APPROVAL_REQUIRED; approvals are bound to the policy snapshot + job hash before requeueing. - If remediations are returned, the gateway can apply one via
POST /api/v1/jobs/{id}/remediate(creates a new job).
- Sets job state
- Worker:
- Loads context from
context_ptr, runs work, writesres:<job_id>. - Publishes
BusPacket{JobResult}tosys.job.result.
- Loads context from
- Scheduler:
- Updates terminal state and stores
result_ptr. - Emits DLQ entry for terminal failures except
FAILED_RETRYABLE. FAILED_FATALtriggers saga rollback (compensation stack).
- Updates terminal state and stores
- Reconciler marks stale jobs
TIMEOUTbased onconfig/timeouts.yaml. - Cancellation: API or workflow engine publishes
BusPacket{JobCancel}tosys.job.cancel; workers cancel in-flight jobs.
Workflow runs
- Workflows are defined in Redis (
core/workflow). - A run is created via
/api/v1/workflows/{id}/runs. - Steps are dispatched as jobs using job IDs
run_id:step_id@attempt. - Step input supports simple expressions (
core/workflow/eval.go) and template expansion. - for_each steps fan out child jobs with
foreach_indexandforeach_itemenv fields. - Approval steps pause the run until
/approveis called. - Runs and workflows can be deleted via
DELETE /api/v1/workflow-runs/{id}andDELETE /api/v1/workflows/{id}. - Runs support idempotency keys via
Idempotency-Keyheader on run creation.
Protocols
- Bus and safety messages are CAP v2 types (no local duplicates):
BusPacket,JobRequest,JobResult,Heartbeat,PolicyCheck*- See
github.com/cordum-io/cap/v2/cordum/agent/v1.
- Local gRPC APIs:
CordumApi(submit job, get status) incore/protocol/proto/v1/api.proto(gRPC service name).ContextEngineincore/protocol/proto/v1/context.proto
- Generated Go types live in
core/protocol/pb/v1.
Bus subjects and delivery
Subjects:
sys.job.submit,sys.job.result,sys.job.progress,sys.job.dlq,sys.job.cancel,sys.heartbeat,sys.workflow.eventjob.*pool subjectsworker.<id>.jobsdirect worker subjects
JetStream (optional):
- Enable with
NATS_USE_JETSTREAM=1. - Durable subjects:
sys.job.submit,sys.job.result,sys.job.dlq,job.*,worker.<id>.jobs. - Best-effort:
sys.heartbeat,sys.job.cancel,sys.job.progress,sys.workflow.event. - Handlers are idempotent via Redis locks and retryable error NAKs.
Redis key map (selected)
- Context/result:
ctx:<job_id>-> input payloadres:<job_id>-> result payloadart:<id>-> artifact payload
- Job store:
job:meta:<job_id>(state + metadata)job:state:<job_id>(state)job:recent(sorted set)job:index:<state>(sorted sets for reconciliation)job:deadline(sorted set of deadlines)job:events:<job_id>(state transition log)trace:<trace_id>(set of job ids)
- Context engine:
mem:<memory_id>:events,mem:<memory_id>:chunks,mem:<memory_id>:summary
- Workflow engine:
wf:def:<workflow_id>(definitions)wf:run:<run_id>plus run indexes (wf:runs:*)wf:run:timeline:<run_id>(append-only timeline)wf:run:idempotency:<key>(idempotency mapping)
- DLQ:
dlq:entry:<job_id>,dlq:index
- Config service:
cfg:<scope>:<id>cfg:system:policy(policy fragments bundle)cfg:system:packs(installed pack registry)
- Schema registry:
schema:<id>,schema:index
- Locks:
lock:<key>(plus owner/ttl metadata)
Binaries (cmd)
cordum-api-gatewaycordum-schedulercordum-safety-kernelcordum-workflow-enginecordum-context-enginecordumctl(CLI)cordum-mcp(MCP server)
Repo layout
core/control plane, infra, protocols, workflow engine.cmd/platform binaries.
Topics -> pools
See config/pools.yaml for the full map. Topics are config-driven; no core topics are enforced.
Observability
- Scheduler metrics:
:9090/metrics - API gateway metrics:
:9092/metrics - Workflow engine health:
:9093/health
Testing
- Run
go test ./...(useGOCACHE=$(pwd)/.cache/go-buildif needed). - If modifying
.proto, runmake proto. - Platform smoke:
bash ./tools/scripts/platform_smoke.sh.