Cordum System Overview (current code)

This document describes the current architecture and runtime behavior of Cordum as implemented in this repository. It is intended to be code-accurate and is the primary reference for how the control plane and external workers interact.

High-level architecture

Clients/UI
    |
    v
API Gateway (HTTP/WS + gRPC)
    |  writes ctx/res/artifact pointers
    v
Redis (ctx/res/artifacts, job meta, workflows, config, DLQ, schemas, locks)
    |
    v
NATS bus (sys.* + job.* + worker.<id>.jobs)
    |
    +--> Scheduler (safety gate + routing + job state)
    |       |
    |       +--> Safety Kernel (gRPC policy check)
    |
    +--> External Workers (user-provided)
    |
    +--> Workflow Engine (run orchestration)

Core components

  • API gateway (core/controlplane/gateway, cmd/cordum-api-gateway; binary cordum-api-gateway)

    • HTTP/WS endpoints for jobs, workflows/runs, approvals, config, policy (bundles + publish/rollback/audit), DLQ, schemas, locks, artifacts, workers, traces, packs.
    • Marketplace endpoints for pack discovery/installs (gateway seeds cfg:system:pack_catalogs with the official catalog; override via env or config).
    • gRPC service (CordumApi) for job submit/status.
    • Submit-time policy evaluation: both HTTP and gRPC submit paths call the Safety Kernel before persisting state or publishing to the bus. Policy deny returns 403/PermissionDenied, throttle returns 429/ResourceExhausted, and require_human creates the job in APPROVAL state without publishing. When the Safety Kernel is unavailable, POLICY_CHECK_FAIL_MODE controls behavior: closed (default) rejects the job, open allows it.
    • Streams BusPacket events over /api/v1/stream (protojson). WebSocket connections use ping/pong keepalive (30s interval), credential revalidation (120s), unique conn_id tracking, and structured lifecycle logging with disconnect reasons.
    • Live /health endpoint checks NATS connectivity and Redis pool health, returning JSON {status, nats, redis}.
    • Policy governance: /api/v1/policy/replay (what-if replay) and /api/v1/policy/analytics (rule hit analytics, override rates).
    • Approval context: /api/v1/approvals/{job_id}/context returns enriched decision briefing with blast radius, prior approvals, rollback hints, and time-remaining.
    • Enforces API key + tenant headers and CORS allowlist if configured (HTTP X-API-Key + X-Tenant-ID, gRPC metadata x-api-key, WS Sec-WebSocket-Protocol: cordum-api-key, <base64url> + ?tenant_id=<tenant>).
    • OSS auth uses an API key allowlist (CORDUM_API_KEYS, CORDUM_API_KEY, or CORDUM_API_KEYS_PATH) with optional role/tenant metadata and a single-tenant default (TENANT_ID, which defaults to default). HTTP requests must supply X-Tenant-ID.
    • Multi-tenant API keys and RBAC enforcement are provided by the enterprise auth provider (enterprise repo).
    • Enterprise add-ons are delivered from the enterprise repo; this repo stays platform-only.
  • Dashboard (dashboard/)

    • React UI served via Nginx; connects to /api/v1 and /api/v1/stream.
    • Runtime config via /config.json (API base URL, API key, tenant, optional principal for enterprise auth).
    • Governance pages:
      • /govern/replay — policy replay (what-if analysis: replay historical jobs against candidate policies).
      • /govern/analytics — policy rule analytics (hit counts, override rates, approval latency histograms).
      • /approvals/:jobId — approval detail with 6 context sections: what, why blocked, blast radius, risk, prior history, rollback guidance.
  • Scheduler (core/controlplane/scheduler, cmd/cordum-scheduler; binary cordum-scheduler)

    • Subscribes to sys.job.submit, sys.job.result, sys.job.cancel, sys.heartbeat.
    • Calls Safety Kernel before dispatch (allow/deny/approve/throttle/constraints). When the Safety Kernel is unreachable, WithInputFailMode controls behavior: closed (default) denies the job, open allows it. Both gateway and scheduler use POLICY_CHECK_FAIL_MODE — the gateway evaluates policy at submit time, the scheduler at dispatch time.
    • Routes jobs using pool mapping + least-loaded strategy, labels, and requires-based pool eligibility.
    • Persists job state in Redis and emits DLQ for non-success results.
    • Reconciler marks stale DISPATCHED/RUNNING jobs as TIMEOUT.
    • Pending replayer retries PENDING jobs past the dispatch timeout to avoid stuck runs.
  • Safety Kernel (core/controlplane/safetykernel, cmd/cordum-safety-kernel; binary cordum-safety-kernel)

    • gRPC Check, Evaluate, Explain, Simulate; uses config/safety.yaml.
    • Deny/allow by tenant/topic, plus MCP allow/deny lists and constraints.
    • Loads policy bundles from file/URL plus config-service fragments (supports bundle enabled=false), with snapshot hashing and hot reload.
    • Applies effective config embedded in job env.
    • Returns optional remediations; gateway can apply them to create a new job with safer topic/capability/labels.
    • Optional decision cache (SAFETY_DECISION_CACHE_TTL) keeps latency low for repeated checks.
    • Server-side risk tag derivation: packs can declare a riskTagDeriver on topics. When registered, the kernel derives authoritative risk tags from job content (_content.payload_json label) instead of trusting client-supplied tags. Prevents risk tag spoofing. Built-in derivers: amount-threshold (parses amount from JSON payload, used by mock-bank).
  • Workflow Engine (core/workflow, cmd/cordum-workflow-engine; binary cordum-workflow-engine)

    • Stores workflow definitions and runs in Redis; maintains run timeline.
    • Dispatches ready steps as jobs (sys.job.submit).
    • Supports condition, delay, notify, for_each fan-out, retries/backoff, approvals, run cancel.
    • depends_on enables DAG execution: independent steps run in parallel; steps wait for all deps to succeed.
    • Failed/cancelled/timed-out/denied deps block downstream steps (no implicit continue-on-error).
    • Denied is a first-class terminal status: JOB_STATUS_DENIED maps to StepStatusDenied and propagates to RunStatusDenied (not RunStatusFailed). Denied steps support on_error recovery chains just like failed steps. The status pipeline reports denied in its own bucket, separate from failed.
    • Supports rerun-from-step and dry-run mode.
    • Validates workflow input and step input/output schemas.
    • Subscribes to sys.job.result to advance runs; reconciler retries stuck runs.
  • Context Engine (core/contextwindow/engine, cmd/cordum-context-engine; binary cordum-context-engine)

    • gRPC service for BuildWindow and UpdateMemory.
    • Maintains chat history and generic memory under mem:<memory_id>:*.
  • MCP Server (cmd/cordum-mcp; binary cordum-mcp)

    • Model Context Protocol bridge that exposes Cordum capabilities (jobs, workflows, policy) as MCP tools and resources.
  • Licensing (core/licensing/)

    • Ed25519-signed license loading and validation with three tiers (Community/Team/Enterprise).
    • Entitlement enforcement across services: gateway rate limits, scheduler concurrency caps, workflow step limits, safety kernel policy bundle quotas, audit retention periods.
    • Graceful degradation to Community tier on license expiry.
    • Config: CORDUM_LICENSE_FILE, CORDUM_LICENSE_TOKEN, CORDUM_LICENSE_PUBLIC_KEY.
  • Telemetry (core/telemetry/)

    • Structured metrics collection across all services; Prometheus exposition on each service's metrics port.
    • License-tier-aware retention and export controls.
  • External workers (not in this repo)

    • Subscribe to job topics or direct subjects; honor sys.job.cancel.
    • Write results to Redis and publish sys.job.result.
    • Use the CAP runtime in sdk/runtime for typed handlers + pointer hydration; use CAP worker helpers for heartbeats/progress/cancel when needed.
    • MCP integration lives in cordum-packs/packs/mcp-bridge, which exposes MCP tools/resources over stdio and submits MCP tool calls as Cordum jobs.

Job lifecycle (single job)

  1. Client or gateway writes input JSON to Redis at ctx:<job_id>.
    • Before persisting, the gateway evaluates submit-time policy via the Safety Kernel. Jobs denied by policy are rejected immediately (HTTP 403 / gRPC PermissionDenied) and never reach the bus or scheduler. Throttled jobs receive 429 / ResourceExhausted. Jobs requiring approval are created in APPROVAL state without publishing.
  2. Publish BusPacket{JobRequest} to sys.job.submit with context_ptr.
  3. Scheduler:
    • Sets job state PENDING, resolves effective config, runs safety check.
    • Picks a subject (worker.<id>.jobs or job.*) and dispatches.
    • Pending replayer replays old PENDING jobs past the dispatch timeout.
    • If approval is required, state becomes APPROVAL_REQUIRED; approvals are bound to the policy snapshot + job hash before requeueing.
    • If remediations are returned, the gateway can apply one via POST /api/v1/jobs/{id}/remediate (creates a new job).
  4. Worker:
    • Loads context from context_ptr, runs work, writes res:<job_id>.
    • Publishes BusPacket{JobResult} to sys.job.result.
  5. Scheduler:
    • Updates terminal state and stores result_ptr.
    • Emits DLQ entry for terminal failures except FAILED_RETRYABLE.
    • FAILED_FATAL triggers saga rollback (compensation stack).
  6. Reconciler marks stale jobs TIMEOUT based on config/timeouts.yaml.
  7. Cancellation: API or workflow engine publishes BusPacket{JobCancel} to sys.job.cancel; workers cancel in-flight jobs.

Workflow runs

  • Workflows are defined in Redis (core/workflow).
  • A run is created via /api/v1/workflows/{id}/runs.
  • Steps are dispatched as jobs using job IDs run_id:step_id@attempt.
  • Step input supports simple expressions (core/workflow/eval.go) and template expansion.
  • for_each steps fan out child jobs with foreach_index and foreach_item env fields.
  • Approval steps pause the run until /approve is called.
  • Runs and workflows can be deleted via DELETE /api/v1/workflow-runs/{id} and DELETE /api/v1/workflows/{id}.
  • Runs support idempotency keys via Idempotency-Key header on run creation.

Protocols

  • Bus and safety messages are CAP v2 types (no local duplicates):
    • BusPacket, JobRequest, JobResult, Heartbeat, PolicyCheck*
    • See github.com/cordum-io/cap/v2/cordum/agent/v1.
  • Local gRPC APIs:
    • CordumApi (the gRPC service name) for job submit and status, defined in core/protocol/proto/v1/api.proto.
    • ContextEngine in core/protocol/proto/v1/context.proto
  • Generated Go types live in core/protocol/pb/v1.

Bus subjects and delivery

Subjects:

  • sys.job.submit, sys.job.result, sys.job.progress, sys.job.dlq, sys.job.cancel, sys.heartbeat, sys.workflow.event
  • job.* pool subjects
  • worker.<id>.jobs direct worker subjects

JetStream (optional):

  • Enable with NATS_USE_JETSTREAM=1.
  • Durable subjects: sys.job.submit, sys.job.result, sys.job.dlq, job.*, worker.<id>.jobs.
  • Best-effort: sys.heartbeat, sys.job.cancel, sys.job.progress, sys.workflow.event.
  • Handlers are idempotent via Redis locks and retryable error NAKs.

Redis key map (selected)

  • Context/result:
    • ctx:<job_id> -> input payload
    • res:<job_id> -> result payload
    • art:<id> -> artifact payload
  • Job store:
    • job:meta:<job_id> (state + metadata)
    • job:state:<job_id> (state)
    • job:recent (sorted set)
    • job:index:<state> (sorted sets for reconciliation)
    • job:deadline (sorted set of deadlines)
    • job:events:<job_id> (state transition log)
    • trace:<trace_id> (set of job ids)
  • Context engine:
    • mem:<memory_id>:events, mem:<memory_id>:chunks, mem:<memory_id>:summary
  • Workflow engine:
    • wf:def:<workflow_id> (definitions)
    • wf:run:<run_id> plus run indexes (wf:runs:*)
    • wf:run:timeline:<run_id> (append-only timeline)
    • wf:run:idempotency:<key> (idempotency mapping)
  • DLQ:
    • dlq:entry:<job_id>, dlq:index
  • Config service:
    • cfg:<scope>:<id>
    • cfg:system:policy (policy fragments bundle)
    • cfg:system:packs (installed pack registry)
  • Schema registry:
    • schema:<id>, schema:index
  • Locks:
    • lock:<key> (plus owner/ttl metadata)

Binaries (cmd)

  • cordum-api-gateway
  • cordum-scheduler
  • cordum-safety-kernel
  • cordum-workflow-engine
  • cordum-context-engine
  • cordumctl (CLI)
  • cordum-mcp (MCP server)

Repo layout

  • core/ control plane, infra, protocols, workflow engine.
  • cmd/ platform binaries.

Topics -> pools

See config/pools.yaml for the full map. Topics are config-driven; no core topics are enforced.

Observability

  • Scheduler metrics: :9090/metrics
  • API gateway metrics: :9092/metrics
  • Workflow engine health: :9093/health

Testing

  • Run go test ./... (use GOCACHE=$(pwd)/.cache/go-build if needed).
  • If modifying .proto, run make proto.
  • Platform smoke: bash ./tools/scripts/platform_smoke.sh.