Cordum Platform Core: Must‑Have Features (Pack‑Ready)

This document defines what must live in platform core so that future packages (e.g., SRE Investigator, MCP adapters, repo tooling) can be installed and run without touching core code.

Design goal: core provides governance + runtime primitives. Packages provide domain logic (workers, connectors, workflows, UIs).

Protocol: CAP v2 is the canonical wire contract for bus and safety messages.


Principles

Core is “boring”

Core knows only:

  • jobs, workflows, state, pointers, config, policy, audit
  • scheduling, retries, timeouts, DLQ
  • approvals, budgets, and constraints

Core must not know:

  • Kubernetes, GitHub/GitLab, Datadog/Coralogix, Sentry, LLM providers
  • “incident”, “PR”, “runbook”, “patch generation” semantics
  • tool-specific topics or behavior

Packages are “installable behaviors”

A package is an overlay on the platform:

  • adds topics, workers, workflow templates, and config/policy overlays
  • uses core APIs/contracts exactly as-is
  • never requires core code changes for “new product logic”

P0 — Non‑Negotiable Core

(without these, packages will be hacks)

1) Stable protocol contract (jobs + results + pointers)

Status: Implemented using CAP v2 (github.com/cordum-io/cap/v2) with aliases in core/protocol/pb/v1.

Core must define and enforce:

  • BusPacket (CAP envelope)
    • trace_id, sender_id, created_at, protocol_version
    • payload oneof includes JobRequest, JobResult, Heartbeat, JobProgress, JobCancel, SystemAlert
  • Pointers
    • ContextPointer / ResultPointer / ArtifactPointer (store references, not big blobs)
  • Control events
    • JobCancel + Heartbeat + JobProgress
  • DLQ format
    • error_code + error_message + last_state + attempts (stored in DLQ entries)

Why packages need it: every worker and external integration becomes predictable, replayable, auditable.

Hard rule: version the envelope (protocol_version) so packages don’t break as the platform evolves.
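
For orientation, here is a minimal Go sketch of the envelope shape described above. The real CAP v2 messages are protobuf-generated, so the Go names below are assumptions, not the actual types:

  package capsketch // illustrative only, not the real CAP v2 package

  import "time"

  // BusPacket sketches the envelope fields listed above.
  type BusPacket struct {
      TraceID         string
      SenderID        string
      CreatedAt       time.Time
      ProtocolVersion string // hard rule: always set, checked on receive
      Payload         any    // oneof in the real schema: JobRequest, JobResult, Heartbeat, JobProgress, JobCancel, SystemAlert
  }

  // JobResult carries pointers, not blobs: workers write outputs to the
  // store and hand back a ResultPointer/ArtifactPointer reference.
  type JobResult struct {
      JobID       string
      ResultPtr   string
      ArtifactPtr string
  }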

How packages use this (without touching core)

  • A package defines new topics (e.g., job.sre.collect.k8s, job.sre.patch.generate).
  • Workers subscribe to those topics and always speak BusPacket{JobRequest/JobResult}.
  • Workers write outputs to ResultPointer or ArtifactPointer.
  • Core doesn’t need to “know” the meaning of the outputs.

2) Workflow Engine as a deterministic state machine

Status: Implemented in core/workflow and cmd/cordum-workflow-engine (binary cordum-workflow-engine).

Core workflow engine must support vanilla steps that don’t require packages:

  • approval (human gate)
  • delay (timer)
  • condition (evaluated expression, boolean output)
  • notify (emits a SystemAlert on the bus)
  • worker (dispatches a job to a topic/pool)
  • for_each (fan-out over array items; optional max_parallel throttle)
  • depends_on DAG dependencies (steps run when all deps succeed; independent steps run in parallel)

Required properties:

  • durable run state (crash/restart safe)
  • step retries with backoff + max attempts
  • timeouts per step
  • cancel propagation (run cancel stops running steps)
  • full run timeline (inputs/outputs pointers, status transitions)
  • schema validation for workflow input and step input/output
  • rerun-from-step and dry-run mode
  • dependency gating: failed/cancelled/timed-out deps block downstream steps (no implicit continue-on-error)
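
For a feel of the vocabulary, a hypothetical template sketch that exercises the vanilla steps above (key names are illustrative, not the engine's actual schema):

  # hypothetical workflow template; key names assumed
  name: nightly-check
  input_schema: nightly-check-input        # validated before the run starts
  steps:
    - id: collect
      type: worker
      topic: job.example.collect           # routed to whatever pool maps this topic
      timeout: 5m
      retries: { max_attempts: 3, backoff: exponential }
    - id: gate
      type: approval                       # human gate
      depends_on: [collect]
    - id: fanout
      type: for_each
      items: ${steps.collect.output.targets}
      max_parallel: 4
      depends_on: [gate]
    - id: announce
      type: notify                         # emits a SystemAlert on the bus
      depends_on: [fanout]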

Why packages need it: packages are just workflows + workers. If the workflow engine isn’t bulletproof, the “Incident→PR” product will be unreliable.

How packages use this (without touching core)

  • A package ships workflow templates that contain worker steps pointing to the package’s topics.
  • Core executes the same state machine regardless of what the steps mean.
  • If the package is uninstalled, those topics simply become unmapped and will DLQ.

3) Scheduler that is purely config‑driven (no hardcoded “core topics”)

Status: Implemented; routing comes from config/pools.yaml (topics + pool capabilities).

Core scheduler must do only:

  • topic → pool mapping (from config)
  • leasing/dispatch semantics
  • timeouts, retries, DLQ
  • pool backpressure (overload detection)

If no mapping exists: fail fast to DLQ with no_pool_mapping.

Why packages need it: installing a package becomes “add mapping + deploy workers”, not “change scheduler code”.

How packages use this (without touching core)

  • Package installation adds config overlays:
    • pools.overlay.yaml (topic → pool)
    • timeouts.overlay.yaml (step/job timeouts)
  • Workers come online in that pool.
  • Scheduler behavior remains unchanged.
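
A hypothetical shape for those overlays (the real pools.yaml keys may differ):

  # pools.overlay.yaml (sketch)
  pools:
    sre-investigator-pool:
      capabilities: [git, kubectl, network:egress]
  topics:
    "job.sre.*": sre-investigator-pool   # unmapped topics fail fast to DLQ (no_pool_mapping)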

CAP v2.5.2 Protocol Features (Scheduler)

  • Handshake handling: The scheduler subscribes to sys.handshake and processes BusPacket{Handshake} messages. Worker-role handshakes update the in-memory worker registry with component capabilities, enabling capability-aware routing.
  • ErrorCode enum: Job failures now carry a structured error_code_enum (ErrorCode enum) alongside the deprecated string error_code. The scheduler auto-populates error_code_enum from the string code when only the string is provided (e.g., "timeout" maps to ERROR_CODE_JOB_TIMEOUT). DLQ entries also carry the structured code.
  • Bus-layer validation: Incoming JobRequest and JobResult packets are validated using CAP SDK helpers (ValidateJobRequest/ValidateJobResult). Invalid packets are rejected, logged, and counted via the validation_rejections_total metric.
  • Enhanced SystemAlert: Alerts emitted by the workflow engine now include severity (enum), source_component, details (map), and trace_id alongside the deprecated string fields.

4) Safety Kernel as the single Policy Decision Point

Status: Implemented; gRPC Check/Evaluate/Explain/Simulate with snapshotting.

Core kernel must evaluate a request and return:

  • ALLOW
  • DENY
  • REQUIRE_APPROVAL
  • ALLOW_WITH_CONSTRAINTS
    (rewrite budgets, sandbox, command allowlist, redaction level)
  • Optional remediations that suggest safer alternatives (topic/capability/label tweaks).

Policy management (P0 minimum):

  • policy bundles loaded from file/URL + config service fragments (cfg:system:policy bundles)
  • config-service bundles may include metadata (author, message, timestamps) and an enabled toggle; admin overlays live under the secops/ prefix
  • signed + hot reload to new PolicySnapshot(version, hash)
  • last-known-good fallback if verification fails
  • decision audit record for every request:
    {rule_id, version, decision, constraints, reason}
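
To make this concrete, a hypothetical bundle sketch; the rule schema and capability names are assumptions, only the decision values and constraint ideas come from the text above:

  # hypothetical policy bundle (sketch)
  id: secops/example
  enabled: true
  metadata: { author: secops, message: "gate risky actions" }
  rules:
    - rule_id: prod-writes-need-approval
      match: { risk_tags: [prod, write] }
      decision: REQUIRE_APPROVAL
    - rule_id: patch-budget
      match: { capability: repo.patch.generate }   # capability name assumed
      decision: ALLOW_WITH_CONSTRAINTS
      constraints: { max_files_changed: 10, max_lines_changed: 400 }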

Safety kernel config-service source can be tuned via SAFETY_POLICY_CONFIG_SCOPE, SAFETY_POLICY_CONFIG_ID, SAFETY_POLICY_CONFIG_KEY (or disabled with SAFETY_POLICY_CONFIG_DISABLE=1).

Why packages need it: without this, “safe autopatcher” is marketing, not reality.

How packages use this (without touching core)

  • Packages do not implement security. They declare:
    • topics/tools they need
    • capability labels + risk tags
  • Admins install policy overlays that:
    • allow/deny specific capabilities
    • require approvals for risky actions (prod writes, PR creation, shell exec)
    • impose constraints (max diff size, deny-paths, network restrictions)
  • Kernel gates every job/run/tool call before execution.
  • When policy provides remediations, the gateway can apply them to create a new job without hand-editing inputs.

5) Config Service with overlays (the pack hook)

Status: Implemented; Redis-backed merge with version/hash snapshot.

Even before you “do packages”, core must support overlay config:

  • base config (platform)
  • optional fragments (future packages)

Must support:

  • merged “effective config” snapshot with a version/hash
  • live reload with rollback (scheduler reloads pools/timeouts)
  • per-tenant overrides (later)

Why packages need it: package install becomes “drop overlay files”, not “hand-edit core config”.

How packages use this (without touching core)

  • A package ships overlays:
    • routing (pools)
    • policy fragments (stored under cfg:system:policy bundles)
    • budgets/timeouts
    • optional schema registrations
  • Installer merges overlays into config service.
  • Core reads the “effective config” snapshot and behaves accordingly.
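
A tiny illustration of the merge, assuming per-key overlay-wins semantics (the actual merge rules are not spelled out here):

  # base:          timeouts: { job.default: 10m }
  # pack overlay:  timeouts: { job.sre.collect: 2m }
  # effective snapshot (carries a version + hash):
  timeouts:
    job.default: 10m
    job.sre.collect: 2m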

6) Worker Runtime SDK (even if you ship zero workers)

Status: Implemented in sdk/runtime (wraps CAP runtime).

Core should ship a tiny Go library that defines:

  • how a worker connects/subscribes to job topics
  • how it loads context and writes results via pointers
  • how it retries handlers with bounded attempts
  • how it verifies/signs CAP envelopes (optional)
  • how it exposes hooks for logging/observability

Use CAP worker helpers when you need heartbeats/progress/cancel handling.
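
A hypothetical worker sketch; the actual sdk/runtime entrypoints are not shown in this document, so everything below is illustrative:

  package main

  import (
      "context"
      "log"
  )

  // Job mirrors what the SDK hands a handler: identifiers plus a context
  // pointer to resolve inputs from the store (never inline blobs).
  type Job struct {
      ID         string
      ContextPtr string
  }

  // handle does the domain work and returns a result pointer. The SDK
  // wraps handlers with bounded retries, envelope verification, and
  // logging/observability hooks.
  func handle(ctx context.Context, job Job) (resultPtr string, err error) {
      // 1. load inputs via job.ContextPtr
      // 2. do the work
      // 3. write outputs to the store and return the pointer
      return "res://example/" + job.ID, nil
  }

  func main() {
      // A real worker would call something like runtime.Subscribe(topic, handle).
      out, _ := handle(context.Background(), Job{ID: "demo", ContextPtr: "ctx://demo"})
      log.Println("result pointer:", out)
  }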

Why packages need it: consistent worker behavior + fewer “mystery outages”.

How packages use this (without touching core)

  • Package worker repos import the SDK.
  • Upgrades become predictable (protocol_version + SDK versioning).
  • Core doesn’t need to change for every new worker.

7) Control‑plane APIs (gateway) that are pack‑agnostic

Status: Implemented in cmd/cordum-api-gateway (HTTP/WS + gRPC; binary cordum-api-gateway).

Package structure (core/controlplane/gateway/):

  • gateway/ (root): HTTP/gRPC handlers, middleware chain, MCP bridge, server lifecycle (~20 source files)
  • gateway/auth/: auth providers (API key, basic, OIDC/JWT, composite), Redis-backed user and key stores
  • gateway/packs/: pack types, constants, manifest validation, marketplace utilities, tar extraction
  • gateway/policybundles/: policy bundle types, YAML rule parsing, policy merging, evaluation helpers, audit formatting

Dependency graph: gateway → {auth, packs, policybundles}, policybundles → {auth, packs}. No circular imports.

At minimum:

  • Workflows: create/list/get/delete
  • Runs: start/get/list/cancel/delete, rerun, timeline
  • Approvals: job approvals (including workflow gate approvals)
  • Jobs: submit/status/get result pointer, cancel, remediate
  • DLQ: list/retry/delete
  • Policy: evaluate/simulate/explain + snapshot list
  • Config: get/set/effective
  • Schemas: register/get/list/delete
  • Locks: acquire/release/renew/get
  • Artifacts: put/get
  • Audit: decisions + run timeline

Why packages need it: packages use the same APIs for operations, UI, and integrations.

How packages use this (without touching core)

  • A package registers workflow templates (optional).
  • A package (or an external client) triggers runs via the gateway.
  • Ops tooling uses the same APIs to debug failures and inspect evidence pointers.

P1 — Needed for first real packages

(SRE Investigator + MCP adapter)

8) Artifact store abstraction (Redis now, pluggable later)

Status: Implemented with a Redis-backed store and retention classes.

You need a standard interface:

  • PutArtifact(content, metadata) -> artifact_ptr
  • GetArtifact(ptr)

Support:

  • size limits
  • TTL/retention classes (e.g., 7d/30d)
  • optional encryption at rest (later)
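
In Go terms the contract looks roughly like this (Put/Get semantics come from the list above; the names and metadata shape are assumptions):

  package artifact

  import "time"

  type Metadata struct {
      ContentType string
      Retention   string        // retention class, e.g. "7d" or "30d"
      TTL         time.Duration // derived from the retention class
  }

  type Store interface {
      // PutArtifact enforces size limits and returns a pointer, never the blob.
      PutArtifact(content []byte, meta Metadata) (ptr string, err error)
      GetArtifact(ptr string) ([]byte, Metadata, error)
  }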

Why packages need it: logs, test outputs, diffs, evidence = artifacts. Don’t shove them into Redis ctx.

Package usage model

  • SRE package stores “evidence bundle” as artifacts (log tails, kubectl output, CI logs).
  • PR summaries link artifacts by pointer.
  • Core remains unchanged; only the artifact storage backend may be swapped later.

9) Secrets reference model (never pass secrets as plaintext)

Status: Partially implemented: secret:// detection + redaction helpers, policy enforcement via risk tags/labels.

Core must support:

  • “secret refs”
    e.g., secret://vault/path#key or secret://k8s/ns/name
  • redaction utility for logs/evidence before LLM
  • kernel rules can block flows when secrets_present is detected
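
A minimal redaction sketch, assuming secret references are detectable by their secret:// prefix (the real helpers are more thorough):

  package redact

  import "regexp"

  var secretRef = regexp.MustCompile(`secret://[^\s"']+`)

  // Redact masks secret references and reports whether any were present,
  // a useful input for a secrets_present kernel rule.
  func Redact(s string) (clean string, secretsPresent bool) {
      secretsPresent = secretRef.MatchString(s)
      return secretRef.ReplaceAllString(s, "secret://[REDACTED]"), secretsPresent
  }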

Why packages need it: the SRE Investigator touches logs and env. This is where you get burned.

Package usage model

  • Workers never read raw secrets unless policy allows and runner profile permits.
  • Evidence is redacted before it becomes an artifact or LLM input.
  • Kernel constraints enforce “no secret material to LLM”.

10) Capability‑based routing (not just topic→pool)

Status: Implemented via pool capability profiles (config/pools.yaml) and JobMetadata.requires.

Extend scheduler mapping to support constraints:

  • pool capability profile: docker, git, kubectl, network:egress, cpu, mem, gpu
  • job declares: requires=[...], risk=[...]
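
The routing rule itself is a subset check; a minimal Go sketch with illustrative names:

  package routing

  // eligible reports whether a pool offers every capability a job requires.
  func eligible(poolCaps map[string]bool, jobRequires []string) bool {
      for _, req := range jobRequires {
          if !poolCaps[req] {
              return false
          }
      }
      return true
  }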

Why packages need it: repo verify needs toolchain; LLM needs GPU; collectors need network.

Package usage model

  • Package job submission includes requires.
  • Scheduler chooses eligible pool without knowing anything about the domain.

11) Budgets + quotas (enforced by kernel + scheduler)

Status: Partially implemented: safety constraints for max runtime/retries/artifact bytes/concurrency; gateway enforces max concurrent runs.

Per tenant / per actor:

  • max concurrent runs
  • max runtime
  • max artifact bytes
  • max retries
  • max PR size (files/lines changed) via constraints

Why packages need it: “agent went wild” becomes bounded damage.

Package usage model

  • SRE package PR creation step is constrained:
    • max files, max lines, deny paths, require approval in prod
  • Kernel returns ALLOW_WITH_CONSTRAINTS that the workflow engine/scheduler must honor.

12) Replay + re‑run semantics

Status: Implemented: rerun-from-step, dry-run, and run idempotency keys.

You need:

  • rerun a run from step N
  • rerun with same inputs (immutable pointers)
  • “dry‑run” mode (no external side effects)

Why packages need it: debugging and safe iteration.

Package usage model

  • “Incident→PR” can be re-run after policy updates or worker fixes.
  • Dry-run supports “propose patch but don’t open PR” safely.

P2 — Enterprise‑grade

(don’t block MVP, but know what’s coming)

13) Identity + tenancy model that won’t paint you into a corner

P2 core should evolve to:

  • OIDC/JWT auth for humans
  • service-to-service auth (mTLS or signed tokens)
  • RBAC for control plane actions
  • tenant isolation for data (ctx/res/artifacts)

14) Full observability

  • structured logs with trace_id/run_id/job_id
  • Prometheus metrics across core services
  • tracing propagation

15) Versioned migrations and backward compat

  • state store migrations (workflow schema evolution)
  • protocol version negotiation
  • “last-known-good” configs/policies

16) Enterprise licensing and entitlements

Status: Implemented. Licensing lives in core/licensing/ with Ed25519 signature verification and three tiers (Community/Team/Enterprise). Entitlement enforcement is applied across all services:

  • gateway rate limits
  • scheduler concurrency caps
  • workflow step limits
  • safety kernel policy bundle quotas
  • audit retention periods

Licenses degrade gracefully on expiry (services continue at Community tier). Enterprise add-ons (license issuance, SSO/SAML, advanced RBAC, SIEM export, support SLA) live in the enterprise and tools repos; this repo provides the loading, validation, and tier enforcement layer.


Pack‑Ready Hooks to Include Now (even before packages exist)

Add these fields to the job metadata today:

  • tenant_id, actor_id, actor_type
  • idempotency_key
  • pack_id (optional, empty now)
  • capability (semantic action label, not just topic)
  • risk_tags (prod/write/network/secrets/exec)
  • requires (capabilities for routing)
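
Sketched as a Go struct (fields follow the list above; the exact core types are not shown in this document):

  package meta

  type JobMetadata struct {
      TenantID       string
      ActorID        string
      ActorType      string
      IdempotencyKey string
      PackID         string   // optional, empty until packs exist
      Capability     string   // semantic action label, not just topic
      RiskTags       []string // prod/write/network/secrets/exec
      Requires       []string // capabilities for routing
  }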

Why this matters: it lets future packages plug into the same enforcement/routing/audit machinery without core changes.

Workflow steps support a meta block that maps to JobMetadata, so package templates can declare capability, risk_tags, requires, and pack_id at the step level without touching core.
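
A hypothetical step-level meta block (template keys and the capability name are assumed):

  - id: open_pr
    type: worker
    topic: job.sre.pr.open
    meta:
      pack_id: sre-investigator
      capability: scm.pr.create
      risk_tags: [prod, write]
      requires: [git, network:egress]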


Extra Core Primitives (High Leverage, Still Platform‑Pure)

These are additions that pay off massively later without turning core into product soup.

1) Policy “explain” + “simulate” APIs (security teams will demand this)

  • POST /api/v1/policy/evaluate → decision + matched rule_id + constraints
  • POST /api/v1/policy/simulate → same, but no side effects (for CI / PR reviews)
  • GET /api/v1/policy/snapshots → version/hash currently loaded
  • GET /api/v1/policy/bundles → list policy bundles
  • GET /api/v1/policy/bundles/{id} → bundle detail
  • PUT /api/v1/policy/bundles/{id} → update bundle (requires X-Principal-Role: admin)
  • POST /api/v1/policy/bundles/{id}/simulate → simulate against draft bundle
  • POST /api/v1/policy/publish → publish bundles (requires X-Principal-Role: admin)
  • POST /api/v1/policy/rollback → rollback bundles (requires X-Principal-Role: admin)
  • GET /api/v1/policy/audit → policy publish/rollback audit

Why: makes policy changes reviewable and prevents “security theater”.

Status: Implemented in gateway and safety kernel.

Bundle IDs can include / (e.g., secops/workflows). Replace / with ~ in the {id} path segment or use the bundle_id query parameter.
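
For example, an install pipeline might simulate a decision like this (the endpoint is listed above; the host and request-body shape are assumptions):

  curl -s -X POST http://localhost:8080/api/v1/policy/simulate \
    -H 'Content-Type: application/json' \
    -d '{"capability":"scm.pr.create","risk_tags":["prod","write"],"tenant_id":"acme"}'
  # → decision + matched rule_id + constraints, with no side effects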

Package integration

  • Package install pipelines can simulate policies before deployment.
  • Admins can validate “will SRE Investigator be allowed to open PRs in prod?” before enabling.

2) Schema validation as a first‑class primitive

Core should support:

  • registering JSON Schemas (or accepting inline schemas with workflows/jobs)
  • validating job inputs/outputs and step outputs
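
For instance, a package might register something like this hypothetical IncidentContext schema:

  {
    "$id": "sre/IncidentContext",
    "type": "object",
    "required": ["incident_id", "service"],
    "properties": {
      "incident_id": { "type": "string" },
      "service":     { "type": "string" },
      "environment": { "enum": ["dev", "staging", "prod"] }
    }
  }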

Why: packages become reliable and debuggable; you stop passing mystery blobs between steps.

Status: Implemented with Redis-backed schema registry and workflow input/step IO validation.

Package integration

  • SRE package enforces a schema for IncidentContext, EvidenceBundle, PatchPlan.
  • Kernel can reject malformed or suspicious inputs early.

3) Resource locks / concurrency guards (prevents chaos)

A tiny “lock service” inside core:

  • lock by {repo}, {cluster/ns}, {service/env}, {incident_id}
  • modes: shared/exclusive, TTL, owner
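
A Go sketch of the lock record semantics (illustrative names, not core's actual types):

  package locks

  import "time"

  type Mode int

  const (
      Shared    Mode = iota // concurrent readers, e.g. verify steps
      Exclusive             // single writer, e.g. patch generation / PR open
  )

  type Lock struct {
      Key       string    // e.g. "service/checkout:env/prod" (key format assumed)
      Mode      Mode
      Owner     string    // run or job ID
      ExpiresAt time.Time // TTL so a crashed owner can't wedge the resource
  }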

Why: once you run autopatcher or MCP actions, two workflows racing will wreck you.

Status: Implemented with Redis-backed shared/exclusive locks and gateway APIs.

Package integration

  • SRE Investigator acquires exclusive lock on {service/env} before patch generation/PR open.
  • Verify steps can hold shared locks; mutation steps require exclusive.

4) Runner profiles + constraints (without shipping workers)

Even if core ships zero workers, define execution profiles packages can request:

  • sandbox=isolated
  • network=none|egress-allowlist
  • fs=ro|rw
  • tools=git,kubectl,go

Scheduler routes based on requires[].
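
One plausible encoding, assuming profile constraints travel in requires[] (the exact encoding is not specified here):

  # job submission fragment (sketch)
  requires:
    - sandbox=isolated
    - network=none
    - fs=ro
    - tools=git,go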

Why: lets you enforce “this job can’t touch network” at the platform level.

Status: Partially implemented: scheduler routes by requires and constraints are passed via env; sandbox enforcement is up to workers/runners.

Package integration

  • Collectors request network egress; LLM steps request “no network”.
  • Kernel enforces that risky steps can’t run in permissive profiles.

5) Artifact store abstraction + retention classes

Standardize:

  • artifact_ptr
  • retention class (short, standard, audit)
  • max size + chunking policy

Why: avoids shoving megabytes into Redis ctx and gives audit durability.

Status: Implemented (Redis-backed artifacts + retention classes).

Package integration

  • “evidence” is audit retention; “temp logs” are short retention.

6) Immutable run/event log (append‑only timeline)

Maintain an append-only timeline:

  • state transitions, decisions, approvals, dispatches, result pointers
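
A hypothetical shape for one timeline entry (field names assumed):

  {
    "run_id": "run-123",
    "seq": 17,
    "at": "2025-06-01T12:00:00Z",
    "kind": "decision",
    "detail": { "rule_id": "prod-writes-need-approval", "decision": "REQUIRE_APPROVAL" }
  }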

Why: audit, replay, postmortems, “why did it do that?”

Status: Implemented (run timeline stored in Redis and exposed via gateway).

Package integration

  • SRE Investigator PR body can link to a canonical run timeline.
  • MCP calls can be fully reconstructed for compliance.

7) First‑class budgets enforced by kernel + scheduler

Budgets are safety:

  • max runtime, max retries, max artifact bytes, max concurrent runs
  • max diff size, max files touched, deny-path patterns (as constraints)

Why: keeps early packages safe and sellable.

Status: Partially implemented (policy constraints for runtime/retries/artifacts/concurrency).

Package integration

  • SRE patch generation constrained to max_files_changed, max_lines_changed.
  • Kernel can auto-rewrite budgets per environment (prod stricter than dev).

8) Idempotency keys + dedupe across the control plane

Make it explicit:

  • idempotency_key on submit/run
  • dedupe window + stable semantics

Why: webhook storms, retries, MCP clients will otherwise duplicate actions.

Status: Implemented for job submission and workflow run creation.

Package integration

  • Incident ingest uses incident_id as idempotency key.
  • “Open PR” step uses incident_id + repo + branch as dedupe key.
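
A Go sketch of that “Open PR” dedupe key (the exact encoding is an assumption; the point is that the same incident + repo + branch always yields the same key):

  package dedupe

  import (
      "crypto/sha256"
      "fmt"
  )

  func PRDedupeKey(incidentID, repo, branch string) string {
      sum := sha256.Sum256([]byte(incidentID + "\x00" + repo + "\x00" + branch))
      return fmt.Sprintf("pr:%x", sum[:8])
  }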

9) Ops surfaces (CLI + optional dashboard)

Ship cordumctl that can:

  • create/run/delete workflows
  • approve/reject
  • inspect run timeline
  • retry DLQ

Optional: a lightweight dashboard that talks to the gateway for run/status visibility.

Why: bring-up, debugging, demos without requiring a full UI stack.

Status: Implemented (cmd/cordumctl + smoke script, plus dashboard/; ships as cordumctl).

Package integration

  • Ops can run: cordumctl pack install <path|url> / cordumctl pack uninstall <id> / cordumctl pack verify <id>
  • CLI still drives core workflows and approvals with no packs installed.

Don’t Add to Core (It Will Rot You)

  • Datadog/Coralogix/GitHub/K8s connectors (packages only)
  • LLM providers / prompt logic (packages only)
  • SRE Investigator logic (package)
  • MCP proxy/controller logic (separate service — cmd/cordum-mcp/ provides the MCP server bridge)

Core should provide governance + runtime, not domain logic.


If You Add Only 3 Things, Add These

  1. Policy explain/simulate
  2. Resource locks
  3. Runner profiles + requires/constraints routing

These three are what make future packages safe and enterprise-real instead of toys.


How Packages Use Core Without Touching Core Code (Concrete Example: SRE Investigator)

When you install sre-investigator later, it should consist of:

  • workers (containers) that subscribe to job.sre.* topics
  • workflow templates that orchestrate those workers
  • overlays:
    • pools.overlay.yaml mapping job.sre.* → sre-investigator-pool
    • timeouts.overlay.yaml for collector/verify steps
    • safety.overlay.yaml adding:
      • allowlist for read-only collectors
      • require approval for PR creation in prod
      • constraints: deny-paths, max diff size, network rules
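
Tying the overlay bullets together, a hypothetical safety.overlay.yaml sketch (rule schema and capability names assumed, as in the policy example earlier):

  rules:
    - rule_id: sre-collectors-readonly
      match: { capability: "job.sre.collect.*" }
      decision: ALLOW
    - rule_id: sre-pr-prod-approval
      match: { capability: scm.pr.create, environment: prod }
      decision: REQUIRE_APPROVAL
    - rule_id: sre-patch-budget
      match: { capability: repo.patch.generate }
      decision: ALLOW_WITH_CONSTRAINTS
      constraints:
        deny_paths: ["infra/**", "secrets/**"]
        max_lines_changed: 400
        network: egress-allowlist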

Core stays unchanged because:

  • scheduler already routes by config
  • workflow engine already supports job dispatch + approvals + retries
  • kernel already evaluates capability/risk + applies constraints
  • artifact pointers already store evidence
  • audit log already records decisions and run timeline

Net effect: new product behavior appears by installing overlays + deploying workers, not editing core.