📄 External Configuration System - Solution Design Document

SDS-MVP Overview – External Configuration System (ECS)

🎯 Context

The External Configuration System (ECS) is ConnectSoft’s Configuration-as-a-Service (CaaS) solution. It provides a centralized, secure, and tenant-aware configuration backbone for SaaS and microservice ecosystems, built on ConnectSoft principles of Clean Architecture, DDD, cloud-native design, event-driven mindset, and observability-first execution.

ECS addresses the operational complexity of managing configuration at scale, offering:

  • Centralized registry with versioning and rollback.
  • Tenant and edition-aware overlays.
  • Refreshable and event-driven propagation.
  • Multi-provider adapters (Azure AppConfig, AWS AppConfig, Redis, SQL/CockroachDB, Consul).
  • REST/gRPC APIs, SDKs, and a Config Studio UI.

📦 Scope of MVP

The MVP will focus on delivering the core configuration lifecycle:

  1. Core Services

    • Config Registry: Authoritative store of configuration items, trees, and bundles.
    • Policy Service: Tenant, edition, and environment overlays with validation rules.
    • Refresh Service: Event-driven propagation to SDKs and client services.
    • Adapter Hub: Extensible connectors to external providers.
  2. Access Channels

    • APIs: REST + gRPC endpoints with OpenAPI/gRPC contracts.
    • SDKs: .NET, JS, Mobile libraries with caching and refresh subscription.
    • Config Studio: UI for admins, developers, and viewers with approval workflows.
  3. Foundational Layers

    • Security-first: OIDC/OAuth2, RBAC, edition-aware scoping, secret management.
    • Observability-first: OTEL spans, metrics, structured logs, trace IDs.
    • Cloud-native: Containerized, autoscaling, immutable deployments.
    • Event-driven: CloudEvents-based change propagation.

🧑‍🤝‍🧑 Responsibilities by Service

| Service | Responsibility |
|---|---|
| Config Registry | CRUD, versioning, history, rollback of configuration. |
| Policy Service | Edition/tenant overlays, RBAC enforcement, schema validation. |
| Refresh Service | Emits ConfigChanged events, ensures idempotent propagation. |
| Adapter Hub | Provider adapters for Azure AppConfig, AWS AppConfig, Redis, SQL, Consul. |
| Gateway/API | Auth, routing, rate limits, tenant scoping. |
| SDKs | Local caching, refresh subscription, fallback logic. |
| Config Studio | Admin UX, workflow approvals, auditing, policy management. |

🗺️ High-Level Service Map

flowchart TD
    ClientApp[Client Apps / SDKs]
    Studio[Config Studio UI]
    Gateway[API Gateway]

    Registry[Config Registry]
    Policy[Policy Service]
    Refresh[Refresh Service]
    Adapter[Adapter Hub]

    ClientApp -->|REST/gRPC| Gateway
    Studio -->|Admin APIs| Gateway
    Gateway --> Registry
    Gateway --> Policy
    Registry --> Refresh
    Refresh --> ClientApp
    Registry --> Adapter
    Adapter --> External[Azure/AWS/Redis/SQL/Consul]
Hold "Alt" / "Option" to enable pan & zoom

🏢 Tenant and Edition Model

  • Multi-Tenant Isolation: Every tenant has an isolated configuration namespace, scoped across DB, cache, and API.

  • Edition Overlays: Global → Edition → Tenant → Service → Instance inheritance chain. Editions (e.g., Free, Pro, Enterprise) overlay baseline config with restrictions and feature toggles.

  • Environments: Dev, Stage, Prod environments are first-class entities, enforced via approval workflows in Studio.

  • RBAC Roles:

    • Admin: Full control of config + secrets.
    • Developer: Can manage configs but not override edition-level rules.
    • Viewer: Read-only access.

📊 Quality Targets

| Area | Target |
|---|---|
| Availability | 99.9% SLA (single-region MVP), roadmap for multi-region HA. |
| Latency | < 50 ms config retrieval via API or SDK cache. |
| Scalability | 100 tenants, 10k config items per tenant in MVP scope. |
| Security | Zero Trust, tenant isolation, no cross-tenant leaks. |
| Observability | 100% trace coverage with traceId, tenantId, editionId. |
| Resilience | Local SDK caching, retry + at-least-once event delivery. |
| Extensibility | Modular provider adapters and schema-based validation. |

✅ Solution Architect Notes

  • MVP services must be scaffolded using ConnectSoft.MicroserviceTemplate for uniformity.
  • Registry persistence in CockroachDB with tenant-aware schemas.
  • Hot-path caching via Redis with per-tenant keyspaces.
  • API Gateway secured with OpenIddict / OAuth2 and workload identities.
  • Events propagated via MassTransit over Azure Service Bus (default) with DLQ and replay.
  • SDKs should include offline-first fallback for resilience.
  • Next solution cycles should expand into API contracts, event schemas, and deployment topologies.

Service Decomposition & Boundaries — ECS Core

This section defines the solution‑level decomposition and interaction boundaries for ECS MVP services: Config Registry, Policy Engine, Refresh Orchestrator, Adapter Hub, Config Studio UI, Auth, and API Gateway. It is implementation‑ready for Engineering/DevOps teams using ConnectSoft templates.


🧩 Components & Responsibilities

| Component | Primary Responsibilities | Owns | External Calls | Exposes |
|---|---|---|---|---|
| API Gateway | Routing, authN/authZ enforcement, rate limiting, tenant routing, request shaping, canary/blue‑green | N/A | Auth, Core services | REST/gRPC north‑south edge |
| Auth (OIDC/OAuth2) | Issuer, scopes, service‑to‑service mTLS, token introspection, role->scope mapping | Identity config | N/A | OIDC endpoints, JWKS |
| Config Registry | CRUD of config items/trees/bundles; immutable versions; diffs; rollback; audit append | Registry schema in CockroachDB; audit log | Redis (read‑through), Event Bus (emit), DB | REST/gRPC: configs, versions, diffs, rollback |
| Policy Engine | Schema validation (JSON Schema), edition/tenant/env overlays, approval policy evaluation | Policy rules, schemas | Registry (read), Event Bus (emit on policy change) | REST: validate, resolve preview; policy mgmt APIs |
| Refresh Orchestrator | Emits ConfigPublished events; fan‑out invalidations; long‑poll/WebSocket channel mgmt | Delivery ledger (idempotency), consumer offsets | Event Bus (consume/produce), Redis (invalidate) | Event streams; refresh channels |
| Adapter Hub | Pluggable connectors to Azure AppConfig/AWS AppConfig/Consul/Redis/SQL; mirror/sync | Adapter registrations; provider cursors | Providers; Registry (write via API), Bus (emit) | REST: adapter mgmt; background sync jobs |
| Config Studio UI | Admin UX, drafts, reviews, approvals, diffs, audit viewer, policy editor | N/A (UI only) | Gateway, Auth | Browser app; Admin APIs via Gateway |

Tech Baseline (all core services): .NET 9, ConnectSoft.MicroserviceTemplate, OTEL, HealthChecks, MassTransit, Redis, CockroachDB (NHibernate for Registry, Dapper allowed in Adapters).


🧭 Bounded Contexts & Ownership

| Bounded Context | Aggregate Roots | Key Invariants | Who Can Change |
|---|---|---|---|
| Registry | ConfigItem, ConfigVersion, Bundle, AuditEntry | Versions are append‑only; rollback creates a new version; paths unique per (tenant, env) | Studio (human), Adapter Hub (automated), CI (automation) |
| Policy | PolicyRule, Schema, EditionOverlay | Resolve = Base ⊕ Edition ⊕ Tenant ⊕ Service ⊕ Instance; schema must validate pre‑publish | Studio (policy admins) |
| Propagation | Delivery, Subscription | At‑least‑once; idempotency by (tenant, path, version, etag) | Refresh Orchestrator |
| Adapters | ProviderRegistration, SyncCursor, Mapping | No direct DB writes; must use Registry APIs; deterministic mapping | Adapter Hub |

🔐 Tenancy & Scope Enforcement

| Layer | How Enforced | Notes |
|---|---|---|
| Gateway | JWT claims: tenant_id, scopes[]; per‑tenant rate limits; route constraints | Reject cross‑tenant routes |
| Services | Multi‑tenant keyspace (tenant:{id}:...) for cache; DB row filters; repository guards | Tenancy middleware in template |
| Events | CloudEvents extensions: tenantId, editionId, environment | Dropped if missing/invalid |

🔄 Interaction Rules (Who May Call Whom)

  • Studio UI → Gateway → {Registry, Policy}
  • Adapters → Gateway → Registry (never direct DB access)
  • Policy ↔ Registry (read for validation & preview only; Registry calls Policy for validation)
  • Registry → Refresh Orchestrator (emit) → Event Bus → SDKs/Consumers
  • Orchestrator → Redis (targeted invalidations)
  • Gateway ↔ Auth (token validation/JWKS)

Forbidden: Cross‑context writes (e.g., Policy writing Registry tables), Adapters writing DB directly, services bypassing Gateway for north‑south.


📡 Contracts — Endpoints & Events (MVP surface)

REST (Gateway‑fronted)

  • POST /configs/{path}:save-draft
  • POST /configs/{path}:publish
  • GET /configs/{path}:resolve?version|latest&env&service&instance (ETag, If‑None‑Match)
  • GET /configs/{path}/diff?fromVersion&toVersion
  • POST /policies/validate (body: config draft + context)
  • PUT /policies/rules/{id} / PUT /policies/editions/{editionId}
  • POST /adapters/{providerId}:sync / GET /adapters

gRPC (low‑latency SDK)

  • ResolveService.Resolve(ConfigResolveRequest) → stream / unary
  • RefreshChannel.Subscribe(Subscription) → server stream

CloudEvents (Event Bus topics)

  • ecs.config.v1.ConfigDraftSaved
  • ecs.config.v1.ConfigPublished
  • ecs.policy.v1.PolicyUpdated
  • ecs.adapter.v1.SyncCompleted
  • ecs.refresh.v1.CacheInvalidated

Event fields (common): id, source, type, time, dataRef?; ext: tenantId, editionId, environment, path, version, etag.
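
The common envelope maps directly to a small payload builder. A minimal C# sketch follows, assuming System.Text.Json; the sample values and the dataRef URI form are illustrative, not a fixed contract.

using System;
using System.Text.Json;

// Illustrative envelope for an ecs.config.v1.ConfigPublished event.
// Field names mirror the list above; concrete values are placeholders.
var envelope = new
{
    specversion = "1.0",
    id = Guid.NewGuid().ToString(),
    source = "ecs/config-registry",
    type = "ecs.config.v1.ConfigPublished",
    time = DateTimeOffset.UtcNow,
    // ECS extension attributes carried on every event:
    tenantId = "tenant-123",
    editionId = "pro",
    environment = "prod",
    path = "apps/billing/db/connectionString",
    version = "v1.4.2",
    etag = "b64url-sha256",
    // Optional pointer to materialized content instead of inlining it (assumed URI form).
    dataRef = "ecs://tenant-123/apps/billing@v1.4.2"
};

Console.WriteLine(JsonSerializer.Serialize(envelope,
    new JsonSerializerOptions { WriteIndented = true }));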


🗺️ Component Diagram

flowchart LR
  subgraph Edge
    GW[API Gateway]
    AUTH["Auth (OIDC)"]
  end

  subgraph Core
    REG[Config Registry]
    POL[Policy Engine]
    REF[Refresh Orchestrator]
    ADP[Adapter Hub]
  end

  SDK[SDKs/.NET JS Mobile]
  UI[Config Studio UI]
  BUS[(Event Bus)]
  REDIS[(Redis)]
  CRDB[(CockroachDB)]
  EXT[Ext Providers: Azure/AWS/Consul/Redis/SQL]

  UI-->GW
  SDK-->GW
  GW-->REG
  GW-->POL
  REG-- read/write -->CRDB
  REG-- emit -->REF
  REG-- hot read -->REDIS
  REF-- publish -->BUS
  BUS-- notify -->SDK
  REF-- targeted del -->REDIS
  ADP-- API -->GW
  ADP-- sync -->EXT
  GW-- JWT/JWKS -->AUTH
Hold "Alt" / "Option" to enable pan & zoom

📚 Key Sequences

1) Resolve (read hot path)

sequenceDiagram
  participant SDK
  participant GW
  participant REG as Registry
  participant RED as Redis
  participant POL as Policy
  SDK->>GW: GET /configs/{path}:resolve (If-None-Match: etag)
  GW->>RED: GET tenant:path:etag
  alt Cache hit & ETag matches
    RED-->>GW: 304
    GW-->>SDK: 304 Not Modified
  else Miss or ETag mismatch
    GW->>REG: resolve(path, ctx)
    REG->>POL: validate+overlay(draft/version, ctx)
    POL-->>REG: resolved value + etag
    REG->>RED: SET tenant:path -> value, etag, ttl
    REG-->>GW: 200 + value + ETag
    GW-->>SDK: 200 + value + ETag
  end
Hold "Alt" / "Option" to enable pan & zoom

2) Publish (write warm path)

sequenceDiagram
  participant UI as Studio UI
  participant GW
  participant REG as Registry
  participant REF as Orchestrator
  participant BUS as Event Bus
  participant RED as Redis
  UI->>GW: POST /configs/{path}:publish (Idempotency-Key)
  GW->>REG: publish(path, draftId)
  REG-->>REG: create immutable version, append audit
  REG->>RED: DEL tenant:path:* (scoped)
  REG->>REF: emit ConfigPublished(tenant, path, version, etag)
  REF->>BUS: publish CloudEvent
  BUS-->>SDK: wake/notify
  REG-->>GW: 200 Published(version, etag)
  GW-->>UI: 200 Published(version, etag)
Hold "Alt" / "Option" to enable pan & zoom

3) Adapter Sync (ingest)

sequenceDiagram
  participant ADP as Adapter Hub
  participant EXT as Ext Provider
  participant GW
  participant REG as Registry
  ADP->>EXT: fetch changes(since cursor)
  EXT-->>ADP: items delta
  ADP-->>ADP: map/transform (deterministic)
  ADP->>GW: POST /configs/{path}:save-draft (service identity)
  ADP->>GW: POST /configs/{path}:publish
  GW->>REG: publish(...)
  REG-->>ADP: ack + new cursor
Hold "Alt" / "Option" to enable pan & zoom

📈 NFRs by Service (Solution Targets)

| Service | Latency | Throughput | Scaling | Storage |
|---|---|---|---|---|
| Gateway | < 5 ms overhead | 5k RPS | HPA by RPS & p99 | N/A |
| Registry | p99 < 20 ms (cached), < 50 ms (DB) | 2k RPS read / 200 RPS write | HPA by CPU + DB queue | CockroachDB |
| Policy | p99 < 15 ms | 1k RPS | Co‑locate with Registry; cache schemas | Internal store |
| Orchestrator | < 100 ms publish | 5k msg/s | Scale by bus lag (KEDA) | Small ledger |
| Adapter Hub | N/A | Interactive/bursty sync | Per‑adapter workers | Provider cursors |
| Redis | p99 < 2 ms | > 50k ops/s | Clustered shards | In‑memory |

🚦 Failure Modes & Backpressure

| Scenario | Behavior | Operator Signal |
|---|---|---|
| Registry DB degraded | Serve from Redis cache; reject writes with 503 RETRY | Alert: p99 DB > threshold |
| Event bus outage | Buffer in Orchestrator (bounded); degrade to polling | Alert: bus lag + DLQ growth |
| Adapter provider slow | Backoff + skip tenant slice; do not block core | Adapter sync error rate |
| Policy validation fails | Block publish; keep draft; return violations | Policy violation dashboard |
| Cache stampede | Single‑flight per (tenant, path); stale‑while‑revalidate | Cache hit ratio dip |

Idempotency for publishes: Idempotency-Key header → (tenant, path, hash(body)) stored for TTL to prevent duplicates.
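
A minimal C# sketch of that rule, assuming a simple record-based store; the type and helper names (IdempotencyRecord, BuildRecord) are illustrative, not ECS APIs.

using System;
using System.Security.Cryptography;
using System.Text;

// Sketch of the publish idempotency rule above: the Idempotency-Key maps to
// (tenant, path, hash(body)) and the record is kept for a TTL.
public sealed record IdempotencyRecord(string Key, string TenantId, string Path,
    string BodyHash, DateTimeOffset ExpiresAt);

public static class PublishIdempotency
{
    public static IdempotencyRecord BuildRecord(
        string idempotencyKey, string tenantId, string path, string body, TimeSpan ttl)
    {
        var bodyHash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(body)));

        return new IdempotencyRecord(
            Key: $"{tenantId}:{path}:{idempotencyKey}",
            TenantId: tenantId,
            Path: path,
            BodyHash: bodyHash,
            ExpiresAt: DateTimeOffset.UtcNow.Add(ttl));
    }

    // A replay with the same key and body hash returns the stored response;
    // the same key with a different body hash should be rejected.
    public static bool IsReplay(IdempotencyRecord stored, string bodyHash) =>
        stored.BodyHash == bodyHash && stored.ExpiresAt > DateTimeOffset.UtcNow;
}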


🚀 Deployability & Operability

  • One container per service, health probes: /health/live, /health/ready; startup probe for Registry migrations.
  • Config via ECS itself (bootstrapped defaults from environment), secrets via KeyVault/Key Management.
  • Dashboards: golden signals per service; propagation lag, cache hits, publish failure rate.
  • Policies as code: policy repo with CI validation; signed artifacts promoted with environment gates.

✅ Solution Architect Notes

  • Prefer NHibernate in Registry for aggregate invariants; use Dapper inside Adapter workers for raw mapping speed.
  • Default bus: Azure Service Bus; keep RabbitMQ option in template flags.
  • SDKs ship with long‑poll by default; enable WS later behind a feature flag.
  • Hard rule: Adapters must use public APIs (no private shortcuts) to keep invariants centralized in Registry.
  • Next, solidify OpenAPI/gRPC contracts and CloudEvents schemas to unblock parallel implementation.

API Gateway & AuthN/AuthZ — Edge Design for ECS

The ECS edge enforces identity, tenant isolation, and traffic governance before requests reach core services. This section defines the Envoy/YARP gateway design, OIDC flows, scope/role model, tenant routing, and rate limiting used by:

  • Config Studio (SPA)
  • Public ECS APIs/SDKs (REST/gRPC)
  • Service-to-service (internal)
  • Adapters & webhooks

Outcomes: Zero‑trust ingress, edition-aware routing, consistent scopes, and predictable throttling per tenant/plan.


Edge Components & Trust Boundaries

flowchart LR
  subgraph PublicZone[Public Zone]
    Client[Browsers/SDKs/CLI]
    WAF[WAF/DoS Shield]
    Envoy[Envoy Gateway]
  end

  subgraph ServiceZone[Service Mesh / Internal Zone]
    YARP[YARP Internal Gateway]
    AuthZ[Ext AuthZ/Policy PDP]
    API[Config API]
    Policy[Policy Engine]
    Audit[Audit Service]
    Events[Event Publisher]
  end

  IdP[(OIDC Provider)]
  JWKS[(JWKS Cache)]
  RLS[(Rate Limit Service)]

  Client --> WAF --> Envoy
  Envoy -->|JWT verify| JWKS
  Envoy -->|ext_authz| AuthZ
  Envoy -->|quota| RLS
  Envoy --> YARP
  YARP --> API
  YARP --> Policy
  YARP --> Audit
  Envoy <-->|OAuth/OIDC| IdP
Hold "Alt" / "Option" to enable pan & zoom

Patterns

  • Envoy is the internet-facing gateway (JWT, mTLS upstream, rate limit, ext_authz).
  • YARP is the east–west/internal router (BFF for Studio, granular routing, canary, sticky reads).
  • AuthZ PDP centralizes fine-grained decisions (scope→action→resource→tenant) using PDP/OPA or Policy Engine APIs.

OIDC Flows & Token Types

| Client/Use Case | Grant / Flow | Token Audience (aud) | Notes |
|---|---|---|---|
| Config Studio (SPA) | Auth Code + PKCE | ecs.api | Implicit flow denied; refresh via offline_access (rotating refresh). |
| Machine-to-Machine (services) | Client Credentials | ecs.api | Used by backend jobs, adapters, CI/CD. |
| SDK in customer service | Token Exchange (RFC 8693) or Client Credentials | ecs.api | Exchange SaaS identity → ECS limited token; preserves act chain. |
| Support/Break-glass (time-box) | Device Code / Auth Code | ecs.admin | Short TTL, extra policy gates + audit. |
| Webhooks / legacy integrators | HMAC API Key (fallback) | n/a | Signed body + timestamp; scoped to tenant + resources; rotatable. |

Sequence: SPA (Auth Code + PKCE)

sequenceDiagram
  participant B as Browser (SPA)
  participant G as Envoy
  participant I as OIDC Provider
  participant Y as YARP
  participant A as Config API

  B->>G: GET /studio
  G-->>B: 302 → /authorize (PKCE)
  B->>I: /authorize?code_challenge...
  I-->>B: 302 → /callback?code=...
  B->>G: /callback?code=...
  G->>I: /token (code+verifier)
  I-->>G: id_token + access_token(jwt) + refresh_token
  G->>Y: /api/configs (Authorization: Bearer ...)
  Y->>A: forward (mTLS)
  A-->>Y: 200
  Y-->>G: 200
  G-->>B: 200 + content
Hold "Alt" / "Option" to enable pan & zoom

Sequence: S2S (Client Credentials)

sequenceDiagram
  participant Job as Worker/Adapter
  participant I as OIDC Provider
  participant G as Envoy
  participant P as Policy Engine
  participant A as Config API

  Job->>I: POST /token (client_credentials, scopes)
  I-->>Job: access_token(jwt)
  Job->>G: API call + Bearer jwt
  G->>P: ext_authz {sub, scopes, tenant_id, action, resource}
  P-->>G: ALLOW (obligations: edition, rate_tier)
  G->>A: forward (mTLS)
  A-->>G: 200
  G-->>Job: 200
Hold "Alt" / "Option" to enable pan & zoom

Scopes, Roles & Permissions

Scope Catalogue (prefix ecs.)

| Scope | Purpose | Sample Actions |
|---|---|---|
| config.read | Read effective/declared config | GET /configs, /resolve |
| config.write | Modify config (CRUD, rollout) | POST/PUT /configs, /rollback |
| policy.read | Read policy artifacts | GET /policies |
| policy.write | Manage policies, schemas | PUT /policies, /schemas |
| audit.read | Read audit & diffs | GET /audit/* |
| snapshot.manage | Import/export snapshots | POST /snapshots/export |
| adapter.manage | Manage provider connectors | POST /adapters/* |
| tenant.admin | Tenant-level admin ops | Keys, members, editions |

Scopes are granted per tenant; optional qualifiers: tenant:{id}, env:{name}, app:{id}, region:{code}.

Role→Scope Mapping (default policy)

| Role | Scopes |
|---|---|
| tenant-reader | config.read, audit.read |
| tenant-editor | config.read, config.write, audit.read |
| tenant-admin | All above + policy.read, policy.write, snapshot.manage, adapter.manage, tenant.admin |
| platform-admin | Cross-tenant (requires x-tenant-admin=true claim + break-glass policy) |
| support-operator | Time-boxed, read-most, write via approval policy |

Claims required on JWT

  • sub, iss, aud, exp, iat
  • tenant_id (or tenants array for delegated tools)
  • edition_id, env, plan_tier
  • scopes (space-separated)
  • Optional: act (actor chain), delegated=true, region, app_id

Decision rule: ALLOW = jwt.valid ∧ aud∈{ecs.api,ecs.admin} ∧ scope→action ∧ resource.scope.includes(tenant_id/env/app) ∧ edition.allows(action)
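
The rule can be expressed as a pure predicate. A C# sketch under simplifying assumptions (single audience check, tenant-only resource containment); the token and resource types are illustrative placeholders.

using System;
using System.Linq;

// Sketch of the edge decision rule above as a pure function.
public sealed record EcsToken(bool Valid, string Audience, string[] Scopes, string TenantId);
public sealed record ResourceScope(string TenantId, string Environment, string AppId);

public static class EdgeAuthorization
{
    public static bool Allow(EcsToken jwt, string requiredScope, ResourceScope resource,
        Func<string, bool> editionAllows)
    {
        return jwt.Valid
            && (jwt.Audience == "ecs.api" || jwt.Audience == "ecs.admin")
            && jwt.Scopes.Contains(requiredScope)   // scope → action
            && resource.TenantId == jwt.TenantId    // simplified tenant/env/app containment
            && editionAllows(requiredScope);        // edition gates the action
    }
}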


Tenant & Edition Routing

Resolution order (first hit wins):

  1. Host-based: {tenant}.ecs.connectsoft.cloud → tenant_id={tenant}
  2. Header: X-Tenant-Id: {tenant_id} (required for M2M/SDK if no host binding)
  3. Path prefix: /t/{tenant_id}/... (CLI & bulk ops)
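
A C# sketch of the first-hit-wins resolution order above; the domain suffix handling and helper names are assumptions.

using System;

// Host binding, then X-Tenant-Id header, then /t/{tenantId}/ path prefix.
public static class TenantResolver
{
    public static string? Resolve(string host, string? tenantHeader, string path)
    {
        // 1) Host-based: {tenant}.ecs.connectsoft.cloud
        const string suffix = ".ecs.connectsoft.cloud";
        if (host.EndsWith(suffix, StringComparison.OrdinalIgnoreCase))
            return host[..^suffix.Length];

        // 2) Header: X-Tenant-Id (required for M2M/SDK without host binding)
        if (!string.IsNullOrWhiteSpace(tenantHeader))
            return tenantHeader.Trim();

        // 3) Path prefix: /t/{tenantId}/...
        if (path.StartsWith("/t/", StringComparison.Ordinal))
        {
            var rest = path[3..];
            var slash = rest.IndexOf('/');
            return slash > 0 ? rest[..slash] : (rest.Length > 0 ? rest : null);
        }

        return null; // no tenant binding found; the gateway rejects the request
    }
}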

Validation

  • Verify tenant exists & active in Tenant Registry.
  • Map edition and plan tier to policy & rate tiers.
  • Enforce data residency: route to region cluster (EU/US/APAC).
  • Attach routing headers: x-tenant-id, x-edition-id, x-plan-tier, x-region.

Envoy route (illustrative)

- match: { prefix: "/api/" }
  request_headers_to_add:
    - header: { key: x-tenant-id, value: "%REQ(X-Tenant-Id)%" }
  typed_per_filter_config:
    envoy.filters.http.jwt_authn:
      requirement_name: ecs_jwt
  route:
    cluster: ecs-region-%DYNAMIC_REGION_FROM_TENANT%

YARP internal routes (illustrative)

{
  "Routes": [
    { "RouteId": "cfg", "Match": { "Path": "/api/configs/{**catch}" }, "ClusterId": "config-api" },
    { "RouteId": "pol", "Match": { "Path": "/api/policies/{**catch}" }, "ClusterId": "policy-api" }
  ],
  "Clusters": {
    "config-api": { "Destinations": { "d1": { "Address": "https://config-api.svc.cluster.local" } } },
    "policy-api":  { "Destinations": { "d1": { "Address": "https://policy-api.svc.cluster.local" } } }
  }
}

Rate Limiting & Quotas

Algorithms: Token Bucket at Envoy (global), local leaky-bucket per worker; descriptors include tenant_id, scope, route, plan_tier.

Default Tiers (edition-aware)

| Edition | Read RPS (burst) | Write RPS (burst) | Events/min | Webhook deliveries/min | Notes |
|---|---|---|---|---|---|
| Starter | 150 (600) | 10 (30) | 600 | 120 | Best-effort propagation |
| Pro | 600 (2,400) | 40 (120) | 3,000 | 600 | Priority queueing |
| Enterprise | 2,000 (8,000) | 120 (360) | 12,000 | 2,400 | Dedicated partitions, regional failover priority |

429 Behavior

  • Retry-After header with bucket ETA.
  • Audit a RateLimitExceeded event (tenant-scoped).
  • SDKs apply jittered exponential backoff and honor Retry-After (see the client-side sketch below).
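
A client-side C# sketch of this behavior, assuming HttpClient; the attempt limit and base delay are illustrative.

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Honor Retry-After when present, otherwise fall back to full-jitter exponential backoff.
public static class ThrottleAwareClient
{
    public static async Task<HttpResponseMessage> SendWithBackoffAsync(
        HttpClient http, Func<HttpRequestMessage> requestFactory, int maxAttempts = 5)
    {
        var rng = Random.Shared;
        for (var attempt = 1; ; attempt++)
        {
            var response = await http.SendAsync(requestFactory());
            if (response.StatusCode != HttpStatusCode.TooManyRequests || attempt >= maxAttempts)
                return response;

            var delay = response.Headers.RetryAfter?.Delta
                ?? TimeSpan.FromMilliseconds(
                    Math.Min(30_000, 250 * Math.Pow(2, attempt)) * rng.NextDouble());

            response.Dispose();
            await Task.Delay(delay);
        }
    }
}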

Envoy RLS descriptors

domain: ecs
descriptors:
  - key: tenant_id
    descriptors:
      - key: scope
        rate_limit:
          unit: second
          requests_per_unit: 600   # Pro read default

Security Controls at Edge

  • JWT verification at Envoy (per-tenant JWKS cache, kid pinning, TTL ≤ 10m).
  • ext_authz callout to PDP for ABAC/RBAC decision + obligations (edition, plan).
  • mTLS between Envoy↔YARP↔services (SPIFFE/SPIRE or workload identity).
  • CORS locked to Studio & allowed origins (configurable per tenant).
  • HMAC API Keys for webhooks: header X-ECS-Signature: sha256=... over canonical payload + ts.
  • WAF/DoS: IP reputation, geo-fencing by residency, request size caps, JSON schema guard on public write endpoints.
  • Key rotation: JWKS rollover, client secret rotation, API key rotation (max 90d).
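
A C# sketch of the X-ECS-Signature scheme above; the exact canonicalization and timestamp framing are assumptions to be confirmed against the webhook specification.

using System;
using System.Security.Cryptography;
using System.Text;

// Sign the canonical payload plus a Unix timestamp and send it as
// X-ECS-Signature: sha256=<hex>; receivers verify in constant time and reject stale timestamps.
public static class WebhookSigner
{
    public static string Sign(string apiKeySecret, string canonicalPayload, DateTimeOffset ts)
    {
        var material = $"{ts.ToUnixTimeSeconds()}.{canonicalPayload}";
        using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(apiKeySecret));
        var hash = hmac.ComputeHash(Encoding.UTF8.GetBytes(material));
        return "sha256=" + Convert.ToHexString(hash).ToLowerInvariant();
    }

    public static bool Verify(string apiKeySecret, string canonicalPayload,
        DateTimeOffset ts, string signatureHeader, TimeSpan maxSkew)
    {
        if (DateTimeOffset.UtcNow - ts > maxSkew) return false; // replay guard
        var expected = Sign(apiKeySecret, canonicalPayload, ts);
        return CryptographicOperations.FixedTimeEquals(
            Encoding.UTF8.GetBytes(expected), Encoding.UTF8.GetBytes(signatureHeader));
    }
}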

Error Model & Observability

Errors

  • 401 invalid/expired tokens → WWW-Authenticate: Bearer error="invalid_token".
  • 403 valid token but insufficient scope/tenant.
  • 429 throttled with Retry-After.
  • 400 schema violations; 415 content-type errors.

Problem Details (RFC 7807): return type, title, detail, traceId, tenant_id.

Telemetry

  • OTEL spans at Envoy & YARP; propagate traceparent, x-tenant-id.
  • Metrics: requests_total{tenant,route,scope}, authz_denied_total, jwks_refresh_seconds, 429_total.
  • Logs: structured, redacted; audit who/what/when/why for authz decisions.

Configuration-as-Code (GitOps)

  • Envoy & YARP configs templated, per environment & region; changes gate via PR + canary.
  • Policy bundles versioned (OPA or Policy Engine exports).
  • Rate-limit tiers read from ECS system config (edition-aware) with hot reload.

Reference Implementations

Envoy (JWT + ext_authz + RLS excerpt)

http_filters:
  - name: envoy.filters.http.jwt_authn
    typed_config:
      providers:
        ecs:
          issuer: "https://id.connectsoft.cloud"
          audiences: ["ecs.api","ecs.admin"]
          remote_jwks: { http_uri: { uri: "https://id.connectsoft.cloud/.well-known/jwks.json", cluster: idp }, cache_duration: 600s }
      rules:
        - match: { prefix: "/api/" }
          requires: { provider_name: "ecs" }

  - name: envoy.filters.http.ext_authz
    typed_config:
      grpc_service: { envoy_grpc: { cluster_name: pdp } }
      transport_api_version: V3
      failure_mode_allow: false

  - name: envoy.filters.http.ratelimit
    typed_config:
      domain: "ecs"
      rate_limit_service: { grpc_service: { envoy_grpc: { cluster_name: rls } } }

YARP (authZ pass-through + canary)

{
  "ReverseProxy": {
    "Transforms": [
      { "RequestHeaderOriginalHost": "true" },
      { "X-Forwarded": "Append" }
    ],
    "SessionAffinity": { "Policy": "Cookie", "AffinityKeyName": ".ecs.aff" },
    "HealthChecks": { "Passive": { "Enabled": true } }
  }
}

SDK & Client Guidance (Edge Contracts)

  • Send tenant: Prefer host-based; otherwise send X-Tenant-Id.
  • Set audience: aud=ecs.api; request only necessary scopes.
  • Respect 429: backoff with jitter; honor Retry-After.
  • Propagate trace: include traceparent for cross-service correlation.
  • Cache: ETag/If-None-Match on read endpoints.

Operational Runbook (Edge)

  • Rollover JWKS: stage new keys, overlap 24h, monitor jwt_authn_failed.
  • Policy hotfix: push PDP bundle; verify there is no regression in authz_denied_total.
  • Canary: 5% traffic to new Envoy/YARP images; roll forward on SLO steady 30m.
  • Tenant cutover (region move): drain + update registry → immediate routing update.
  • Incident: spike in 429 → inspect descriptor stats; temporarily bump burst for affected tenant via override CRD.

Solution Architect Notes

  • Decision pending: PDP choice (OPA sidecar vs. centralized Policy Engine API). Prototype both; target <5 ms p95 decision time.
  • Token exchange: Add formal RFC 8693 support to IdP for SDK delegation chains (act claim preservation).
  • Multi‑IdP federation: Map external AAD/Okta groups → ECS roles via claims transformation at IdP.
  • Per‑tenant custom domains: Support config.{customer}.com via TLS cert automation (ACME) and SNI routing.
  • Quota analytics: Expose per‑tenant usage API (current window, limit, projected exhaustion) to Studio.

This design enables secure, tenant-aware, edition-governed access with predictable performance and operational clarity from the very first release.

Public REST API (OpenAPI) – endpoints, DTOs, error model, idempotency, pagination, filtering

Scope & positioning

This section specifies the public, multi-tenant REST API for ECS v1, including resource model, endpoint surface, DTOs, error envelope, and cross‑cutting behaviors (auth, idempotency, pagination, filtering, ETags). It is implementation‑ready for Engineering/SDK agents and aligns with Gateway/Auth design already defined.

Principles

  • Versioned URIs (/api/v1/...) and semantic versions in x-api-version response header.
  • Tenant‑scoped by token (preferred) with optional admin paths that accept {tenantId}.
  • OAuth2/OIDC (JWT Bearer) with scopes: ecs.read, ecs.write, ecs.admin.
  • Problem Details (RFC 7807) with ECS extensions for all non‑2xx results.
  • ETag + If‑Match for optimistic concurrency on mutable resources.
  • Idempotency-Key for POST that create/trigger side effects.
  • Cursor pagination, stable sorting, safe filtering DSL.
  • Observability-first: traceparent, x-correlation-id, x-tenant-id (echo), x-request-id.

Core resource model

| Resource | Purpose | Key fields |
|---|---|---|
| ConfigSet | Logical configuration package (name, app, env targeting, tags) | id, name, appId, labels[], status, createdAt, updatedAt, etag |
| ConfigItem | Key/value (JSON/YAML) entries within a ConfigSet | key, value, contentType, isSecret, labels[], etag |
| Snapshot | Immutable, signed version of a ConfigSet | id, configSetId, version, createdBy, hash, note |
| Deployment | Promotion of a Snapshot to an environment/segment | id, snapshotId, environment, status, policyEval, startedAt, completedAt |
| Policy | Targeting, validation, transform rules | id, kind, expression, enabled |
| Refresh | Push/notify clients to reload (per set/tenant/app) | id, scope, status |

Endpoint surface (v1)

All paths are relative to /api/v1. Admin endpoints are prefixed with /admin/tenants/{tenantId} when cross‑tenant ops are required.

Config sets

| Method | Path | Scope | Notes |
|---|---|---|---|
| POST | /config-sets | ecs.write | Create (idempotent via Idempotency-Key) |
| GET | /config-sets | ecs.read | List with cursor pagination, filtering, sorting |
| GET | /config-sets/{setId} | ecs.read | Retrieve |
| PATCH | /config-sets/{setId} | ecs.write | Partial update (If‑Match required) |
| DELETE | /config-sets/{setId} | ecs.write | Soft delete (If‑Match required) |

Items (within a set)

| Method | Path | Scope | Notes |
|---|---|---|---|
| PUT | /config-sets/{setId}/items/{key} | ecs.write | Upsert single (If‑Match optional on update) |
| GET | /config-sets/{setId}/items/{key} | ecs.read | Get single |
| DELETE | /config-sets/{setId}/items/{key} | ecs.write | Delete single (If‑Match required) |
| POST | /config-sets/{setId}/items:batch | ecs.write | Bulk upsert/delete; idempotent with key |
| GET | /config-sets/{setId}/items | ecs.read | List/filter keys, supports Accept: application/yaml |

Snapshots & diffs

| Method | Path | Scope | Notes |
|---|---|---|---|
| POST | /config-sets/{setId}/snapshots | ecs.write | Create snapshot (idempotent via content hash) |
| GET | /config-sets/{setId}/snapshots | ecs.read | List versions (cursor) |
| GET | /config-sets/{setId}/snapshots/{snapId} | ecs.read | Get snapshot metadata |
| GET | /config-sets/{setId}/snapshots/{snapId}/content | ecs.read | Materialized config (JSON/YAML) |
| POST | /config-sets/{setId}/diff | ecs.read | Compute diff of two snapshots or working set |

Deployments & refresh

| Method | Path | Scope | Notes |
|---|---|---|---|
| POST | /deployments | ecs.write | Deploy a snapshot to target; idempotent |
| GET | /deployments | ecs.read | List/filter by set, env, status |
| GET | /deployments/{deploymentId} | ecs.read | Status, policy results, logs cursor |
| POST | /refresh | ecs.write | Trigger refresh notifications (idempotent) |

Policies

| Method | Path | Scope | Notes |
|---|---|---|---|
| POST | /config-sets/{setId}/policies | ecs.write | Create policy |
| GET | /config-sets/{setId}/policies | ecs.read | List |
| GET | /config-sets/{setId}/policies/{policyId} | ecs.read | Get |
| PUT | /config-sets/{setId}/policies/{policyId} | ecs.write | Replace (If‑Match) |
| DELETE | /config-sets/{setId}/policies/{policyId} | ecs.write | Delete (If‑Match) |

Cross‑cutting headers & behaviors

| Header | Direction | Purpose |
|---|---|---|
| Authorization: Bearer <jwt> | In | OIDC; contains sub, tid (tenant), scp (scopes) |
| Idempotency-Key | In | UUIDv4 per create/trigger; 24h window, per tenant+route+body hash |
| If-Match / ETag | In/Out | Concurrency for updates/deletes; strong ETags on representations |
| traceparent | In | W3C tracing; echoed to spans |
| x-correlation-id | In/Out | Client-provided; echoed on response and logs |
| x-api-version | Out | Semantic API impl version |
| RateLimit-* | Out | From gateway (limits & reset) |

DTOs (canonical schemas)

openapi: 3.1.0
info:
  title: ConnectSoft ECS Public API
  version: 1.0.0
servers:
  - url: https://api.connectsoft.com/ecs/api/v1
components:
  securitySchemes:
    oauth2:
      type: oauth2
      flows:
        clientCredentials:
          tokenUrl: https://auth.connectsoft.com/oauth2/token
          scopes:
            ecs.read: Read ECS resources
            ecs.write: Create/update ECS resources
            ecs.admin: Cross-tenant administration
  schemas:
    ConfigSet:
      type: object
      required: [id, name, appId, status, createdAt, updatedAt, etag]
      properties:
        id: { type: string, format: uuid }
        name: { type: string, maxLength: 120 }
        appId: { type: string }
        labels: { type: array, items: { type: string, maxLength: 64 } }
        status: { type: string, enum: [Active, Archived] }
        description: { type: string, maxLength: 1024 }
        createdAt: { type: string, format: date-time }
        updatedAt: { type: string, format: date-time }
        etag: { type: string }
    ConfigItem:
      type: object
      required: [key, value, contentType]
      properties:
        key: { type: string, maxLength: 256, pattern: "^[A-Za-z0-9:\\._\\-/]+$" }
        value: { oneOf: [ { type: object }, { type: array }, { type: string }, { type: number }, { type: boolean }, { type: "null" } ] }
        contentType: { type: string, enum: [application/json, application/yaml, text/plain] }
        isSecret: { type: boolean, default: false }
        labels: { type: array, items: { type: string } }
        etag: { type: string }
    Snapshot:
      type: object
      required: [id, configSetId, version, hash, createdAt]
      properties:
        id: { type: string, format: uuid }
        configSetId: { type: string, format: uuid }
        version: { type: string, pattern: "^v\\d+\\.\\d+\\.\\d+$" }
        note: { type: string, maxLength: 512 }
        hash: { type: string, description: "SHA-256 of materialized content" }
        createdBy: { type: string }
        createdAt: { type: string, format: date-time }
    Deployment:
      type: object
      required: [id, snapshotId, environment, status, startedAt]
      properties:
        id: { type: string, format: uuid }
        snapshotId: { type: string, format: uuid }
        environment: { type: string, enum: [dev, test, staging, prod] }
        status: { type: string, enum: [Queued, InProgress, Succeeded, Failed, Cancelled] }
        policyEval: { type: object, additionalProperties: true }
        startedAt: { type: string, format: date-time }
        completedAt: { type: string, format: date-time, nullable: true }
    Problem:
      type: object
      description: RFC 7807 with ECS extensions
      required: [type, title, status, traceId, code]
      properties:
        type: { type: string, format: uri }
        title: { type: string }
        status: { type: integer, minimum: 100, maximum: 599 }
        detail: { type: string }
        instance: { type: string }
        code: { type: string, description: "Stable machine error code" }
        traceId: { type: string }
        tenantId: { type: string }
        violations:
          type: array
          items: { type: object, properties: { field: {type: string}, message: {type: string}, code: {type: string} } }
    ListResponse:
      type: object
      properties:
        items: { type: array, items: { } } # overridden per path via allOf
        nextCursor: { type: string, nullable: true }
        total: { type: integer, description: "Optional total when cheap" }

Representative paths (excerpt)

paths:
  /config-sets:
    get:
      security: [{ oauth2: [ecs.read] }]
      parameters:
        - in: query
          name: cursor
          schema: { type: string }
        - in: query
          name: limit
          schema: { type: integer, minimum: 1, maximum: 200, default: 50 }
        - in: query
          name: sort
          schema: { type: string, enum: [name, createdAt, updatedAt] }
        - in: query
          name: order
          schema: { type: string, enum: [asc, desc], default: asc }
        - in: query
          name: filter
          schema: { type: string, description: "See Filtering DSL" }
      responses:
        "200":
          description: OK
          content:
            application/json:
              schema:
                allOf:
                  - $ref: "#/components/schemas/ListResponse"
                  - type: object
                    properties:
                      items:
                        type: array
                        items: { $ref: "#/components/schemas/ConfigSet" }
        default:
          description: Error
          content: { application/problem+json: { schema: { $ref: "#/components/schemas/Problem" } } }
    post:
      security: [{ oauth2: [ecs.write] }]
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [name, appId]
              properties:
                name: { type: string }
                appId: { type: string }
                description: { type: string }
                labels: { type: array, items: { type: string } }
      parameters:
        - in: header
          name: Idempotency-Key
          required: true
          schema: { type: string, format: uuid }
      responses:
        "201": { description: Created, headers: { ETag: { schema: { type: string } } }, content: { application/json: { schema: { $ref: "#/components/schemas/ConfigSet" } } } }
        "409": { description: Conflict, content: { application/problem+json: { schema: { $ref: "#/components/schemas/Problem" } } } }

  /config-sets/{setId}:
    get:
      security: [{ oauth2: [ecs.read] }]
      parameters: [ { in: path, name: setId, required: true, schema: { type: string, format: uuid } } ]
      responses: { "200": { content: { application/json: { schema: { $ref: "#/components/schemas/ConfigSet" } } } }, "404": { content: { application/problem+json: { schema: { $ref: "#/components/schemas/Problem" } } } } }
    patch:
      security: [{ oauth2: [ecs.write] }]
      parameters:
        - { in: path, name: setId, required: true, schema: { type: string, format: uuid } }
        - { in: header, name: If-Match, required: true, schema: { type: string } }
      requestBody:
        content:
          application/merge-patch+json:
            schema: { type: object, properties: { description: {type: string}, labels: { type: array, items: {type: string} }, status: {type: string, enum: [Active, Archived]} } }
      responses:
        "200": { headers: { ETag: { schema: { type: string } } }, content: { application/json: { schema: { $ref: "#/components/schemas/ConfigSet" } } } }
        "412": { description: Precondition Failed, content: { application/problem+json: { schema: { $ref: "#/components/schemas/Problem" } } } }

  /config-sets/{setId}/items:batch:
    post:
      security: [{ oauth2: [ecs.write] }]
      parameters:
        - { in: path, name: setId, required: true, schema: { type: string, format: uuid } }
        - { in: header, name: Idempotency-Key, required: true, schema: { type: string, format: uuid } }
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                upserts: { type: array, items: { $ref: "#/components/schemas/ConfigItem" } }
                deletes: { type: array, items: { type: string, description: "keys to delete" } }
      responses:
        "200": { description: Applied, content: { application/json: { schema: { type: object, properties: { upserted: {type: integer}, deleted: {type: integer} } } } } }

Error model (Problem Details)

Media type: application/problem+json. Base fields: type, title, status, detail, instance. ECS extensions:

  • code – stable machine code (e.g., ECS.CONFLICT.ETAG_MISMATCH, ECS.VALIDATION.FAILED)
  • traceId – correlates with OTEL traces
  • tenantId
  • violations[] – field‑level errors (field, message, code)

Examples

  • 409 Conflict (duplicate name): code=ECS.CONFLICT.DUPLICATE_NAME
  • 412 Precondition Failed (ETag): code=ECS.CONFLICT.ETAG_MISMATCH
  • 429 Too Many Requests (rate limit): code=ECS.RATE_LIMITED with RateLimit-* headers

Idempotency

  • Required on POST requests that create (/config-sets, /deployments, /refresh) or batch‑modify (items:batch) resources.
  • Key scope: {tenantId}|{route}|SHA256(body); window: 24 hours (configurable).
  • Behavior:
    • First request persists an idempotency record with the response body, status, and headers (including ETag, Location).
    • Subsequent identical requests (same key) return the cached response with Idempotent-Replay: true.
    • Key collision with a different body hash → 409 with code=ECS.IDEMPOTENCY.BODY_MISMATCH.
  • Safe methods (GET/HEAD) must not accept Idempotency-Key.

Pagination, sorting & filtering

Cursor pagination (default)

  • Query: cursor, limit (1–200; default 50).
  • Response: nextCursor (opaque). When null, end of page.
  • Optional total when cheap to compute; otherwise omitted.

Sorting

  • sort in a whitelist per resource (e.g., name|createdAt|updatedAt).
  • order: asc|desc (default asc).

Filtering DSL (simple, safe)

  • Single filter parameter using a constrained expression language:
    • Grammar: expr := field op value; op := eq|ne|gt|lt|ge|le|in|like
    • Conjunctions with and, or; parentheses supported.
    • Values URL‑encoded; strings in single quotes.
  • Examples:
    • filter=appId eq 'billing-service' and status eq 'Active'
    • filter=labels in ('prod','blue')
    • filter=updatedAt gt '2025-08-01T00:00:00Z'
  • Field allow‑list per resource to prevent injection.
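
A simplified C# sketch of allow-list enforcement for single "field op value" conditions; conjunctions and parentheses are left to the full parser, and the field list shown is only an example.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Validate a single condition against the operator set and a per-resource field allow-list.
public static class FilterGuard
{
    private static readonly HashSet<string> AllowedOps =
        new(StringComparer.OrdinalIgnoreCase)
        { "eq", "ne", "gt", "lt", "ge", "le", "in", "like" };

    private static readonly Regex Condition = new(
        @"^\s*(?<field>[A-Za-z][A-Za-z0-9_.]*)\s+(?<op>[a-z]+)\s+(?<value>.+?)\s*$",
        RegexOptions.Compiled);

    public static bool IsAllowed(string condition, ISet<string> allowedFields)
    {
        var m = Condition.Match(condition);
        return m.Success
            && allowedFields.Contains(m.Groups["field"].Value)
            && AllowedOps.Contains(m.Groups["op"].Value);
    }
}

// Example: FilterGuard.IsAllowed("appId eq 'billing-service'",
//     new HashSet<string> { "appId", "status", "updatedAt" })  // true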

Typical client flow (sequence)

sequenceDiagram
  participant Cli as Client
  participant GW as API Gateway
  participant API as ECS REST API
  participant PE as Policy Engine
  participant RO as Refresh Orchestrator

  Cli->>GW: POST /config-sets (Idempotency-Key)
  GW->>API: JWT(scopes=ecs.write, tid=...) + headers
  API-->>Cli: 201 ConfigSet (ETag)

  Cli->>GW: POST /config-sets/{setId}/items:batch (Idempotency-Key)
  GW->>API: Upserts & deletes
  API-->>Cli: 200 summary

  Cli->>GW: POST /config-sets/{setId}/snapshots (Idempotency-Key)
  GW->>API: create snapshot
  API-->>Cli: 201 Snapshot

  Cli->>GW: POST /deployments (Idempotency-Key)
  GW->>API: deploy snapshot to prod
  API->>PE: Evaluate policies
  API-->>Cli: 202 Deployment accepted

  Cli->>GW: POST /refresh (Idempotency-Key)
  GW->>RO: emit refresh signal
  RO-->>Cli: 202 Accepted
Hold "Alt" / "Option" to enable pan & zoom

Security & scopes quick map

| Operation | Scope(s) |
|---|---|
| Read any resource | ecs.read |
| Create/update/delete set/items/snapshots/deployments/policies | ecs.write |
| Cross‑tenant admin (admin paths) | ecs.admin |

Tenant is resolved from JWT tid claim; server rejects any cross‑tenant attempts on non‑admin routes.


SDK alignment (high‑level)

  • .NET/TypeScript SDKs will be generated from this OpenAPI, exposing:
    • EcsClient with typed methods (createConfigSetAsync, listConfigSetsAsync…).
    • Optional resilience policies (retries on 409/412 with ETag refresh).
    • Built‑in Idempotency-Key generator and replay handling.
    • Paged iterators (for await (const set of client.configSets.listAll(filter))).

Solution Architect Notes

  • Filtering DSL: accept this simple grammar now, revisit OData/RSQL compatibility later if needed.
  • Idempotency window: proposed 24h; confirm compliance requirements (auditability vs storage pressure).
  • Max page size: 200 proposed; validate against expected cardinality of items per set.
  • Secrets: isSecret=true values are write‑only; reads must return masked; align with provider capabilities.
  • Multi‑format content: support YAML in Accept for content endpoints; default JSON elsewhere.
  • Admin paths: include /admin/tenants/{tenantId}/... only for ECS Ops/partners; hide from standard tenants.

Ready‑to‑build artifacts

  • OpenAPI 3.1 file (expand from excerpt).
  • Contract tests (SDK acceptance):
    • Idempotency replay
    • ETag/If‑Match conflict
    • Cursor traversal
    • Filter grammar parsing & allow‑list enforcement
  • Gateway route + scope policy generation from tags/paths.

gRPC Contracts — low‑latency reads & streaming refresh channels

Scope & goals

This section defines gRPC service contracts for low‑latency SDK/agent scenarios and push‑based refresh. It complements the REST API by optimizing the read hot path and near‑real‑time change delivery with backpressure, resumability, and idempotent semantics.

Outcomes

  • Unary Resolve with sub‑10 ms in‑cluster latency (cached).
  • Server/bidi Refresh streams with at‑least‑once delivery and client acks.
  • Error details via google.rpc.* for actionable retries/backoff.
  • Strong multi‑tenant isolation via metadata + request fields.

Transport & security assumptions

  • gRPC over HTTP/2, TLS required; mTLS for service‑to‑service.
  • OIDC JWT in authorization: Bearer <jwt> metadata; audience ecs.api.
  • Per‑call deadline required by clients; server enforces sane max (e.g., 5s for unary, 60m for streams).
  • Tenant context required via metadata x-tenant-id (validated against JWT) and echoed in responses.
  • OTEL propagation: grpc-trace-bin/traceparent metadata.

Packages & options

syntax = "proto3";

package connectsoft.ecs.v1;

option csharp_namespace = "ConnectSoft.Ecs.V1";
option go_package       = "github.com/connectsoft/ecs/api/v1;ecsv1";

import "google/protobuf/any.proto";
import "google/protobuf/struct.proto";
import "google/protobuf/timestamp.proto";
import "google/rpc/error_details.proto";

Services & RPCs (IDL)

// Low-latency, cached resolution of effective configuration.
service ResolveService {
  // Resolve a single key or a set path; supports conditional fetch with ETag.
  rpc Resolve(ResolveRequest) returns (ResolveResponse);

  // Batch resolve multiple keys/paths in one RPC (atomic per key).
  rpc ResolveBatch(ResolveBatchRequest) returns (ResolveBatchResponse);

  // Enumerate keys under a path with server-side paging.
  rpc ListKeys(ListKeysRequest) returns (ListKeysResponse);
}

// Push notifications for changes; supports server-stream and bidi with ACKs.
service RefreshChannel {
  // Simple server-stream subscription; ACKs are implicit by flow control.
  rpc Subscribe(Subscription) returns (stream RefreshEvent);

  // Resumable, at-least-once delivery with explicit ACKs on the same stream.
  rpc SubscribeWithAck(stream RefreshClientMessage)
      returns (stream RefreshServerMessage);
}

// (Optional/internal) admin operations for agents and adapters.
service InternalAdmin {
  rpc SaveDraft(SaveDraftRequest) returns (SaveDraftResponse);
  rpc Publish(PublishRequest) returns (PublishResponse);
  rpc Rollback(RollbackRequest) returns (RollbackResponse);
}

Messages (core)

// Common context for resolution.
message Context {
  string environment = 1;        // dev|test|staging|prod
  string app_id      = 2;        // logical application id
  string service     = 3;        // optional microservice id
  string instance    = 4;        // optional instance id
  string edition_id  = 5;        // optional, overrides tenant default
  map<string,string> labels = 10;// optional targeting labels
}

message ResolveRequest {
  string path             = 1;  // e.g., "apps/billing/db/connectionString"
  Context context         = 2;
  string if_none_match_etag = 3; // conditional fetch
  // If version unset, server uses "latest".
  string version          = 4;  // semantic or snapshot id
}

message ResolvedValue {
  google.protobuf.Value value = 1; // JSON value
  string content_type         = 2; // application/json|yaml|text/plain
}

message ResolveResponse {
  bool not_modified          = 1; // true when ETag matches
  string etag                = 2; // strong ETag of resolved content
  ResolvedValue resolved     = 3; // omitted when not_modified=true
  string provenance_version  = 4; // snapshot or semantic version
  google.protobuf.Timestamp expires_at = 5; // cache hint
  map<string,string> meta    = 9;  // server hints (e.g., "stale-while-revalidate":"2s")
}

message ResolveBatchRequest {
  repeated ResolveRequest requests = 1;
}

message ResolveBatchResponse {
  repeated ResolveResponse responses = 1; // 1:1 with requests (same order)
}

message ListKeysRequest {
  string path     = 1;  // list under this prefix
  Context context = 2;
  int32  page_size = 3; // 1..500, default 100
  string page_token = 4;// opaque
  // Optional filtering on labels and name
  string filter = 5;    // e.g., "name like 'conn%'" or "label in ('prod','blue')"
  string order_by = 6;  // "name asc|desc", "updated_at desc"
}

message KeyEntry {
  string key   = 1;
  string etag  = 2;
  google.protobuf.Timestamp updated_at = 3;
  repeated string labels = 4;
}

message ListKeysResponse {
  repeated KeyEntry items = 1;
  string next_page_token  = 2; // empty => end
}

Streaming refresh contracts

// What to watch.
message Subscription {
  // If omitted, server infers tenant from metadata and defaults to latest/any env.
  Context context = 1;
  // Paths to watch. Supports prefix semantics.
  repeated PathSelector selectors = 2;
  // Receive only specific event types.
  repeated EventType event_types = 3;
  // Resumability: provide last committed cursor to continue after reconnect.
  string resume_after_cursor = 4;
  // Heartbeat period requested by client (server may clamp).
  int32 heartbeat_seconds = 5;
}

message PathSelector {
  string value = 1;          // e.g., "apps/billing/**" or exact key
  SelectorType type = 2;     // EXACT|PREFIX|GLOB
}

enum SelectorType { SELECTOR_TYPE_UNSPECIFIED = 0; EXACT = 1; PREFIX = 2; GLOB = 3; }

enum EventType {
  EVENT_TYPE_UNSPECIFIED = 0;
  CONFIG_PUBLISHED = 1;    // new immutable version published
  CACHE_INVALIDATED = 2;   // targeted cache purge
  POLICY_UPDATED = 3;      // policy/edition overlay changed
}

message RefreshEvent {
  string event_id = 1;           // unique id for idempotency
  string cursor   = 2;           // monotonically increasing per-tenant offset
  EventType type  = 3;
  string path     = 4;           // affected key/prefix
  string version  = 5;           // snapshot/semantic version
  string etag     = 6;           // new etag for resolve
  google.protobuf.Timestamp time = 7;
  map<string,string> meta = 8;   // e.g., reason, actor, correlation id
}

message RefreshClientMessage {
  oneof message {
    Subscription subscribe = 1;
    Ack          ack       = 2;
    Heartbeat    heartbeat = 3;
  }
}

message RefreshServerMessage {
  oneof message {
    RefreshEvent event      = 1;
    Heartbeat    heartbeat  = 2;
    Nack         nack       = 3; // e.g., invalid subscription; includes retry info
  }
}

message Ack {
  string cursor = 1;  // last successfully processed cursor
}

message Nack {
  string code           = 1;  // stable machine code, e.g., "ECS.SUBSCRIPTION.INVALID_SELECTOR"
  string human_message  = 2;
  google.rpc.RetryInfo retry = 3; // backoff guidance
}

message Heartbeat {
  google.protobuf.Timestamp time = 1;
  string server_id = 2;
}

Delivery guarantees

  • At-least-once: Events may repeat; dedupe by event_id or cursor.
  • Ordering: Per tenant and selector forward ordering is preserved best-effort; cross‑selector ordering is not guaranteed.
  • Resumption: Provide resume_after_cursor or Ack.cursor to continue after disconnect.
  • Flow control: gRPC backpressure applies. Server honors slow consumers with bounded buffers, then nacks with RESOURCE_EXHAUSTED when limits are exceeded.
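
A consumer-side C# sketch of these guarantees: deduplicate by event_id, apply the change, then checkpoint the cursor that will be sent as resume_after_cursor after a reconnect. The processor shape is illustrative, not an SDK type.

using System;
using System.Collections.Generic;

public sealed class RefreshEventProcessor
{
    private readonly HashSet<string> _seenEventIds = new();
    private string? _lastCommittedCursor;

    public void Handle(string eventId, string cursor, Action apply)
    {
        if (!_seenEventIds.Add(eventId))
            return; // duplicate delivery (at-least-once); already applied

        apply();                        // e.g., invalidate the local cache entry
        _lastCommittedCursor = cursor;  // ack this cursor on the SubscribeWithAck stream
    }

    // Used as Subscription.resume_after_cursor when reconnecting.
    public string? ResumeCursor => _lastCommittedCursor;
}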

Error model & rich details

  • gRPC status codes + google.rpc.*:
    • INVALID_ARGUMENT + BadRequest (schema/filter/selector errors)
    • UNAUTHENTICATED, PERMISSION_DENIED
    • NOT_FOUND, FAILED_PRECONDITION (ETag mismatch on writes in internal admin)
    • RESOURCE_EXHAUSTED + QuotaFailure (rate/tenant quotas)
    • ABORTED (concurrency conflict)
    • UNAVAILABLE + RetryInfo (transient)
  • Common ErrorInfo fields:
    • reason: stable code like ECS.IDEMPOTENCY.BODY_MISMATCH
    • domain: "ecs.connectsoft.cloud"
    • metadata: tenantId, path, cursor, retryAfter

Metadata (headers/trailers)

Incoming (from client)

  • authorization: Bearer <jwt> (required)
  • x-tenant-id: <id> (required if not derivable from host; must match JWT)
  • x-idempotency-key: <uuid> (for InternalAdmin mutating RPCs)
  • traceparent / grpc-trace-bin (optional, recommended)

Outgoing (from server)

  • x-tenant-id, x-edition-id, x-plan-tier (echo)
  • x-cache: hit|miss|revalidated (Resolve*)
  • x-etag: <etag> (Resolve*)
  • Trailers may include google.rpc.status-details-bin for rich error details

Semantics & performance notes

Resolve / ResolveBatch

  • Reads are served from Redis (read‑through) with ETag; DB only on miss.
  • not_modified=true when if_none_match_etag matches current value.
  • expires_at provides a soft TTL hint for SDK cache; SDK should also listen to RefreshChannel for proactive reloads.
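
An SDK-side C# sketch of this conditional-read loop: cache by key, send the cached ETag as if_none_match_etag, and keep the cached value when not_modified is returned. The cache entry shape and resolver delegate are assumptions rather than SDK contracts.

using System;
using System.Collections.Concurrent;

public sealed class ResolveCache
{
    private sealed record Entry(string Etag, object Value, DateTimeOffset ExpiresAt);
    private readonly ConcurrentDictionary<string, Entry> _entries = new();

    public object GetOrResolve(string cacheKey,
        Func<string?, (bool NotModified, string Etag, object Value, DateTimeOffset ExpiresAt)> resolve)
    {
        _entries.TryGetValue(cacheKey, out var cached);

        // Within the soft TTL, serve locally; RefreshChannel events purge entries earlier.
        if (cached is not null && cached.ExpiresAt > DateTimeOffset.UtcNow)
            return cached.Value;

        var result = resolve(cached?.Etag);   // conditional Resolve RPC
        if (result.NotModified && cached is not null)
        {
            _entries[cacheKey] = cached with { ExpiresAt = result.ExpiresAt };
            return cached.Value;
        }

        _entries[cacheKey] = new Entry(result.Etag, result.Value, result.ExpiresAt);
        return result.Value;
    }

    public void Invalidate(string cacheKey) => _entries.TryRemove(cacheKey, out _);
}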

ListKeys

  • Server‑validated filter (subset of REST DSL) with allow‑listed fields only.
  • Stable ordering with cursor tokens (opaque).

RefreshChannel

  • Subscribe: simplest server stream; rely on TCP/HTTP2 flow control and client reconnect with resume on disconnects.
  • SubscribeWithAck: recommended for agents and high‑value consumers; server commits delivery when it receives Ack with last cursor.
  • Heartbeats are sent every 15–60 s (configurable). Clients should close and reopen the stream after three consecutive missed heartbeats.

Versioning & compatibility

  • Package versioned at connectsoft.ecs.v1.
  • Additive changes only (new fields with higher tags, default behavior preserved).
  • SDKs must ignore unknown event types by default; reserve numeric ranges per event family.

Optional .NET server (code‑first) sketch

[ServiceContract(Name = "ResolveService", Namespace = "connectsoft.ecs.v1")]
public interface IResolveService
{
    ValueTask<ResolveResponse> Resolve(ResolveRequest request, CallContext ctx = default);
    ValueTask<ResolveBatchResponse> ResolveBatch(ResolveBatchRequest request, CallContext ctx = default);
    ValueTask<ListKeysResponse> ListKeys(ListKeysRequest request, CallContext ctx = default);
}

[ServiceContract(Name = "RefreshChannel", Namespace = "connectsoft.ecs.v1")]
public interface IRefreshChannel
{
    IAsyncEnumerable<RefreshEvent> Subscribe(Subscription sub, CallContext ctx = default);
    IAsyncEnumerable<RefreshServerMessage> SubscribeWithAck(
        IAsyncEnumerable<RefreshClientMessage> messages, CallContext ctx = default);
}

SDK guidance (client side)

  • Always set a deadline (e.g., 250–500 ms for Resolve, 2–5 s for ResolveBatch).
  • Respect RetryInfo and gRPC status codes; use exponential backoff with jitter for UNAVAILABLE/RESOURCE_EXHAUSTED.
  • Maintain a local cache keyed by (tenant, context, path) with ETag; prefer refresh stream over polling.
  • Deduplicate refresh events by event_id/cursor; checkpoint the last committed cursor.
  • Use per‑tenant channels for isolation and to preserve ordering.

Solution Architect Notes

  • The ACKed bidi stream should be the default for SDKs; simple server-stream fits lightweight/mobile.
  • Implement cursor compaction (periodic checkpoints) to bound replay windows.
  • Consider compressed payloads (grpc-encoding: gzip) for large batch resolves; set max message size caps.
  • Add feature flag to emit CloudEvents envelope on refresh events when integrating with external buses.
  • Validate multi‑region behavior: cursor is region‑scoped; cross‑region failover resets to last replicated cursor.

This contract enables high‑performance, low‑latency config reads and robust, resumable change propagation—ready for codegen and template scaffolding.

Config Versioning & Lineage — semantic versioning, snapshotting, diffs, provenance, rollback, immutability guarantees

Objectives

Define how configuration is versioned, snapshotted, diffed, traced, and rolled back in ECS with cryptographic immutability and auditable lineage—so engineering can implement storage, APIs, and SDK behaviors consistently.


Versioning Model (SemVer + Content Addressing)

Concepts

  • Draft – mutable workspace of a ConfigSet (unpublished).
  • Snapshot – immutable point‑in‑time materialization of a ConfigSet after policy validation.
  • Version (SemVer) – human‑friendly tag (e.g., v1.4.2) aliased to a concrete Snapshot.
  • Alias – semantic pointers like latest, stable, lts, environment pins (prod-current).
  • Content Hash – SHA-256(canonical-json) of materialized set; primary ETag and immutability anchor.

Rules

  • SemVer (MAJOR.MINOR.PATCH):
    • MAJOR for breaking schema/keys removal, MINOR for additive keys, PATCH for value changes only.
    • Pre‑release allowed for staged rollouts: v1.2.0-rc.1.
  • ETag = base64url(SHA-256(canonical-json)); strong validator for reads and updates.
  • Canonicalization (for hashing/diff):
    • Deterministic key ordering (UTF‑8, lexicographic).
    • No insignificant whitespace; numbers normalized; booleans/literals preserved.
    • Secrets are represented by references (not raw values) when computing hash to avoid exposure and to allow secret rotation without content drift (see “Secrets & Immutability”).
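
For a flat key/value set, the ETag rule above reduces to a few lines of C#; full canonicalization (nested objects, number normalization, secret references) is richer than this sketch.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

// Order keys lexicographically, serialize without insignificant whitespace,
// hash with SHA-256, and base64url-encode the digest.
public static class SnapshotEtag
{
    public static string Compute(IReadOnlyDictionary<string, string> items)
    {
        var ordered = new SortedDictionary<string, string>(
            items.ToDictionary(kv => kv.Key, kv => kv.Value), StringComparer.Ordinal);

        var canonical = JsonSerializer.Serialize(ordered);
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));

        return Convert.ToBase64String(hash)                        // base64url: URL-safe
            .TrimEnd('=').Replace('+', '-').Replace('/', '_');     // alphabet, no padding
    }
}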

Snapshotting Lifecycle

sequenceDiagram
  participant UI as Studio UI
  participant API as Registry API
  participant POL as Policy Engine
  participant AUD as Audit Store
  participant ORC as Refresh Orchestrator

  UI->>API: Save Draft (mutations…)
  UI->>API: POST /config-sets/{id}/snapshots (Idempotency-Key)
  API->>POL: Validate schema + policy overlays
  POL-->>API: OK (violations=[])
  API-->>API: Materialize canonical JSON (ordered), compute SHA-256
  API->>AUD: Append SnapshotCreated (hash, size, actor, source)
  API-->>UI: 201 Snapshot {snapshotId, etag, materializedHash}
  UI->>API: POST /deployments (optional)
  API->>ORC: Emit ConfigPublished (tenant, set, version/etag)
Hold "Alt" / "Option" to enable pan & zoom

Snapshot Invariants

  • Write‑once: Snapshot rows are append‑only. No UPDATE/DELETE.
  • Hash‑stable: identical content → identical hash → idempotent creation (API returns 200 with existing Snapshot).
  • Schema‑valid: publish fails if JSON Schema or policy guardrails fail.
  • Provenance‑rich: who/when/why/source captured (see below).

Lineage & Provenance

Data captured per Snapshot

| Field | Description |
|---|---|
| snapshotId (GUID) | Unique identity, primary key. |
| configSetId | Parent set. |
| semver | Optional tag (can move between snapshots); stored in a separate mapping. |
| hash | SHA‑256 of canonical materialization. |
| parents[] | Array of parent snapshot IDs (for merges); by default a single parent for linear history. |
| createdAt/By | UTC timestamp + normalized actor id (sub@iss). |
| source | Studio, API, or Adapter:{provider}. |
| reason | Human message explaining the "why" (commit message). |
| policySummary | Validations/approvals references. |
| sbom (optional) | Content provenance (templates/rules versions) for compliance. |

Lineage graph (DAG)

  • Default: linear chain per ConfigSet.
  • Merge: Allowed via “import & reconcile” workflows → multi‑parent Snapshot; computed diff is 3‑way.
graph LR
  A((S-001))-->B((S-002))
  B-->C((S-003))
  C-->D((S-004))
  X((Feature-Branch S-00X))-->M((S-005 merge))
  D-->M
Hold "Alt" / "Option" to enable pan & zoom

Engineering may begin with linear history; merge support can be introduced as an additive feature flag.


Diff Strategy

Outputs

  • Structural diff (RFC 6902 JSON Patch) for machine application.
  • Human diff (grouped change list) for Studio UI, with highlights by key/area.
  • Semantic diff (optional) using Schema annotations to mark breaking/additive changes.

Calculation

  1. Materialize both snapshots into canonical JSON.
  2. Run structural diff producing operations: add, remove, replace, move (rare), copy (rare), test (not emitted).
  3. Annotate operations with:
    • Severity (Breaking/Additive/Neutral) via schema.
    • Scope (security‑sensitive vs. safe).
    • Blast radius (estimated by affected services).
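
As an illustration of step 3, a simple classifier over RFC 6902 operations might look like the sketch below; the severity rules and the shape of the schema/security path sets are assumptions, not the final policy:

using System;
using System.Collections.Generic;

public enum DiffSeverity { Breaking, Additive, Neutral }

public sealed record PatchOp(string Op, string Path, object? Value);            // one RFC 6902 operation
public sealed record AnnotatedOp(PatchOp Op, DiffSeverity Severity, bool SecuritySensitive);

public static class DiffAnnotator
{
    // schemaBreakingPaths / securityPaths would be derived from JSON Schema annotations in the real service.
    public static IEnumerable<AnnotatedOp> Annotate(
        IEnumerable<PatchOp> patch, ISet<string> schemaBreakingPaths, ISet<string> securityPaths)
    {
        foreach (var op in patch)
        {
            var severity = op.Op switch
            {
                "add" => DiffSeverity.Additive,                                   // new keys are additive
                "remove" => DiffSeverity.Breaking,                                // removed keys break consumers
                "replace" when schemaBreakingPaths.Contains(op.Path) => DiffSeverity.Breaking,
                _ => DiffSeverity.Neutral                                         // value-only changes
            };
            yield return new AnnotatedOp(op, severity, securityPaths.Contains(op.Path));
        }
    }
}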

API

  • POST /config-sets/{id}/diff — body: { from: snapshotId|semver, to: snapshotId|semver } Returns: { patch: JsonPatch[], summary: { breaking:int, additive:int, neutral:int } }.

Rollback Semantics

  • Rollback = create new Snapshot from historical Snapshot content; never mutates past entries.
  • Metadata:
    • reason = "rollback to S-00A (v1.3.0)"
    • rollbackOf = "S-00A"
  • Events:
    • ConfigRollbackInitiated (audit)
    • ConfigPublished (normal propagation)
  • Safety Checks:
    • Policy re‑validation on rollback content (schemas may have evolved).
    • Optional “force” flag to bypass non‑breaking warnings (requires tenant.admin).
  • Environment pins (aliases) update to point to the new rollback Snapshot.

Immutability Guarantees

| Area | Guarantee | Mechanism |
| --- | --- | --- |
| Snapshot content | Write‑once, tamper‑evident | Append‑only tables, content hash (ETag), optional signing (JWS) |
| Audit trail | Non‑repudiable | Append‑only log, sequence IDs, hash‑chain (optional) |
| SemVer tag | Mutable pointer | Separate mapping table; changes audited |
| Aliases/pins | Mutable pointer | Versioned alias map with audit |
| Secret values | Not embedded in materialized content | References (KMS/KeyVault path) substituted at resolve time |

Secrets & Immutability: to avoid hash drift and data exposure, Snapshots store secret references (e.g., kvref://vault/secret#version) rather than plaintext. The resolved value is injected at read time by SDK/Resolver based on caller’s credentials.


Storage & Indices (solution‑level)

Tables (logical):

  • ConfigSet(id, tenantId, name, appId, status, createdAt, updatedAt, etagLatest)
  • Snapshot(id, configSetId, hash, createdAt, createdBy, source, reason, policySummaryJson, parentsJson)
  • SnapshotContent(snapshotId, contentJson, sizeBytes) — optional columnar or compressed storage
  • VersionTag(configSetId, semver, snapshotId, taggedAt, taggedBy) — unique(configSetId, semver)
  • Alias(configSetId, alias, snapshotId, updatedAt, updatedBy) — alias in {latest,stable,lts,dev-current,qa-current,prod-current,…}
  • Audit(id, tenantId, setId?, snapshotId?, action, actor, time, attrsJson)

Indexes (suggested):

  • Snapshot(configSetId, createdAt desc) for latest queries
  • VersionTag(configSetId, semver) and Alias(configSetId, alias)
  • Audit(tenantId, time desc); Audit(snapshotId)

Retention:

  • Keep all Snapshots by default.
  • Optional policy: retain N latest per set (e.g., 200) excluding those pinned by VersionTag/Alias or referenced by Deployments.

Concurrency, ETags & Idempotency

  • Draft mutations: optimistic writes with If-Match: <ETagDraft>.
  • Snapshot creation: idempotent by (tenantId|configSetId|hash); server returns existing Snapshot if hash matches.
  • Version tags: set/update with If-Match on current mapping row; conflict → 412 Precondition Failed.
  • Deployments/Refresh: require Idempotency-Key (as specified in REST cycle).
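
A hedged client-side sketch of these rules over REST using HttpClient; the snapshot endpoint and Idempotency-Key header follow the lifecycle above, while the tag request body shape is illustrative:

using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

public static class RegistryHttpExamples
{
    // Snapshot creation is idempotent: repeating the call with the same Idempotency-Key
    // (and identical content hash) returns the existing Snapshot instead of creating a duplicate.
    public static async Task<HttpResponseMessage> CreateSnapshotAsync(
        HttpClient http, string configSetId, string idempotencyKey)
    {
        using var request = new HttpRequestMessage(
            HttpMethod.Post, $"/config-sets/{configSetId}/snapshots");
        request.Headers.Add("Idempotency-Key", idempotencyKey);
        return await http.SendAsync(request);
    }

    // Version-tag updates use optimistic concurrency: If-Match carries the current mapping ETag;
    // a 412 Precondition Failed means another writer won, so re-read and retry.
    public static async Task<bool> TryTagAsync(
        HttpClient http, string configSetId, string semver, string snapshotId, string currentEtag)
    {
        using var request = new HttpRequestMessage(
            HttpMethod.Post, $"/config-sets/{configSetId}/tags")
        {
            Content = new StringContent(
                $"{{\"semver\":\"{semver}\",\"snapshotId\":\"{snapshotId}\"}}", Encoding.UTF8, "application/json")
        };
        request.Headers.IfMatch.Add(new EntityTagHeaderValue($"\"{currentEtag}\""));
        var response = await http.SendAsync(request);
        return response.StatusCode != HttpStatusCode.PreconditionFailed;
    }
}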

SDK Alignment

  • Read Path: SDK caches by (tenant, setId, context, path), using Resolve (gRPC) with if_none_match_etag. On RefreshEvent(etag), SDK fetches only when ETag changed.
  • Pinning: Resolve accepts version|semver|alias. When alias used, response includes actual snapshotId/semver for telemetry.
  • Diff Tooling: SDK helper to produce human diff from JsonPatch with path collapsing and schema annotations.

Operational Runbooks

Tagging & Promotion

  • Tag new Snapshot: POST /config-sets/{id}/tags { semver: "v1.4.0" }
  • Update environment pin: PUT /config-sets/{id}/aliases/prod-current -> S-00D
  • Audit both steps and verify ConfigPublished propagation.

Rollback

  • Locate target Snapshot in Studio (diff view vs current).
  • Trigger POST /config-sets/{id}:rollback { to: "S-00B" }
  • Monitor policy re‑validation; confirm alias pins and deployments updated.

Forensics

  • Export lineage: GET /config-sets/{id}/lineage?format=graphml|json
  • Verify hash chain matches SBOM and policy versions.

Example: Materialization & Hash

// canonical-json (excerpt)
{
  "apps": {
    "billing": {
      "db": {
        "connectionString": "kvref://vault/billing-db#v3",
        "poolSize": 50
      }
    }
  },
  "_meta": {
    "schemaVersion": "2025-08-01",
    "labels": ["prod","blue"]
  }
}
// ETag = base64url(SHA-256(bytes(canonical-json)))
// Secrets will be resolved by SDK/Resolver at runtime using kvref.

Studio UI: Version Tree & Diff UX

  • Tree view of snapshots with badges (semver, aliases, deployments).
  • Diff panel: switch between structural/semantic views, filter by severity (breaking|additive|neutral).
  • Rollback CTA gated by policy result; requires approval if breaking changes detected.

Solution Architect Notes

  • Start with linear lineage (single parent); design tables to allow future DAG merges.
  • Implement canonicalizer as a shared library (used by Registry & SDKs) to avoid hashing inconsistencies.
  • Add optional snapshot signing (JWS) in a future cycle to strengthen tamper evidence and supply chain stories.
  • Confirm secret reference design with Security and Provider Adapter teams to ensure consistent resolution across clouds.
  • Define policy‐driven SemVer bumping: server can recommend version bump level based on semantic diff.

Storage Strategy — CockroachDB (MT‑Aware) & Redis Cache Topology

Objectives

  • Provide a multi‑tenant, multi‑region storage design that preserves immutability for configuration versions and snapshots.
  • Optimize read latency for SDKs/agents via Redis + regional replicas while keeping CockroachDB (CRDB) as the system of record.
  • Enforce edition‑aware retention, partitioning, and operational recoverability (PITR, incremental backups, table‑level restore).

Logical Data Model (solution view)

erDiagram
    TENANT ||--o{ APP : owns
    TENANT ||--o{ ENVIRONMENT : scopes
    APP ||--o{ NAMESPACE : groups
    NAMESPACE ||--o{ CONFIG_SET : defines
    CONFIG_SET ||--o{ CONFIG_VERSION : immutably_versions
    CONFIG_VERSION ||--o{ SNAPSHOT : captures
    CONFIG_VERSION ||--o{ DIFF : compares
    CONFIG_SET ||--o{ POLICY_BINDING : governed_by
    CONFIG_SET ||--o{ ROLLOUT : deployed_via
    ROLLOUT ||--o{ ROLLOUT_STEP : stages
    EVENT_AUDIT }o--|| TENANT : scoped

    TENANT {
      uuid tenant_id PK
      text slug UNIQUE
      text edition  // "Free","Pro","Enterprise"
      string crdb_region_home
    }
    APP {
      uuid app_id PK
      uuid tenant_id FK
      text key  // unique per tenant
      text display_name
    }
    ENVIRONMENT {
      uuid env_id PK
      uuid tenant_id FK
      text name  // "dev","test","prod"
    }
    NAMESPACE {
      uuid ns_id PK
      uuid tenant_id FK
      uuid app_id FK
      text path   // e.g. "payments/api"
    }
    CONFIG_SET {
      uuid set_id PK
      uuid tenant_id FK
      uuid ns_id FK
      uuid env_id FK
      text set_key
      bool is_composite
      text content_hash  // current head
    }
    CONFIG_VERSION {
      uuid version_id PK
      uuid tenant_id FK
      uuid set_id FK
      string semver
      timestamptz created_at
      text author
      text change_summary
      jsonb content  // canonical compiled payload
      text content_hash UNIQUE
      bool is_head   // convenience flag
      text provenance  // URI/commit ref
      bool signed
      bytea signature
    }
    SNAPSHOT {
      uuid snapshot_id PK
      uuid tenant_id FK
      uuid version_id FK
      timestamptz captured_at
      jsonb content
      text source  // "manual","pre-rollout","post-rollout"
    }
    DIFF {
      uuid diff_id PK
      uuid tenant_id FK
      uuid from_version_id FK
      uuid to_version_id FK
      jsonb diff_json  // RFC6902/semantic diff
      text strategy    // "jsonpatch","semantic"
    }
    POLICY_BINDING {
      uuid policy_id PK
      uuid tenant_id FK
      uuid set_id FK
      jsonb rules
    }
    ROLLOUT {
      uuid rollout_id PK
      uuid tenant_id FK
      uuid set_id FK
      uuid target_env_id FK
      text status  // "planned","in-progress","complete","failed","rolled-back"
      text strategy  // "all-at-once","canary","wave"
    }
    ROLLOUT_STEP {
      uuid step_id PK
      uuid tenant_id FK
      uuid rollout_id FK
      int step_no
      jsonb selector  // services/pods/regions
      jsonb outcome
    }
    EVENT_AUDIT {
      uuid event_id PK
      uuid tenant_id FK
      text type
      jsonb payload
      timestamptz at
      text actor
      text trace_id
    }

Notes

  • tenant_id scopes every row. DTOs and queries must supply tenant_id (and often env_id) for isolation and index selectivity.
  • CONFIG_VERSION.content is the immutable compiled document delivered to SDKs (policy already applied). Raw fragments (if used) live in internal tables or object storage and are referenced via provenance.

Physical Design in CockroachDB

Multi‑Region & Locality

  • Database locality: ALTER DATABASE ecs SET PRIMARY REGION <home>; ADD REGION <others>; SURVIVE REGION FAILURE;
  • Table locality:
    • Control/lookup tables: REGIONAL BY ROW with column crdb_region (derived from tenant’s home or set/rollout target).
    • Global reference tables (rare): GLOBAL (e.g., editions, feature flags) to avoid cross‑region fan‑out.
  • Write routing: SDK/Studio via Gateway injects crdb_region/tenant_id; CRDB routes to nearest leaseholder.

Keys, Sharding & Indexes

  • Primary keys: hash‑sharded to avoid hot‑ranges.
    • Example: PRIMARY KEY (tenant_id, set_id, semver) USING HASH WITH BUCKET_COUNT = 16
  • Surrogate IDs: prefer UUIDv7 for time‑ordered locality; store also semver for human lookup.
  • Secondary indexes (all with STORING where helpful):
    • CONFIG_VERSION(tenant_id, set_id, is_head DESC, created_at DESC)
    • CONFIG_VERSION(tenant_id, content_hash) for fast idempotency checks.
    • ROLLOUT(tenant_id, target_env_id, status) for orchestration queries.
  • JSONB columns (content, diff_json) gain targeted GIN indexes for frequently filtered paths (e.g., /featureToggles/*).

Concurrency, Immutability & Idempotency

  • New version creation:
    • Compute content_hash; reject duplicates (idempotent PUT).
    • Mark prior is_head = false atomically.
    • Use SERIALIZABLE transactions with small write sets (CRDB default).
  • No in‑place edits of CONFIG_VERSION.content. Rollback = new version with change_summary="revert to X" and pointer flip.
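
A minimal sketch of that new-version transaction using the Npgsql driver against the config_version table defined below; CockroachDB may require a client-side retry on serialization restarts (SQLSTATE 40001), which is omitted here:

using System;
using System.Threading.Tasks;
using Npgsql;

public static class ConfigVersionWriter
{
    // Runs as one SERIALIZABLE transaction (the CockroachDB default).
    public static async Task PublishVersionAsync(
        NpgsqlConnection conn, // assumed open
        Guid tenantId, Guid setId, string semver, string contentJson, string contentHash)
    {
        await using var tx = await conn.BeginTransactionAsync();

        // 1) Idempotency: identical content already stored for this tenant -> nothing to do.
        using (var check = new NpgsqlCommand(
            "SELECT 1 FROM ecs.config_version WHERE tenant_id = @t AND content_hash = @h", conn, tx))
        {
            check.Parameters.AddWithValue("t", tenantId);
            check.Parameters.AddWithValue("h", contentHash);
            if (await check.ExecuteScalarAsync() is not null) { await tx.RollbackAsync(); return; }
        }

        // 2) Demote the current head for this set.
        using (var demote = new NpgsqlCommand(
            "UPDATE ecs.config_version SET is_head = false WHERE tenant_id = @t AND set_id = @s AND is_head", conn, tx))
        {
            demote.Parameters.AddWithValue("t", tenantId);
            demote.Parameters.AddWithValue("s", setId);
            await demote.ExecuteNonQueryAsync();
        }

        // 3) Append the new immutable version as head; content is never updated in place.
        using (var insert = new NpgsqlCommand(
            "INSERT INTO ecs.config_version (tenant_id, set_id, semver, content, content_hash, is_head) " +
            "VALUES (@t, @s, @v, @c::JSONB, @h, true)", conn, tx))
        {
            insert.Parameters.AddWithValue("t", tenantId);
            insert.Parameters.AddWithValue("s", setId);
            insert.Parameters.AddWithValue("v", semver);
            insert.Parameters.AddWithValue("c", contentJson);
            insert.Parameters.AddWithValue("h", contentHash);
            await insert.ExecuteNonQueryAsync();
        }

        await tx.CommitAsync();
    }
}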

Sample DDL (illustrative)

-- Multi-region setup (performed once)
ALTER DATABASE ecs SET PRIMARY REGION eu-central;
ALTER DATABASE ecs ADD REGION eu-west;
ALTER DATABASE ecs ADD REGION us-east;
ALTER DATABASE ecs SURVIVE REGION FAILURE;

-- Config versions table
CREATE TABLE ecs.config_version (
  tenant_id UUID NOT NULL,
  set_id UUID NOT NULL,
  version_id UUID NOT NULL DEFAULT gen_random_uuid(),
  semver STRING NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  author STRING NULL,
  change_summary STRING NULL,
  content JSONB NOT NULL,
  content_hash STRING NOT NULL,
  provenance STRING NULL,
  is_head BOOL NOT NULL DEFAULT false,
  signed BOOL NOT NULL DEFAULT false,
  signature BYTES NULL,
  crdb_region crdb_internal_region NOT NULL DEFAULT default_to_database_primary_region(gateway_region()),
  CONSTRAINT pk_config_version PRIMARY KEY (tenant_id, set_id, semver) USING HASH WITH BUCKET_COUNT = 16,
  UNIQUE (tenant_id, content_hash),
  UNIQUE (tenant_id, set_id, version_id)
) LOCALITY REGIONAL BY ROW;

CREATE INDEX ix_config_version_head ON ecs.config_version (tenant_id, set_id, is_head DESC, created_at DESC) STORING (content_hash, version_id);

-- TTL example for audit events (see Retention section)
CREATE TABLE ecs.event_audit (
  tenant_id UUID NOT NULL,
  event_id UUID NOT NULL DEFAULT gen_random_uuid(),
  type STRING NOT NULL,
  payload JSONB NOT NULL,
  at TIMESTAMPTZ NOT NULL DEFAULT now() ON UPDATE now(),
  actor STRING NULL,
  trace_id STRING NULL,
  crdb_region crdb_internal_region NOT NULL DEFAULT default_to_database_primary_region(gateway_region()),
  ttl_expires_at TIMESTAMPTZ NULL,
  CONSTRAINT pk_event_audit PRIMARY KEY (tenant_id, at, event_id)
) LOCALITY REGIONAL BY ROW
  WITH (ttl = 'on', ttl_expiration_expression = 'ttl_expires_at', ttl_job_cron = '@hourly');

Redis Cache Topology (read path acceleration)

Topology

  • Redis Cluster (3+ shards per region, replicas ×1) deployed per cloud region close to ECS Gateway.
  • Tenant‑pinned key hashing using hash‑tags to keep hot tenants together and enable selective scaling: ecs:{tenantId}:{env}:{ns}:{setKey}:v{semver} → value = compact binary (MessagePack) or gzip JSON.
  • SDK Near‑Cache (optional) + soft TTL to bound staleness without thundering herds.

Cache Patterns

| Concern | Pattern |
| --- | --- |
| Freshness | Pub/Sub channel ecs:invalidate:{tenantId} from Refresh Orchestrator; SDKs subscribe (WebSocket) or poll with backoff. |
| Warm‑up | On new head: background warmers push into Redis (read‑through fallback to CRDB on miss). |
| Stampede control | Singleflight with Redis SET key lock NX PX=<ms>; losers await pub/sub invalidate. |
| Multi‑region | Writes publish region‑scoped invalidations; cross‑region mirror via replication stream (or event bus) for global tenants. |
| Security | Values include HMAC(content_hash); SDKs verify before use. |
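
A sketch of the value-MAC check from the table above, assuming the cached envelope carries the content hash plus an HMAC-SHA256 computed with a per-tenant secret (names illustrative):

using System;
using System.Security.Cryptography;
using System.Text;

public static class CacheValueMac
{
    // Producer side (Refresh Orchestrator / cache warmers): MAC the content hash with the tenant secret.
    public static string Sign(string contentHash, byte[] tenantSecret)
    {
        using var hmac = new HMACSHA256(tenantSecret);
        return Convert.ToHexString(hmac.ComputeHash(Encoding.UTF8.GetBytes(contentHash)));
    }

    // Consumer side (SDK): verify before trusting a cached value; constant-time comparison.
    public static bool Verify(string contentHash, string presentedMac, byte[] tenantSecret)
    {
        var expected = Convert.FromHexString(Sign(contentHash, tenantSecret));
        var actual = Convert.FromHexString(presentedMac);
        return CryptographicOperations.FixedTimeEquals(expected, actual);
    }
}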

TTL & Sizing

  • Default TTL = 5–30s (edition‑dependent), max TTL for offline resilience (e.g., 10m).
  • Memory policy: allkeys-lru (Free/Pro), volatile-lru (Enterprise with pinning for critical keys).
  • Per‑tenant quotas enforced by keyspace cardinality + prefix scanning in ops tools.

Partitioning Strategy

By Tenant & Region

  • Every table keyed by tenant_id; hash‑sharded PK prevents hot ranges.
  • REGIONAL BY ROW + crdb_region places data near compute. Rollouts targeting us-east can place derived rows there.

By Time (high‑volume tables)

  • EVENT_AUDIT and rollout telemetry: composite PK (tenant_id, at, event_id) enables time‑bounded range scans and row‑level TTL.
  • Optional monthly export/compact to object storage for long‑term archives.

Large Payloads

  • Keep operational content (<= 128 KB) in CRDB JSONB; store oversized artifacts (e.g., generated diffs > 1 MB) in object storage and reference by URI in provenance.

Retention & Data Lifecycle

| Data class | Default | Edition overrides | Mechanism |
| --- | --- | --- | --- |
| Head versions (is_head = true) | Keep indefinitely | none | |
| Historical config versions | 365 days | Enterprise: infinite / policy based | Soft policy + archival export |
| Snapshots (pre/post rollout) | 180 days | Enterprise: 730 days | CRDB row‑level TTL on ttl_expires_at |
| Audit events, access logs | 90 days | Enterprise: 365 days | TTL + periodic export (Parquet) |
| Diff artifacts | 90 days | Pro/Ent: 180 days | TTL |

Implementation

  • Set ttl_expires_at = now() + INTERVAL 'N days' by class & edition.
  • Nightly export of expiring partitions to object storage (Parquet) before purge.

Backup & Restore (RPO/RTO)

Objectives

  • RPO ≤ 5 minutes (Enterprise), ≤ 15 minutes (Pro), ≤ 24 hours (Free).
  • RTO ≤ 30 minutes for tenant‑scoped table restore; ≤ 2 hours full cluster DR (Enterprise).

Strategy

  • Cluster‑wide scheduled backups:
    • Weekly full + hourly incremental to cloud bucket (regional replica buckets for locality).
    • Encryption with cloud KMS; rotate keys quarterly.
  • Changefeeds (optional) for external archival/analytics sinks.
  • PITR: CRDB protected timestamps on the backup schedule to enable point‑in‑time restore.
  • Tenant‑scoped restore: use export/restore with tenant_id predicate via AS OF SYSTEM TIME + SELECT INTO staging + validated merge (playbook below).

Operational Runbook (excerpt)

  1. Incident triage: identify tenant/environment, time window, affected tables.
  2. Quarantine writes: toggle tenant read‑only via Policy Engine; invalidate Redis keys.
  3. Staging restore: create temporary DB from the latest full+incremental or PITR to the timestamp.
  4. Diff & verify: compare CONFIG_VERSION by content_hash; verify signatures/HMAC; rehearse on staging.
  5. Merge: upsert corrected rows by (tenant_id, set_id, semver) into prod; rebuild is_head where needed (transaction).
  6. Warm cache: repopulate Redis; emit RefreshRequested events.
  7. Close incident: re‑enable writes; attach audit and post‑mortem.

Observability & Capacity Planning

Key Metrics

  • CRDB: sql_bytes, ranges, qps, txn_restarts, kv.raft.process.logcommit.latency, per‑region leaseholder distribution.
  • Redis: hits/misses, evictions, blocked_clients, latency, keyspace_per_tenant.
  • ECS: cache hit ratio (SDK), p95 config fetch latency, rollout convergence time.

Back‑of‑Envelope Sizing (initial)

  • Avg config payload: 8–32 KB (compressed on wire).
  • Tenant cardinality: N_tenants; each with ~A apps × E env × S sets × V versions.
  • Storage ≈ N * A * E * S * V * 24 KB + indexes (~1.6×). Start with 3‑region, 9‑node CRDB (3 per region, n2-standard-8 class) and 3× Redis shards per region.
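
As a purely illustrative plug-in of numbers: 200 tenants × 3 apps × 3 environments × 5 sets × 100 retained versions ≈ 900,000 versions; at 24 KB each that is ≈ 21.6 GB of payload, or roughly 35 GB once the ~1.6× index overhead is included.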

Integration with Refresh Orchestrator

  • On CONFIG_VERSION commit (is_head=true flip), emit ConfigHeadChanged event.
  • Orchestrator:
    1. Upsert compiled payload to Redis {tenant} shard in write region.
    2. Publish ecs:invalidate:{tenantId} with (setKey, semver, content_hash).
    3. For canary/waves, publish scoped invalidations using {selector}.

Security Considerations

  • Row‑level scoping by tenant_id enforced in all data access paths; Gateway injects claims → service verifies.
  • At‑rest encryption (CRDB/TDE if available) + backup encryption with KMS.
  • Redis: TLS, AUTH/ACLs, key prefix isolation, and value MAC (HMAC of content_hash + tenant secret) validated by SDKs.

Solution Architect Notes

  • If config content > 128 KB per set becomes common, move content to object storage and store pointers + hashes in CRDB; keep small materialized views for hot paths in Redis.
  • Validate whether per‑tenant PITR is required for Free/Pro; if not, restrict to Enterprise to control storage cost.
  • Decide diff strategy default (jsonpatch vs semantic) per SDK’s needs; semantic tends to be friendlier for rollout previews.

Acceptance Criteria (engineering hand‑off)

  • DDL migrations emitted with multi‑region locality, hash‑sharded PKs, and required indexes.
  • Row‑level TTL configured for audit / diff tables with edition‑aware durations.
  • Redis cluster charts (Helm) with per‑region topology, metrics, and alerting; key naming spec published to SDK teams.
  • Backup schedules, KMS bindings, and restore playbook present in runbooks; table‑level rehearsal completed in staging.
  • Telemetry dashboards showing p95 CRUD, cache hit %, restore drill duration, and rollout convergence.

Eventing & Change Propagation — CloudEvents, topics, delivery semantics, DLQs, replay; MassTransit conventions

Objectives

Define the event model and propagation pipeline used by ECS to notify SDKs/services of configuration changes with predictable semantics:

  • CloudEvents 1.0 envelopes (JSON structured-mode).
  • Clear topic map & naming.
  • At‑least‑once delivery with idempotency, ordering keys, DLQs, and replay.
  • MassTransit conventions for Azure Service Bus (default) and RabbitMQ (alt).

Event Envelope (CloudEvents)

Transport: JSON (structured mode) over AMQP (ASB) or AMQP 0‑9‑1 (Rabbit); HTTP/gRPC use binary mode when needed.

Required attributes

  • specversion: "1.0"
  • id: string — globally unique per event
  • source: "ecs://{region}/{service}/{resource}" — e.g., ecs://eu-central/config-registry/config-sets/4b...
  • type: "ecs.{domain}.v1.{EventName}" — e.g., ecs.config.v1.ConfigPublished
  • subject: "{tenantId}/{pathOrSetId}" — routing hint for consumers
  • time: RFC3339

Extensions (multi‑tenant & lineage)

  • tenantid: string
  • editionid: string
  • environment: string (dev|test|staging|prod)
  • etag: string
  • version: string (snapshot or semver)
  • correlationid: string (propagated from API)
  • actor: string (sub@iss)
  • region: string (emit region)

Example (structured)

{
  "specversion": "1.0",
  "id": "e-9f1b2f7f-b49e-4f5a-90d0-93a7a6a4b0ef",
  "source": "ecs://eu-central/config-registry/config-sets/4b1c",
  "type": "ecs.config.v1.ConfigPublished",
  "subject": "tnt-7d2a/apps/billing/sets/runtime",
  "time": "2025-08-24T09:12:55Z",
  "tenantid": "tnt-7d2a",
  "editionid": "enterprise",
  "environment": "prod",
  "version": "v1.8.0",
  "etag": "9oM1hQ...",
  "correlationid": "c-77f2...",
  "region": "eu-central",
  "data": {
    "setId": "4b1c...",
    "path": "apps/billing/**",
    "changes": { "breaking": 0, "additive": 3, "neutral": 1 }
  }
}

Topic Map & Naming

Exchange/Topic names (canonical, kebab-case):

  • ecs.config.events — lifecycle of config sets & versions
    • ecs.config.v1.ConfigDraftSaved
    • ecs.config.v1.ConfigPublished
    • ecs.config.v1.ConfigRolledBack
  • ecs.policy.events — policy/edition/schema changes
    • ecs.policy.v1.PolicyUpdated
  • ecs.refresh.events — cache targets & client refresh signals
    • ecs.refresh.v1.CacheInvalidated
    • ecs.refresh.v1.RefreshRequested
  • ecs.adapter.events — external provider sync results
    • ecs.adapter.v1.SyncCompleted
    • ecs.adapter.v1.SyncFailed

Routing/partition keys

  • Primary key: tenantid
  • Secondary (optional): pathPrefix or setId
  • Ensures per‑tenant ordering and hot-tenant isolation.

Queue/Subscription naming (MassTransit)

  • Consumer queues: ecs-{service}-{consumer}-{env} (e.g., ecs-refresh-orchestrator-configpublished-prod)
  • Error/DLQ: {queue}_error, {queue}_skipped (parking lot)
  • Scheduler (delayed): ecs-scheduler (uses ASB/Rabbit delayed delivery plugin where available)

Delivery Semantics

| Property | Choice | Rationale |
| --- | --- | --- |
| Delivery | At‑least‑once | Simpler guarantees; consumers must be idempotent |
| Ordering | Best‑effort global, ordered per (tenantid, key) | Partitioning by tenant keeps most flows ordered |
| Idempotency | Required at consumers | Dedupe by id (event store) or (tenantid, path, etag) |
| Visibility | Structured CloudEvents | Uniform across bus/HTTP |
| Fan‑out | Topic → consumer queues | Decoupled consumers; backpressure per queue |

Consumer idempotency keys

  • Config state changes: (tenantid, setId, version|etag)
  • Cache invalidation: (tenantid, path, etag)
  • Adapter sync: (tenantid, provider, cursor)

Retries, DLQs, and Parking Lots

Retry policy (MassTransit middleware)

  • Immediate retries: 3
  • Exponential retries: 5 attempts, 2s → 1m (jitter)
  • Circuit‑break: open after 50 failures/60s; half-open after 30s

DLQ

  • Poison messages route to {queue}_error with headers:
    • mt-fault-message, mt-reason, mt-host, mt-exception-type, stacktrace
    • CloudEvents attributes echoed for triage (tenantid, type, id)
  • Parking lot {queue}_skipped for known non‑actionable events (e.g., stale versions), enabling manual replay later.

Operational actions

  • Retry single message: move from DLQ → main queue
  • Bulk reprocess: export DLQ to blob, filter, re‑enqueue via Replay Worker

Replay Strategy

Sources

  1. Outbox/Audit log (authoritative): every emitted event recorded with status=Emitted|Pending.
  2. Broker retention (short): not guaranteed for long windows → rely on outbox.

Replay worker

  • Input: time/tenant filter or cursor range
  • Reads outbox, re‑emits CloudEvents with new id and data.replayOf="<original-id>" (to avoid consumer dedupe drop)
  • Throttled per tenant; writes operator annotations in meta (reason: "replay", ticket: ...)

Consumer expectations

  • Treat replayOf as informational; still dedupe on current id
  • Business logic must tolerate duplicate state transitions

Outbox & Inbox Patterns

Outbox (publisher side, e.g., Config Registry)

  • Within the same transaction as state change:
    • Append Outbox(tenantid, type, subject, data, eventId, status='Pending')
  • Background dispatcher (MassTransit) reads Pending, publishes, marks Emitted
  • Guarantees atomicity between DB write and event publication

Inbox (consumer side, optional)

  • Table Inbox(eventId, consumer, processedAt)
  • Before handling, check presence; after success, insert → ensures exactly‑once processing per consumer
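
A sketch of a consumer applying the inbox check; IInboxStore is a hypothetical abstraction over the Inbox table above, and the bus MessageId stands in for the CloudEvents id:

using System;
using System.Threading.Tasks;
using MassTransit;

public record ConfigPublished(string SetId, string Path, string SnapshotId, string Version, string Etag);

// Hypothetical abstraction over Inbox(eventId, consumer, processedAt); returns false if already processed.
public interface IInboxStore
{
    Task<bool> TryMarkProcessedAsync(Guid eventId, string consumer);
}

public class ConfigPublishedConsumer : IConsumer<ConfigPublished>
{
    private readonly IInboxStore _inbox;
    public ConfigPublishedConsumer(IInboxStore inbox) => _inbox = inbox;

    public async Task Consume(ConsumeContext<ConfigPublished> context)
    {
        var eventId = context.MessageId ?? Guid.NewGuid();
        if (!await _inbox.TryMarkProcessedAsync(eventId, nameof(ConfigPublishedConsumer)))
            return; // duplicate delivery: at-least-once transport, exactly-once effect per consumer

        // Business logic: fan out cache invalidation / refresh keyed by (tenantid, setId, etag).
        await Task.CompletedTask;
    }
}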

MassTransit Conventions (ASB default, Rabbit alt)

Common

  • EntityNameFormatter = kebab-case
  • Endpoint per message type disabled; use explicit endpoints per service
  • Prefetch tuned per consumer: start = 32–128, adjust by p99
  • ConcurrentMessageLimit per handler: start = CPU cores * 2
  • Observability: OpenTelemetry enabled; activity name = ecs.{messageType}; tags: tenantid, editionid, environment, etype

Azure Service Bus

cfg.UsingAzureServiceBus((context, sb) =>
{
  sb.Host(connStr);
  sb.Message<ConfigPublished>(m => m.SetEntityName("ecs.config.events"));
  sb.SubscriptionEndpoint<ConfigPublished>("ecs-refresh-orchestrator-configpublished-prod", e =>
  {
    e.ConfigureConsumer<ConfigPublishedConsumer>(context);
    e.PrefetchCount = 128;
    e.MaxAutoRenewDuration = TimeSpan.FromMinutes(5);
    e.EnableDeadLetteringOnMessageExpiration = true;
    e.UseMessageRetry(r => r.Exponential(5, TimeSpan.FromSeconds(2), TimeSpan.FromMinutes(1), TimeSpan.FromSeconds(3)));
    e.UseCircuitBreaker(cb => cb.ResetInterval = TimeSpan.FromSeconds(30));
    e.UseInMemoryOutbox(); // plus persistent Outbox at publisher
  });
});

RabbitMQ

cfg.UsingRabbitMq((context, rmq) =>
{
  rmq.Host(host, "/", h => { h.Username(user); h.Password(pass); });
  rmq.Message<ConfigPublished>(m => m.SetEntityName("ecs.config.events"));
  rmq.ReceiveEndpoint("ecs-refresh-orchestrator-configpublished-prod", e =>
  {
    e.Bind("ecs.config.events", x => { x.RoutingKey = "ecs.config.v1.ConfigPublished"; x.ExchangeType = ExchangeType.Topic; });
    e.PrefetchCount = 64;
    e.UseMessageRetry(r => r.Exponential(5, TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(60), TimeSpan.FromSeconds(3)));
    e.UseDelayedRedelivery(r => r.Intervals(TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(30)));
    e.UseInMemoryOutbox();
  });
});

Standard Event Contracts (data payloads)

ecs.config.v1.ConfigPublished

{
  "setId": "4b1c...",
  "path": "apps/billing/**",
  "snapshotId": "S-00F",
  "version": "v1.8.0",
  "etag": "9oM1hQ...",
  "policy": { "breaking": false, "violations": [] }
}

ecs.refresh.v1.CacheInvalidated

{
  "scope": { "path": "apps/billing/**", "environment": "prod" },
  "targets": ["redis", "sdk"],
  "reason": "publish",
  "etag": "9oM1hQ..."
}

ecs.adapter.v1.SyncCompleted

{
  "provider": "azure-appconfig",
  "cursor": "w/1724490000",
  "items": 128,
  "changes": { "upserts": 120, "deletes": 8 }
}

End‑to‑End Flow (publish → refresh)

sequenceDiagram
  participant UI as Studio
  participant API as Registry
  participant OB as Outbox
  participant BUS as Event Bus
  participant ORC as Refresh Orchestrator
  participant RED as Redis
  participant SDK as Client SDK

  UI->>API: POST /config-sets/{id}/snapshots (publish)
  API-->>OB: tx write + outbox pending(ConfigPublished)
  API-->>UI: 201 Snapshot/Version
  OB-->>BUS: publish CloudEvent (ConfigPublished)
  BUS-->>ORC: deliver
  ORC->>RED: scoped invalidation
  ORC-->>BUS: publish CacheInvalidated
  BUS-->>SDK: deliver (via WS bridge) / notify
  SDK->>API: Resolve(etag) → 200/304

Backpressure & Throttling

  • Per‑tenant quotas (see Gateway cycle); event consumers must:
    • Pause on RESOURCE_EXHAUSTED (MassTransit retry with backoff)
    • Limit in‑flight resolves on refresh storms (coalesce by (tenantid,path) within 500 ms window)
  • Orchestrator maintains bounded buffers; on overflow, drops to parking lot and raises RefreshDegradation alert.

Security

  • Bus credentials scoped to producer/consumer roles (least privilege).
  • CloudEvents actor and tenantid are validated from JWT at publisher; consumers must not trust unvalidated extensions from third parties.
  • Events avoid embedding secret values; only hashes/refs.

Observability & Alerting

Spans

  • ecs.publish, ecs.invalidate, ecs.replay
  • Attributes: tenantid, event.type, queue, delivery.attempt

Metrics

  • Publish success rate, end‑to‑end propagation lag (publish→SDK resolve), DLQ size, replay throughput
  • Consumer handler p95/p99, retries, circuit state

Alerts

  • PropagationLagP95>5s (5m)
  • DLQSize>100 (10m) per queue
  • ReplayFailureRate>1%

Acceptance Criteria (engineering hand‑off)

  • MassTransit bus configuration (ASB + Rabbit) with kebab-case entities, retries, DLQs, OTEL.
  • CloudEvents envelope library (shared) with validation and extensions.
  • Outbox dispatcher and Replay worker with CLI & Studio hooks.
  • Contract tests: idempotency, partition ordering, DLQ/replay, propagation lag budget.
  • Runbooks: DLQ triage, targeted replay, tenant backpressure override.

Solution Architect Notes

  • Start with per‑tenant partitioning; introduce path‑sharded partitions if a few tenants dominate traffic.
  • Consider compaction (e.g., Kafka) when adding an analytics/event‑sourcing lane; current MVP favors ASB/Rabbit simplicity.
  • Validate mobile clients via server‑side WS bridge that consumes ecs.refresh.events and fans out over WebSocket/SSE with CloudEvents.

Refresh & Invalidation Flows — SDK pull/long‑poll/websocket, server push, cache stampede protection, ETags

Objectives

Design the end‑to‑end cache refresh and invalidation path that keeps SDKs and services current within seconds, while protecting the platform from thundering herds and ensuring multi‑tenant isolation.

Outcomes

  • Three client models: periodic pull, long‑poll, WebSocket/SSE push.
  • Server‑side targeted invalidation and coalesced refresh.
  • ETag/conditional fetch as the primary coherency mechanism (304/not_modified).
  • Stampede control at SDK and edge with singleflight + distributed locks and stale‑while‑revalidate.

Control Plane vs Data Plane

| Plane | Responsibility | Tech |
| --- | --- | --- |
| Control | Propagate change intent | CloudEvents (ConfigPublished, CacheInvalidated) |
| Data | Deliver effect (new config) | REST GET /resolve (ETag) and gRPC Resolve/ResolveBatch |

Principle: Control plane never carries config values; clients always refetch using ETag‑aware reads.


Client Models

1) Periodic Pull (baseline)

  • SDK timer fetches with If-None-Match: <etag> every T seconds (edition‑aware default: Starter=30s, Pro=10s, Enterprise=5s).
  • Pros: simplest, firewall‑friendly.
  • Cons: higher background traffic; longer staleness.

2) Long‑Poll

  • SDK calls /resolve?waitForChange=true&timeout=30s (or gRPC Resolve with deadline).
  • Server blocks until a new ETag or timeout → 200 with value or 304 Not Modified.
  • Pros: low idle traffic, near‑real‑time without persistent sockets.

3) WebSocket / SSE Push (premium)

  • SDK subscribes to Refresh Channel (WebSocket or gRPC streaming).
  • Server pushes RefreshEvent(etag, path, cursor); SDK revalidates via Resolve.
  • Pros: sub‑second fanout, very low latency.
  • Cons: long‑lived connections; need heartbeats and resumption.

Flow Diagrams

A) Publish → Fanout → Client Revalidate

sequenceDiagram
  participant Studio as Studio UI
  participant Registry as Config Registry
  participant Orchestrator as Refresh Orchestrator
  participant Redis as Redis Cache
  participant SDK as App SDK

  Studio->>Registry: Publish snapshot (idempotent)
  Registry-->>Redis: DEL tenant:path:* (scoped invalidation)
  Registry->>Orchestrator: emit ConfigPublished(tenant, path, etag)
  Orchestrator-->>SDK: push RefreshEvent(tenant, path, etag) [WS/L.Poll wake]
  SDK->>Registry: Resolve(path, If-None-Match: oldEtag)
  alt New ETag
    Registry-->>SDK: 200 value + ETag(new)
    SDK-->>SDK: Update L1 cache
  else No change
    Registry-->>SDK: 304 Not Modified
  end

B) Long‑Poll Resolve (HTTP)

sequenceDiagram
  participant SDK
  participant GW as API Gateway
  participant API as Resolve API
  SDK->>GW: GET /resolve?path=...&waitForChange=true&timeout=30 (If-None-Match: etag)
  GW->>API: forward (deadline=30s)
  API-->>API: await new etag OR timeout (register waiter keyed by tenant+path)
  alt etag changed
    API-->>SDK: 200 value + ETag
  else timeout
    API-->>SDK: 304 Not Modified
  end

C) WebSocket Refresh (server push)

sequenceDiagram
  participant SDK
  participant Push as Refresh WS Bridge
  SDK->>Push: WS CONNECT /ws/refresh (x-tenant-id, JWT)
  Push-->>SDK: HEARTBEAT 30s
  Push-->>SDK: RefreshEvent(path, etag, cursor)
  SDK->>Push: ACK cursor (bidi) OR implicit via next pull
  SDK->>API: Resolve(If-None-Match: etag)

Server‑Side Invalidation & Coalescing

Targets

  • Primary: Redis keys ecs:{tenant}:{env}:{set}:{path}
  • Secondary: In‑process L2 cache (optional) invalidated by local event bus.

Coalescing

  • The Resolve API maintains a waiter map per (tenant, path):
    • On publish, it wakes all waiters and debounces new waiters for 200–500 ms to batch misses.
  • The Orchestrator emits single refresh signals per (tenant, path, etag) and coalesces bursts within 250 ms windows.
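
An illustrative waiter registry for the long‑poll path: a handler awaits a new ETag for (tenant, path) or times out (answering 304), and a publish wakes every registered waiter; debouncing and list cleanup are simplified:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class WaiterRegistry
{
    private readonly ConcurrentDictionary<(string Tenant, string Path), List<TaskCompletionSource<string?>>> _waiters = new();

    // Long-poll handler: returns the new ETag, or null on timeout (caller responds 304).
    public async Task<string?> WaitForChangeAsync(string tenant, string path, TimeSpan timeout, CancellationToken ct)
    {
        var tcs = new TaskCompletionSource<string?>(TaskCreationOptions.RunContinuationsAsynchronously);
        var list = _waiters.GetOrAdd((tenant, path), _ => new List<TaskCompletionSource<string?>>());
        lock (list) list.Add(tcs);

        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        cts.CancelAfter(timeout);
        using var registration = cts.Token.Register(() => tcs.TrySetResult(null));

        var etag = await tcs.Task;
        lock (list) list.Remove(tcs);
        return etag;
    }

    // Publish path: wake every waiter registered for (tenant, path) with the new ETag.
    public void Publish(string tenant, string path, string newEtag)
    {
        if (!_waiters.TryGetValue((tenant, path), out var list)) return;
        TaskCompletionSource<string?>[] snapshot;
        lock (list) snapshot = list.ToArray();
        foreach (var tcs in snapshot) tcs.TrySetResult(newEtag);
    }
}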

SDK Caching & ETag Strategy

Cache layers

  • L1 (in‑process) with soft TTL and ETag index.
  • L2 (Redis) on server‑side when SDK sits behind a service (optional; SDK still resolves via API).

ETag rules

  • Treat the ETag as a strong validator of the materialized config (secrets remain references).
  • Always send If-None-Match on Resolve.
  • Respect not_modified/304 to preserve L1 value and extend soft TTL.

Suggested SDK algorithm (pseudo)

function getConfig(path, ctx):
  key = (tenant, ctx, path)
  entry = L1.get(key)
  if entry && entry.fresh(): return entry.value

  // singleflight per key to prevent stampede
  v = singleflight(key, () => {
      etag = entry?.etag
      resp = Resolve(path, ctx, if_none_match=etag, deadline=250ms)
      if resp.not_modified: 
         entry.touch()
         return entry.value
      else:
         L1.put(key, resp.resolved, resp.etag, soft_ttl(ctx))
         return resp.resolved
  })

  return v

Stampede Protection

| Layer | Technique | Notes |
| --- | --- | --- |
| SDK | singleflight per (tenant, ctx, path) | Collapse concurrent calls inside a process. |
| API | request collapsing + memoization for active resolves | Reuse upstream result for waiters. |
| Redis | distributed lock on cache fill (SET NX PX) | Timeout ≤ 250 ms; losers get stale‑while‑revalidate. |
| TTL | stale‑while‑revalidate (SWR) | Serve stale for ≤ 2s while a single refresher fetches. |
| Jitter | TTL jitter ±10–20% | Spread expirations to avoid sync spikes. |
| Backoff | jittered exponential after 429/503 | Max backoff 2s for reads. |

Negative caching: For consistent 404 resolves, cache NEGATIVE(etag) for short TTL (≤ 5s) to avoid hammering.
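
A per-key singleflight helper of the kind referenced in the stampede table above: concurrent callers for the same key share one in-flight fetch and its result (a sketch, not the shipped SDK utility):

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class SingleFlight<TKey, TValue> where TKey : notnull
{
    private readonly ConcurrentDictionary<TKey, Lazy<Task<TValue>>> _inFlight = new();

    // All concurrent callers for the same key await the same underlying task.
    public async Task<TValue> RunAsync(TKey key, Func<Task<TValue>> fetch)
    {
        var lazy = _inFlight.GetOrAdd(key,
            _ => new Lazy<Task<TValue>>(fetch, LazyThreadSafetyMode.ExecutionAndPublication));
        try
        {
            return await lazy.Value;
        }
        finally
        {
            // Remove after completion so the next stale or expired read triggers a fresh fetch.
            _inFlight.TryRemove(key, out _);
        }
    }
}

In an SDK, the conditional Resolve call (If-None-Match with the cached ETag) would be wrapped in RunAsync keyed by (tenant, ctx, path).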


Long‑Poll API Contract (REST)

GET /api/v1/resolve?path={p}&env={e}&version=latest&waitForChange=true&timeout=30
Headers:
  If-None-Match: "<etag>"
Responses:
  200 OK            body: value, headers: ETag: "<new>"
  304 Not Modified  (on timeout or same etag)

Server behavior

  • Max timeout = 30s (configurable).
  • Registers waiter; returns 304 on timeout to allow client to re‑issue without breaking caches.

WebSocket/SSE Contract (HTTP)

Handshake

  • GET /ws/refresh (WS) or GET /sse/refresh (SSE) with Authorization and x-tenant-id.

Messages (JSON)

// server -> client
{ "type":"refresh", "cursor":"c-12345", "path":"apps/billing/**", "etag":"9oM1hQ..." }
{ "type":"heartbeat", "ts":"2025-08-24T11:00:00Z" }
{ "type":"nack", "code":"ECS.SUBSCRIPTION.INVALID", "retryAfterMs":2000 }

// client -> server (WS only)
{ "type":"ack", "cursor":"c-12345" }

Resumption

  • Client reconnects with ?resumeAfter=c-<cursor> to avoid gaps; server backfills from replay window.

gRPC Alignment (from Contracts cycle)

  • ResolveRequest.if_none_match_etag → ResolveResponse.not_modified/etag.
  • RefreshChannel.Subscribe/SubscribeWithAck provide the push lane; SDKs always revalidate with Resolve.

Health, Heartbeats & Timeouts

  • WS/SSE heartbeat every 15–30s; close on 3 missed heartbeats.
  • Long‑poll deadline recommended = timeout + 2s.
  • SDK should rotate tokens before expiry; reconnect on UNAUTHENTICATED/401.

Rate Governance & Flood Safety

  • Gateway applies per‑tenant read QPS limits (edition‑aware).
  • Orchestrator coalesces and drops duplicates within a 250 ms window; emits a single refresh per ETag.
  • Clients must obey Retry-After on 429 and reduce concurrent resolves (cap to 2 in‑flight per process).

Failure Modes & Degradation Paths

| Failure | Client Behavior | Server Behavior |
| --- | --- | --- |
| WS bridge down | Fallback to long‑poll; increase T by ×2 | Auto‑heal; drain buffers; send replayOf on resume |
| Event bus lag | Continue periodic pull; widen interval by +50% | Alert; protect Redis with SWR |
| Redis unavailable | Resolve from DB (higher latency) | Bypass Redis; apply per‑tenant circuit breaker |
| ETag mismatch loops | Full resolve without If-None-Match once; then resume ETag path | Log and re‑materialize canonical cache line |
| Rate‑limit 429 | Jittered backoff (100–500 ms) | Emit RateLimitExceeded audit; expose limits in headers |

Observability

SDK emits

  • ecs.sdk.resolve.latency (p50/p95/p99), cache.hit/miss, lp.wakeups, ws.reconnects, etag.rotate.count.

Server dashboards

  • Propagation lag (publish → first 200 Resolve)
  • Waiter pool size & coalescing ratio
  • Redis lock wait time & lock contention
  • Long‑poll saturation (% requests returning 304 on timeout)

Acceptance Criteria (engineering hand‑off)

  • Implement waiter registry with safe cancellation; unit tests for coalescing and timeout behavior.
  • Add singleflight utility in SDKs (.NET/JS/Mobile) with per‑key granularity.
  • Redis invalidation with scoped keys and distributed lock on fill.
  • WS/SSE bridge with heartbeats, resume tokens, per‑tenant connection caps.
  • End‑to‑end tests: publish → SDK updates within <5s (Pro/Ent), <15s (Starter) under load, no stampede.

Solution Architect Notes

  • Prefer long‑poll as default mode across SDKs; enable WS behind a feature flag per tenant/edition.
  • Keep SWR≤2s to bound staleness while avoiding spikes.
  • Consider a per‑tenant, per‑path moving window to limit redundant resolves (e.g., ignore duplicate refresh events for 200 ms).
  • For mobile, use SSE when WS is blocked; keep battery impact minimal with adaptive backoff.

Provider Adapter Hub – Plug‑in Model, Contracts & Lifecycles

This section specifies the Provider Adapter Hub that bridges ECS with external configuration backends (Azure App Configuration, AWS AppConfig, Consul, Redis, SQL/CockroachDB). It defines the plug‑in model, capability contracts, adapter lifecycles, multi‑tenant isolation, and operational guarantees so Engineering Agents can implement adapters consistently and safely.


Architectural Context

flowchart LR
  subgraph ECS Core
    Hub[Adapter Hub]
    Registry[Config Registry]
    Policy[Policy Engine]
    Refresh[Refresh Orchestrator]
    EventBus[[Event Bus (CloudEvents)]]
  end

  subgraph Adapters (Out-of-Process Plugins)
    AAZ[Azure AppConfig Adapter]
    AAW[AWS AppConfig Adapter]
    ACS[Consul KV Adapter]
    ARD[Redis Adapter]
    ASQL[SQL/CockroachDB Adapter]
  end

  Hub <-- gRPC SPI --> AAZ
  Hub <-- gRPC SPI --> AAW
  Hub <-- gRPC SPI --> ACS
  Hub <-- gRPC SPI --> ARD
  Hub <-- gRPC SPI --> ASQL

  Registry <--> Hub
  Policy <--> Hub
  Refresh <--> Hub
  EventBus <--> Hub

Design stance

  • Out‑of‑process adapters run as isolated containers (per provider) and expose a gRPC Service Provider Interface (SPI) to the Hub.
  • Capability-driven: Each adapter declares supported feature flags (e.g., hierarchical keys, ETag, watch/stream, transactions).
  • Multi‑tenant aware: One adapter process may serve multiple tenants via namespaced bindings with per-tenant credentials.
  • Event-first: All changes propagate as CloudEvents, with at‑least‑once delivery and idempotency keys.

Plug‑in Packaging & Discovery

| Aspect | Requirement |
| --- | --- |
| Image Layout | OCI container, label io.connectsoft.ecs.adapter=true with adapter.id, adapter.version, capabilities annotations |
| SPI Transport | gRPC over mTLS (cluster DNS). Optional Unix domain socket for sidecar deployments. |
| Registration | Adapter posts a Manifest to Hub on startup (Register()); Hub persists it in the Registry. |
| Upgrades | Rolling upgrade supported; Hub revalidates capabilities and drains in‑flight calls. |
| Tenancy | Dynamic Bindings created per tenant/environment via Hub API; Hub passes scoped credentials to adapter using short‑lived secrets. |

Adapter Manifest (example)

apiVersion: ecs.connectsoft.io/v1alpha1
kind: AdapterManifest
metadata:
  adapterId: azure-appconfig
  version: 1.4.2
spec:
  provider: AzureAppConfig
  capabilities:
    hierarchicalKeys: true
    etagSupport: true
    watchStreaming: true
    transactions: false
    bulkList: true
    tags: ["labels","contentType"]
  inputs:
    - name: connection
      type: secretRef        # provided by Hub at Bind time
    - name: storeName
      type: string
  limits:
    maxKeySize: 1024
    maxValueSize: 1048576
    maxBatch: 500
  events:
    emits: ["ExternalChangeObserved","SyncJobCompleted"]
    consumes: ["ConfigPublished","BindingRotated"]
security:
  requiredScopes: ["kv.read","kv.write"]
  secretTypes: ["azure:clientCredentials"]

gRPC SPI (Service Provider Interface)

Proto (excerpt)

syntax = "proto3";
package ecs.adapters.v1;

import "google/protobuf/empty.proto";

message BindingRef {
  string tenant_id = 1;
  string environment = 2;     // dev|staging|prod
  string namespace = 3;       // app/service scope
  string binding_id = 4;      // unique per tenant+namespace
}

message CredentialEnvelope {
  string provider = 1;        // e.g., AzureAppConfig
  bytes  payload = 2;         // KMS-encrypted secret blob
  string kms_key_ref = 3;
  string version = 4;
  int64  not_after_unix = 5;  // lease expiry
}

message BindRequest {
  BindingRef binding = 1;
  CredentialEnvelope credential = 2;
  map<string,string> options = 3; // e.g., storeName, region, prefix
}

message Key {
  string path = 1;     // normalized ECS path: /apps/{app}/env/{env}/[...]
  string label = 2;    // provider-specific label/namespace
}

message Item {
  Key key = 1;
  bytes value = 2;     // arbitrary bytes (JSON, text, binary)
  string content_type = 3;
  string etag = 4;     // provider etag/version token
  map<string,string> meta = 5;
}

message GetRequest { BindingRef binding = 1; Key key = 2; string if_none_match = 3; }
message GetResponse { Item item = 1; bool not_modified = 2; }

message PutRequest {
  BindingRef binding = 1;
  Item item = 2;
  string if_match = 3;      // CAS on etag
  bool upsert = 4;
  string idempotency_key = 5;
}
message PutResponse { Item item = 1; }

message ListRequest { BindingRef binding = 1; string prefix = 2; int32 page_size = 3; string page_token = 4; }
message ListResponse { repeated Item items = 1; string next_page_token = 2; }

message DeleteRequest { BindingRef binding = 1; Key key = 2; string if_match = 3; string idempotency_key = 4; }
message DeleteResponse {}

message WatchRequest { BindingRef binding = 1; string prefix = 2; string resume_token = 3; }
message ChangeEvent {
  string change_id = 1;     // for idempotency
  string type = 2;          // upsert|delete
  Item item = 3;
  string resume_token = 4;  // bookmark for stream resumption
}

service ProviderAdapter {
  rpc RegisterManifest(google.protobuf.Empty) returns (AdapterInfo);
  rpc Bind(BindRequest) returns (BindingAck);
  rpc Validate(BindRequest) returns (ValidationResult);

  rpc Get(GetRequest) returns (GetResponse);
  rpc Put(PutRequest) returns (PutResponse);
  rpc Delete(DeleteRequest) returns (DeleteResponse);
  rpc List(ListRequest) returns (ListResponse);

  rpc Watch(WatchRequest) returns (stream ChangeEvent); // server-streamed
  rpc Health(google.protobuf.Empty) returns (HealthStatus);
  rpc Unbind(BindingRef) returns (google.protobuf.Empty);
}

Contract notes

  • Idempotency: idempotency_key required for Put/Delete. Adapters must persist idempotency window (e.g., 24h) to dedupe retries.
  • ETag/CAS: Use if_match/if_none_match for concurrency control where provider supports it; otherwise emulate with version vectors.
  • Pagination: mandatory stable ordering by key path; opaque page_token.

Binding & Namespacing

Each tenant/environment maps to a provider namespace:

| Provider | Namespace Strategy |
| --- | --- |
| Azure AppConfig | key = {prefix}:{app}:{env}:{path}, label = {namespace} (labels for environment or slot) |
| AWS AppConfig | Application/Profile/Environment map to ECS namespace/env; versions map to deployments |
| Consul KV | /{tenant}/{env}/{namespace}/{path} with ACL tokens per binding |
| Redis | key = ecs:{tenant}:{env}:{namespace}:{path}, optional hash for grouping; TTL not used for config |
| CockroachDB (SQL) | Table per tenant partitioned by {env, namespace} with composite PK (key, version) |

Key normalization

  • ECS canonical path: /apps/{app}/env/{env}/services/{svc}/paths/...
  • Adapters implement deterministic mapping and store reverse mapping in meta to support round‑trip and diffs.
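
A sketch of the deterministic mapping for the Azure App Configuration row in the table above; the canonical segment layout and the treatment of the service segment are assumptions for illustration:

using System;

public static class AzureAppConfigKeyMapper
{
    // ECS canonical path: /apps/{app}/env/{env}/services/{svc}/paths/{rest...}
    // Azure App Configuration target: key = {prefix}:{app}:{env}:{path}, label = {namespace}
    public static (string Key, string Label) ToProvider(string canonicalPath, string prefix, string ns)
    {
        var parts = canonicalPath.Trim('/').Split('/');
        if (parts.Length < 8 || parts[0] != "apps" || parts[2] != "env" ||
            parts[4] != "services" || parts[6] != "paths")
            throw new ArgumentException($"Unexpected canonical path: {canonicalPath}");

        var app = parts[1];
        var env = parts[3];
        var rest = string.Join("/", parts[7..]);   // everything after "paths"; the svc segment could be folded in as well
        return ($"{prefix}:{app}:{env}:{rest}", ns);
    }
}

The adapter would persist the reverse mapping in Item.meta so round-trips and diffs stay lossless, as noted above.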

Adapter Lifecycle (State Machine)

stateDiagram-v2
  [*] --> Discovered
  Discovered --> Registered: RegisterManifest()
  Registered --> Bound: Bind()+Validate()
  Bound --> Active: Health(ok)
  Active --> RotatingCreds: BindingRotated
  RotatingCreds --> Active: re-Validate()
  Active --> Draining: Unbind() requested / Upgrade
  Draining --> Unbound: no inflight
  Unbound --> [*]
  Active --> Degraded: Health(warn/fail)
  Degraded --> Active: Auto-recover/Retry
  Degraded --> Draining: Manual intervention

Lifecycle hooks

  • Validate() must perform least‑privilege checks and a dry‑run (list one key, check write if permitted).
  • Unbind() drains Watch streams and completes in-flight mutations; Hub retries if deadlines expire.

Change Propagation & Sync

Flows

  1. ECS → Provider (Publish): Registry emits ConfigPublished → Hub resolves bindings → Put() batch to adapter → on success, Hub emits SyncJobCompleted.

  2. Provider → ECS (Observe): Adapter Watch(prefix) streams ChangeEvent for external drift → Hub translates it to ExternalChangeObserved → Policy Engine validates → Registry applies or flags drift.

sequenceDiagram
  participant Registry
  participant Hub
  participant Adapter as AzureAdapter
  participant Provider as AzureAppConfig

  Registry->>Hub: ConfigPublished(tenant/app/env)
  Hub->>Adapter: Put(batch, idempotency_key)
  Adapter->>Provider: PUT keys (CAS/etag)
  Provider-->>Adapter: 200 + etags
  Adapter-->>Hub: PutResponse(items)
  Hub-->>Registry: SyncJobCompleted

Delivery semantics

  • At‑least‑once end‑to‑end with idempotency on adapter side.
  • Backpressure: Hub enforces max in‑flight per binding and exponential backoff per provider rate limits.
  • DLQ: Failed mutations or change translations publish to ecs.adapters.dlq with replay tokens.

Diffing, Snapshot, Rollback

  • List(prefix) yields provider snapshot; Hub computes 3‑way diff (desired, provider, last‑applied).
  • Rollback uses ECS Versioning to restore prior desired state; Hub re‑publishes with CAS on previous etags where possible.
  • Partial capability: If provider lacks ETag, Hub marks binding non‑concurrent, serializes writes.

Security & Secrets

| Concern | Pattern |
| --- | --- |
| Credentials | Short‑lived CredentialEnvelope issued by Hub via cloud KMS/KeyVault/Secrets Manager; rotate via BindingRotated event. |
| Network | mTLS between Hub↔Adapter; egress to provider via provider’s SDK with TLS 1.2+. |
| Least Privilege | Scoped roles per binding (read-only vs read/write). |
| Auditing | All SPI calls include traceId, tenantId, actor; adapters must log structured spans. |
| Data Privacy | No plaintext secrets in logs; PII scrubbing middleware in adapters. |

Observability & Health

Metrics (per binding)

  • adapter_requests_total{op}
  • adapter_request_duration_ms{op}
  • adapter_throttle_events_total
  • adapter_watch_gaps_total (stream gaps/restarts)
  • adapter_idempotent_dedup_hits_total
  • adapter_errors_total{type=auth|rate|transient|fatal}

Health

  • HealthStatus returns status, since, and checks[] including auth, reachability, quota.

Provider‑Specific Nuances

Azure App Configuration

  • Use label for env/namespace, contentType for metadata.
  • Leverage If‑None‑Match for conditional GET; ETag on PUT/CAS.
  • Watch via push notifications (Event Grid) optional; otherwise adapter polling with ETag bookmarks.

AWS AppConfig

  • Publishing often uses Hosted Config with Deployments; adapter wraps version promotion rather than per‑key PUT.
  • Changes are profile+environment scoped; adapter maps ECS keys to a JSON document (schema‑validated).

Consul KV

  • Native blocking queries support; adapter implements Watch using index token (long‑poll).
  • ACL tokens per binding; transactions (if enabled) can apply atomic batches.

Redis

  • Use keyspace notifications where available; otherwise use SCAN + version fields.
  • Prefer HASH records for grouped configuration with a version field to emulate ETag.

SQL/CockroachDB

  • Schema (simplified):
create table config_items (
  tenant_id    uuid not null,
  environment  text not null,
  namespace    text not null,
  path         text not null,
  version      int8 not null default 1,
  content_type text,
  value        bytea not null,
  etag         text not null,
  updated_at   timestamptz not null default now(),
  primary key (tenant_id, environment, namespace, path)
);
create index on config_items (tenant_id, environment, namespace);
  • ETag computed as sha256(value||version); changes streamed via CDC if enabled.

Error Model

| Code | Meaning | Hub Behavior |
| --- | --- | --- |
| ALREADY_EXISTS | CAS conflict (etag mismatch) | Retry with latest Get() and policy merge |
| FAILED_PRECONDITION | Invalid binding/perm | Mark binding degraded, alert Security |
| RESOURCE_EXHAUSTED | Throttled/quota | Backoff (jitter), adjust batch size |
| UNAVAILABLE | Provider outage | Circuit break per binding; queue for replay |
| DEADLINE_EXCEEDED | Timeout | Retry with exponential backoff; consider reducing page/batch |

All responses must include retry_after_ms hint where applicable.


Rate Limits & Concurrency

  • Hub defaults: 8 concurrent ops per binding, 256 per adapter process.
  • Adapter declares provider limits in Manifest; Hub tunes concurrency dynamically (token bucket).
  • Large publications use chunking with maxBatch from Manifest.

Compliance & Governance

  • Adapters must pass conformance tests:
    • CRUD contract, idempotency, ETag/CAS behavior
    • Watch stream continuity (resume token)
    • Backpressure under synthetic throttling
    • Multi‑tenant isolation (no cross‑leak with wrong credentials)
  • Supply SBOM and sign container images (Sigstore).

Example: Binding Creation (REST)

POST /v1/adapters/azure-appconfig/bindings
Content-Type: application/json
{
  "tenantId": "t-123",
  "environment": "prod",
  "namespace": "billing",
  "options": { "storeName": "appcfg-prod", "prefix": "ecs" },
  "credentialRef": "kv://secrets/azure/appcfg/billing-prod",
  "permissions": ["read","write"]
}

Hub resolves credentialRef → creates CredentialEnvelope → calls Bind() then Validate() on adapter.


Testing Matrix (Adapter Conformance)

| Area | Scenario | Expected |
| --- | --- | --- |
| Idempotency | Retry Put() with same idempotency_key | Exactly one write at provider |
| CAS | Put(if_match=stale_etag) | ALREADY_EXISTS, no mutation |
| Watch | Kill adapter, restart with resume_token | Stream resumes without loss |
| Pagination | List() with small page_size | Complete coverage, stable ordering |
| Drift | Provider-side key modified | ExternalChangeObserved fired |
| Throttle | Provider returns 429/limit | Backoff, no hot‑loop |

Solution Architect Notes

  • Sidecar vs central adapter: For high‑churn namespaces, deploy adapter sidecar to reduce latency; otherwise run shared adapters per provider with strong isolation at binding layer.
  • Schema guarantees: Where providers are document‑style (AWS AppConfig), define a document aggregator in Hub that flattens ECS keyspace into a single JSON with stable ordering and checksum for CAS.
  • Rollout safety: Pair publication with Refresh Orchestrator to gradually push SDK refresh (canary → region → all).
  • Cost & limits: Respect provider rate/quota; tune batch sizes from Manifest during runtime using feedback metrics.

This specification enables Engineering Agents to implement provider adapters with consistent contracts, safe lifecycles, and operational reliability across heterogeneous backends, while preserving ECS’s multi‑tenant isolation and event‑driven change propagation.


SDK Design (.NET / JS / Mobile) – APIs, Caching, Offline, Diagnostics, Feature Flags, Circuit Breakers

This section defines the client SDKs for ECS across .NET, JavaScript/TypeScript, and Mobile (MAUI / React Native). It translates platform-neutral behaviors (auth, tenant routing, caching, refresh, and resiliency) into implementable APIs with consistent cross-language semantics. SDKs are multi-tenant aware, edition aware, and aligned with Clean Architecture and observability-first principles.


Goals & Non-Functional Targets

| Area | Target |
| --- | --- |
| Latency | p95 GetConfig() ≤ 50ms (warm cache), ≤ 250ms (cache miss, intra-region) |
| Availability | 99.95% SDK internal availability (local cache + backoff) |
| Cache Correctness | Strong read-your-writes for same client session; eventual consistency ≤ 2s with server push/long-poll |
| Footprint | < 400KB minified JS; < 1.5MB .NET DLLs; < 800KB mobile binding (excl. deps) |
| Telemetry | OTEL spans for API calls; metrics for hit/miss, refresh latency, circuit state |
| Security | OIDC access token; least-privilege scopes; mTLS optional for enterprise |
| Backwards-Compat | SemVer with non-breaking minor releases; feature flags guarded by capability negotiation |

High-Level Architecture

classDiagram
    class EcsClient {
        +GetConfig(key, options) : ConfigValue
        +GetSection(path, options) : ConfigSection
        +Subscribe(keys|paths, handler) : SubscriptionId
        +SetContext(Tenant, Environment, App, Edition)
        +FlushCache(scope?)
        +Diagnostics() : EcsDiagnostics
    }

    class Transport {
        <<interface>>
        +RestCall()
        +GrpcCall()
        +WebsocketStream()
        +LongPoll()
    }

    class CacheLayer {
        +TryGet(key) : CacheResult
        +Put(key, value, etag, ttl)
        +Invalidate(key|prefix)
        +PromoteToPersistent()
    }

    class OfflineStore {
        +Read(key) : OfflineResult
        +Write(snapshot)
        +Enumerate()
        +GC(policy)
    }

    class Resiliency {
        +WithRetry()
        +WithCircuitBreaker()
        +WithTimeouts()
        +WithJitterBackoff()
    }

    class FeatureFlags {
        +IsEnabled(flag, context) : bool
        +Variant(flag, context) : string|number
    }

    EcsClient --> Transport
    EcsClient --> CacheLayer
    EcsClient --> OfflineStore
    EcsClient --> Resiliency
    EcsClient --> FeatureFlags

Public API Surface (Cross-Language Semantics)

Core Concepts

  • Context: { tenantId, environment, applicationId, edition }
  • Selector: key (exact) or path (hierarchical section).
  • Consistency Options: { preferCache, mustBeFreshWithin, staleOk, etag }
  • Subscription: Push updates for keys/paths via WS/long-poll.

.NET (C#)

public sealed class EcsClient : IEcsClient, IAsyncDisposable
{
    Task<ConfigValue<T>> GetConfigAsync<T>(string key, GetOptions? options = null, CancellationToken ct = default);
    Task<ConfigSection> GetSectionAsync(string path, GetOptions? options = null, CancellationToken ct = default);
    IDisposable Subscribe(ConfigSubscription request, Action<ConfigChanged> handler);
    void SetContext(EcsContext ctx);
    Task FlushCacheAsync(CacheScope scope = CacheScope.App, CancellationToken ct = default);
    EcsDiagnostics Diagnostics { get; }
}

public record GetOptions(TimeSpan? MustBeFreshWithin = null, bool PreferCache = true, bool StaleOk = false, string? ETag = null);
public record EcsContext(string TenantId, string Environment, string ApplicationId, string Edition);
public record ConfigValue<T>(T Value, string ETag, DateTimeOffset AsOf, string Source); // source: memory/persistent/network

SDK Registration

services.AddEcsClient(o =>
{
    o.BaseUrl = new Uri("https://api.ecs.connectsoft.io");
    o.Auth = AuthOptions.FromOidc(clientId: "...", authority: "...", scopes: new[] {"ecs.read"});
    o.Transport = TransportMode.GrpcWithWebsocketFallback;
    o.Cache = CacheOptions.Default with { MemoryTtl = TimeSpan.FromMinutes(5), Persistent = true };
    o.Resiliency = ResiliencyProfile.Standard;
});
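
A usage sketch against the API surface above; the C# ConfigSubscription and ConfigChanged shapes mirror the TypeScript contract later in this section and are illustrative:

// Resolve a typed value with a freshness bound, then react to pushed changes.
var ecs = serviceProvider.GetRequiredService<IEcsClient>();
ecs.SetContext(new EcsContext(TenantId: "t-123", Environment: "prod", ApplicationId: "billing", Edition: "enterprise"));

var threshold = await ecs.GetConfigAsync<int>(
    "features/paywall/threshold",
    new GetOptions(MustBeFreshWithin: TimeSpan.FromSeconds(1)));

Console.WriteLine($"{threshold.Value} (etag {threshold.ETag}, source {threshold.Source})");

// Subscription handler fires on push/poll/manual refresh; dispose to unsubscribe.
using var subscription = ecs.Subscribe(
    new ConfigSubscription { Prefix = "features/" },
    change => Console.WriteLine($"changed: {change.Key} -> {change.NewETag}"));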

JavaScript / TypeScript

import { EcsClient, createEcsClient, GetOptions } from "@connectsoft/ecs";

const ecs: EcsClient = createEcsClient({
  baseUrl: "https://api.ecs.connectsoft.io",
  auth: { oidc: { clientId: "...", authority: "...", scopes: ["ecs.read"] } },
  transport: { primary: "grpc-web", fallback: ["rest", "long-poll"] },
  cache: { memoryTtlMs: 300000, persistent: true },
  resiliency: "standard"
});

const value = await ecs.getConfig<string>("features/paywall/threshold", <GetOptions>{ mustBeFreshWithinMs: 1000 });

const unsub = ecs.subscribe({ keys: ["features/*"] }, change => {
  console.log("config changed", change);
});

Mobile (MAUI / React Native)

  • MAUI: NuGet package reuses .NET SDK; persistent store via SecureStorage + local DB.
  • React Native: NPM package uses AsyncStorage/SecureStore for persistent cache; push via WS.

Caching Strategy

Two-tier cache with ETag-aware coherency and stampede protection.

| Layer | Backing | Default TTL | Notes |
|---|---|---|---|
| Memory | Concurrent in-proc map with version pins | 5m | Fast path; per-key locks prevent thundering herd |
| Persistent | IndexedDB (Web) / AsyncStorage (RN) / LiteDB/SQLite (MAUI) | 24h (stale reads allowed) | Used when offline; writes are snapshots (configSet, etag, asOf) |

Get Flow (with ETag/Stampede Guard)

sequenceDiagram
    participant App
    participant SDK as EcsClient
    participant Cache as Memory/Persistent
    participant Svc as ECS API

    App->>SDK: GetConfig(key, opts)
    SDK->>Cache: TryGet(key)
    alt Hit & Fresh
        Cache-->>SDK: value, etag
        SDK-->>App: value (source=memory)
    else Miss or Stale
        SDK->>SDK: Acquire per-key lock
        SDK->>Svc: GET /configs/{key} If-None-Match: etag
        alt 304 Not Modified
            Svc-->>SDK: 304
            SDK->>Cache: Touch(key)
            SDK-->>App: value (source=cache)
        else 200 OK
            Svc-->>SDK: value, etag
            SDK->>Cache: Put(key, value, etag, ttl)
            SDK-->>App: value (source=network)
        end
        SDK->>SDK: Release lock
    end
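
The miss path above hinges on per-key request coalescing and ETag revalidation. A minimal TypeScript sketch of that single-flight guard, assuming a plain fetch transport; the cache shapes are illustrative, not the SDK's internal types:

// Single-flight + ETag revalidation sketch (illustrative, not the SDK source).
type CacheEntry = { value: unknown; etag: string; fetchedAt: number };

const cache = new Map<string, CacheEntry>();
const inflight = new Map<string, Promise<CacheEntry>>(); // per-key lock / request coalescing

async function getConfig(key: string, maxAgeMs = 300_000): Promise<unknown> {
  const hit = cache.get(key);
  if (hit && Date.now() - hit.fetchedAt < maxAgeMs) return hit.value; // fresh memory hit

  // Coalesce concurrent misses for the same key into a single network call.
  let pending = inflight.get(key);
  if (!pending) {
    pending = revalidate(key, hit).finally(() => inflight.delete(key));
    inflight.set(key, pending);
  }
  return (await pending).value;
}

async function revalidate(key: string, stale?: CacheEntry): Promise<CacheEntry> {
  const res = await fetch(`https://api.ecs.connectsoft.io/configs/${encodeURIComponent(key)}`, {
    headers: stale ? { "If-None-Match": stale.etag } : {},
  });
  if (res.status === 304 && stale) {
    const touched = { ...stale, fetchedAt: Date.now() }; // content unchanged; refresh age only
    cache.set(key, touched);
    return touched;
  }
  const entry: CacheEntry = {
    value: await res.json(),
    etag: res.headers.get("ETag") ?? "",
    fetchedAt: Date.now(),
  };
  cache.set(key, entry);
  return entry;
}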

Options

  • PreferCache: return cached immediately; refresh in background.
  • MustBeFreshWithin: ensure data age ≤ threshold; else block for refresh.
  • StaleOk: allow stale if network unavailable (offline mode).

Offline Mode

  • Detection: transport health + browser navigator.onLine/platform reachability signal.
  • Read: when offline, return the most recent snapshot from the persistent store if StaleOk is set or the MustBeFreshWithin requirement cannot be met.
  • Write: configuration is authored server-side via Studio; runtime SDKs are read-optimized, and write APIs are guarded and disabled by default.
  • Reconciliation: on reconnect, diff by ETag and update persistent store; emit ConfigChanged events for subscribers.
flowchart LR
    Offline[[Offline]] --> ReadFromPersistent
    Reconnect[[Reconnect]] --> SyncETags --> UpdateMemory --> NotifySubscribers

Refresh & Invalidation

  • Push: WS stream ConfigChanged (per tenant/app/keys). SDK updates memory and persistent stores; raises handlers on the UI thread (mobile) / event loop (JS).
  • Long-Poll: Backoff schedule with jitter; ETag aggregation checkpoint to minimize payload.
  • Manual: FlushCache(scope) clears memory and persistent entries by key, prefix, or app.

Subscriber Contract

type ConfigSubscription = { keys?: string[], paths?: string[], prefix?: string };
type ConfigChanged = { key: string, newETag: string, reason: "push"|"poll"|"manual" };

Diagnostics & Telemetry

OpenTelemetry built-in, disabled only via explicit opt-out.

| Signal | Name | Dimensions |
|---|---|---|
| Trace span | ecs.sdk.get_config | tenant, app, key, cache_hit(bool), etag_present(bool), attempt |
| Metric (counter) | ecs_cache_hits_total | layer(memory/persistent), tenant, app |
| Metric (histogram) | ecs_refresh_latency_ms | transport(rest/grpc/ws), outcome |
| Metric (gauge) | ecs_circuit_state | service, state |
| Log | ecs.sdk.error | exception, httpStatus, circuitState, retryAttempt |

Redaction: keys logged with hashing; values never logged unless Diagnostics().EnableValueSampling() is set in dev only.

User-Agent: ConnectSoft-ECS-SDK/{lang}/{version} ({os};{runtime})


Feature Flags API

Integrated lightweight evaluation with remote rules (from ECS) + local fallbacks.

public interface IEcsFlags
{
    bool IsEnabled(string flag, FlagContext? ctx = null);
    T Variant<T>(string flag, FlagContext? ctx = null, T defaultValue = default!);
}

public record FlagContext(string? UserId = null, string? Region = null, IDictionary<string,string>? Traits = null);

  • Data Source: features/* namespace within ECS.
  • Evaluation: client-side deterministic hashing (sticky bucketing), server-side override precedence.
  • Safety: if no rules available, fail-closed for IsEnabled (configurable).
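
A minimal TypeScript sketch of deterministic sticky bucketing for a percentage rollout; the rolloutPercent rule shape is illustrative, not the actual ECS rule format:

import { createHash } from "node:crypto";

// Hash (flag, userId) into a stable bucket in [0, 100); identical inputs always land in the same bucket.
function bucketFor(flag: string, userId: string): number {
  const digest = createHash("sha256").update(`${flag}:${userId}`).digest();
  return digest.readUInt32BE(0) % 100;
}

// Fail-closed when no rule is available, mirroring the default described above (configurable).
function isEnabled(flag: string, userId: string, rolloutPercent?: number): boolean {
  if (rolloutPercent === undefined) return false;
  return bucketFor(flag, userId) < rolloutPercent;
}

// isEnabled("ui:newOnboarding", "u-9", 25) → enabled for ~25% of users, stable per user.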

Resiliency Patterns

Defaults provided by ResiliencyProfile.Standard (overridable per method).

| Concern | Default |
|---|---|
| Timeouts | connect: 2s, read: 1.5s, overall: 4s |
| Retries | 3 attempts, exponential backoff (100–800 ms) + jitter |
| Circuit Breaker | consecutive failures ≥ 5 or failure rate ≥ 50% over 20 calls → Open 30s; Half-Open probe: 10% |
| Bulkhead | Max 64 concurrent in-flight network calls per process |
| Fallback | Serve stale if available; else throw EcsUnavailableException |
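
A minimal sketch of the retry default above (3 attempts, exponential backoff within 100–800 ms plus full jitter); the shipped SDK policy may differ in detail:

async function withRetry<T>(op: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Backoff doubles per attempt (100 ms, 200 ms, 400 ms), capped at 800 ms, with full jitter.
      const cap = Math.min(100 * 2 ** attempt, 800);
      await new Promise((resolve) => setTimeout(resolve, Math.random() * cap));
    }
  }
  throw lastError; // caller falls back to stale cache or surfaces EcsUnavailableException
}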

Circuit State Events: handlers may subscribe to OnCircuitChange(service, state, reason) for operational visibility.


Tenant & Edition Routing

  • Every request carries Context headers: x-ecs-tenant, x-ecs-environment, x-ecs-application, x-ecs-edition.
  • Edition shaping: SDK hides calls to non-entitled endpoints (pre-check via capability discovery at init).
  • Partition-Aware: Affinity to nearest region discovered via bootstrap (DNS SRV / discovery endpoint); persisted for session.

Configuration & Bootstrapping

| Setting | .NET Key | JS Key | Default |
|---|---|---|---|
| Base URL | Ecs:BaseUrl | baseUrl | required |
| Transport | Ecs:Transport | transport.primary | grpc (.NET), grpc-web (web) |
| Auth | Ecs:Auth:* | auth | OIDC |
| Memory TTL | Ecs:Cache:MemoryTtlSeconds | cache.memoryTtlMs | 300 |
| Persistent | Ecs:Cache:Persistent | cache.persistent | true |
| Push | Ecs:Refresh:PushEnabled | refresh.push | true |
| Long-Poll | Ecs:Refresh:PollIntervalMs | refresh.pollIntervalMs | 30000 |
| Resiliency | Ecs:Resiliency:Profile | resiliency | standard |

Error Model (Surface to Callers)

  • EcsAuthException (401/403)
  • EcsNotFoundException (404 key/path)
  • EcsConcurrencyException (ETag precondition failed)
  • EcsUnavailableException (circuit open / transport down; Inner includes last cause)
  • EcsTimeoutException
  • EcsValidationException (bad selector/context)

Each includes: traceId, requestId, correlationId, retryAfter (if applicable).


Packaging, Versioning, Governance

| Lang | Package | Min Runtime | SemVer |
|---|---|---|---|
| .NET | ConnectSoft.Ecs | .NET 8 | MAJOR.MINOR.PATCH |
| JS/TS | @connectsoft/ecs | ES2019+ | Same |
| RN | @connectsoft/ecs-react-native | RN 0.74+ | Same |
| MAUI | Uses .NET package | .NET 8 MAUI | Same |
  • Capability Negotiation at init: server advertises supported transports and features → SDK enables/guards accordingly.
  • Deprecations: compile-time [Obsolete] (.NET) / JSDoc tags (JS) + runtime warning with link to migration docs.

Security Considerations

  • Tokens stored in OS-backed secure stores (Keychain/DPAPI/Keystore).
  • PII-safe telemetry; opt-in only for value sampling.
  • mTLS support via client cert injection (.NET/MAUI only, enterprise edition).
  • Scoped Claims enforced in gateway; SDK only forwards tokens.

Testability & Observability Hooks

  • Deterministic clocks and pluggable time providers for cache TTL tests.
  • Transport shim interface for mocking HTTP/gRPC.
  • In-memory store to simulate offline for E2E.
  • DiagnosticListener (.NET) / ecs.on(event, handler) (JS) to observe lifecycle events.

Example Usage Patterns

Config with Freshness Guard & Fallback

var config = await ecs.GetConfigAsync<int>("limits/maxSessions",
    new GetOptions(MustBeFreshWithin: TimeSpan.FromSeconds(1), PreferCache: true, StaleOk: true), ct);

Feature Flag Toggle in UI

if (ecs.flags.isEnabled("ui:newOnboarding", { userId })) {
  renderNewFlow();
} else {
  renderClassic();
}

Subscription to Section Changes

using var sub = ecs.Subscribe(
    new ConfigSubscription { Prefix = "features/" },
    change => logger.LogInformation("Changed: {Key} {ETag}", change.Key, change.NewETag));

Solution Architect Notes

  • Docs to add: per-language quickstarts, migration guide for transport changes, capability negotiation matrix by edition.
  • Optional extension: pluggable policy evaluators in SDK to locally enforce tenant policy hints (read-only apps, canary % caps).
  • Performance harness: include standard bench suite for cache hit/miss paths using recorded timelines from staging.

Config Studio UI — IA, Screens, Workflows, Policy Editor, Guardrails, Approvals, Audit Trails

Objectives

Deliver an admin-first Studio to author, validate, approve and publish tenant configuration safely. The Studio is a single-page app (SPA) backed by the ECS Gateway/YARP (BFF), enforcing tenant/edition/env scopes and segregation of duties with full auditability.


Information Architecture (IA)

flowchart LR
  root[Studio Home]
  Ten[Tenant Switcher]
  Env[Environment Switcher]
  Dash[Dashboard]
  CS[Config Sets]
  Items[Items]
  Snaps[Snapshots & Tags]
  Dep[Deployments]
  Pol[Policy Editor]
  App[Approvals]
  Aud[Audit & Activity]
  Ops[Operations: Backups/Imports]
  Acc[Access & Roles]

  root-->Dash
  root-->CS-->Items
  CS-->Snaps
  Snaps-->Dep
  root-->Pol
  root-->App
  root-->Aud
  root-->Ops
  root-->Acc
  root-->Ten
  root-->Env

Global chrome

  • Tenant switcher (only tenants user is entitled to).
  • Environment switcher (dev/test/staging/prod).
  • Edition badge (Free/Pro/Enterprise) with effective policy overlays.
  • Context pills: tenant / application / namespace / env.

Role-Based UX & Guarded Actions

| Role | Primary Views | Write Capabilities | Guardrails |
|---|---|---|---|
| Viewer | Dashboard, Config Sets, Snapshots, Audit | None | Cannot reveal secret values; masked view only |
| Developer | + Items, Diff, Local Preview | Draft edits, save, request approval | Schema validation must pass; breaking changes require approval |
| Approver | Approvals, Policy summaries | Approve/Reject, schedule | SoD: cannot approve own changes; change window enforcement |
| Tenant Admin | + Policies, Access, Ops | Tagging, alias pins, rollback, import/export | Two-person rule for prod; high-risk actions gated |
| Platform Admin | Cross-tenant operations | Break-glass overrides | Time-boxed, audited with ticket reference |

Key Workflows

Draft → Validate → Approve → Publish → Deploy → Refresh

sequenceDiagram
  participant Dev as Developer
  participant Studio as Studio UI
  participant API as ECS API (BFF)
  participant Policy as Policy Engine
  participant Approver as Approver
  participant Registry as Config Registry
  participant Orchestrator as Refresh Orchestrator

  Dev->>Studio: Edit draft (items)
  Studio->>API: PUT /config-sets/{id}/items (draft)
  Dev->>Studio: Validate
  Studio->>Policy: POST /policies/validate (draft+context)
  Policy-->>Studio: Result (violations, riskScore)
  Dev->>Studio: Request approval
  Studio->>API: POST /approvals (changeSet, riskScore)
  Approver->>Studio: Review diff, policy summary
  Approver->>API: POST /approvals/{req}:approve (window, notes)
  API->>Registry: POST /config-sets/{id}/snapshots (Idempotency-Key)
  Registry-->>Studio: Snapshot created (etag, version)
  API->>Registry: POST /deployments (env, canary?)
  Registry->>Orchestrator: ConfigPublished
  Orchestrator-->>Studio: Deployment status & refresh signal

Decision points

  • Risk scoring (see Guardrails) drives required approvals count.
  • Change windows block immediate publish to prod if outside permitted window.

Screens & Routes

| Screen | Purpose | Route | Key Backend Calls |
|---|---|---|---|
| Dashboard | Tenant health, recent changes, pending approvals | /dashboard | GET /config-sets?limit=5, GET /approvals?status=pending |
| Config Sets | List/search sets; create | /config-sets | GET /config-sets, POST /config-sets |
| Items Editor | Edit key/values with schema hints | /config-sets/:id/items | GET /config-sets/{id}/items, PUT /items/{key}, POST /items:batch |
| Diff & Validation | Compare working vs snapshot; policy violations | /config-sets/:id/diff | POST /diff, POST /policies/validate |
| Snapshots & Tags | View versions, tag SemVer, manage aliases | /config-sets/:id/snapshots | GET /snapshots, POST /snapshots, PUT /tags, PUT /aliases/* |
| Deployments | Create/monitor deployments; canary waves | /deployments | POST /deployments, GET /deployments |
| Approvals Inbox | Review requests; approve/reject with notes | /approvals | GET /approvals, POST /approvals/{id}:approve\|reject |
| Policy Editor | Edit rules, schemas, edition overlays | /policies | GET/PUT /policies, GET/PUT /schemas |
| Audit & Activity | Tamper-evident timeline, filters, export | /audit | GET /audit?filter=..., export |
| Operations | Import/export, backups, snapshots | /ops | POST /snapshots/export, POST /import |
| Access & Roles | Assign roles; view effective permissions | /access | GET/PUT /roles, GET /whoami |

Items Editor — UX Details

  • Tree + Table hybrid: hierarchical left nav (paths), key table with inline type detection.
  • Content types: JSON (editor with schema-aware IntelliSense), YAML, Text.
  • Validation overlay: inline markers (red=blocking, amber=warning).
  • Secrets: write-only fields using secret references (kvref://…), never show plaintext.
  • ETag awareness: editor surfaces server ETag; PATCH requires If-Match.

Policy Editor

Modes

  1. Visual Builder: conditions, targets, effects (allow/deny/transform).
  2. Code View: DSL/JSON with schema, auto-format & lint.
  3. Test Bench: input context (tenant, env, labels) → evaluate → see result and trace.

Artifacts

  • JSON Schema per namespace/path; used for editor IntelliSense & validation.
  • Edition Overlays: Pro/Enterprise feature gates, default limits, quotas.
  • Change Windows: cron-like rules per environment.

Sample Policy (DSL excerpt)

rule: deny-breaking-changes-in-prod
when:
  env: "prod"
  change.severity: "breaking"
then:
  deny: true
  message: "Breaking changes require CAB approval"

Guardrails & Risk Scoring

| Guardrail | Description | Severity | Enforcement |
|---|---|---|---|
| Schema violations | JSON Schema errors | Blocking | Must fix before approval |
| Breaking change | Remove/rename keys | High | Requires 2 approvals + change window |
| Secrets handling | Plaintext secret detected | High | Block; enforce kvref usage |
| Blast radius | Affects >N services | Medium/High | Extra approver or canary required |
| Change budget | Too many changes per window | Medium | Auto-suggest canary |
| Out-of-window publish | Not in allowed window | Medium | Schedule or request override |

Risk Score = Σ(weight × finding)

Approval policy:

  • Score < 20 → 1 approver (owner/lead)
  • 20–49 → 2 approvers (different role/team)
  • ≥ 50 → CAB (admin + approver; SoD enforced)

Approvals Model

  • SoD: approver cannot be the author; cannot share same group if risk ≥ 20.
  • Delegations: time-boxed with reason & ticket link.
  • Scheduling: approval can schedule publish/deploy in a future change window.
  • Evidence: approval captures diff, policy results, tests, and risk score snapshot.

Approval View

  • Left: Diff summary grouped by severity.
  • Right: Policy findings, checks passed/failed, required approvers checklist.
  • Footer: Approve/Reject with notes; Require Canary toggle.

Audit Trails

Tamper-evident timeline

  • Events hash-chained; show chain integrity indicator.
  • Filters: actor, resource, action, risk, status, env, time window.
  • Drill-down to raw CloudEvent and outbox record.
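
A minimal sketch of the integrity check behind the chain indicator, assuming each event stores the previous event's hash (field names are illustrative):

import { createHash } from "node:crypto";

type AuditEvent = { id: string; payload: string; prevHash: string; hash: string };

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Recompute every link; editing or removing any event breaks all hashes after it.
function verifyChain(events: AuditEvent[]): boolean {
  let prev = "GENESIS";
  for (const e of events) {
    if (e.prevHash !== prev) return false;
    if (e.hash !== sha256(`${e.prevHash}|${e.id}|${e.payload}`)) return false;
    prev = e.hash;
  }
  return true;
}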

Exports

  • CSV/Parquet; signed export manifest with hash of contents.
  • Redaction presets (no tenant PII, no secret refs).

Typical events

  • DraftEdited, ValidationRun, ApprovalRequested/Granted/Rejected, SnapshotCreated, DeploymentStarted/Completed, RollbackInitiated, PolicyUpdated, AccessChanged.

Accessibility, Usability & Internationalization

  • WCAG 2.1 AA: keyboard-first navigation, visible focus, ARIA roles.
  • i18n: ICU message format; RTL support; date/number locale aware.
  • Dark mode with contrast ≥ 7:1 for critical toasts/errors.

UI Architecture & Runtime

flowchart LR
  SPA[Config Studio SPA]
  BFF[YARP / Studio BFF]
  API[Registry/Policy APIs]
  BUS((Event Bus))
  WS[WS Bridge]

  SPA--REST/gRPC-->BFF
  BFF--mTLS-->API
  SPA--WS/SSE-->WS
  API--CloudEvents-->BUS
  WS--subscribes-->BUS
  • State management: RTK Query (web) / Recoil; optimistic updates for drafts.
  • Transport: gRPC-web for reads; REST for writes with Idempotency-Key.
  • Security: OIDC PKCE; refresh tokens stored in secure storage; CSP locked; SRI for static assets; iframe embedding denied.

Error Handling & Toaster Patterns

  • Blocking errors bubble to Problem Details panel with traceId.
  • Non-blocking validation issues remain inline with links to policy docs.
  • Retry affordances for transient (429/503) with backoff hint.

Telemetry & Diagnostics

  • User actions → ecs.studio.action (create-draft, request-approval, approve, publish).
  • Latency → editor validate, diff compute, publish end-to-end.
  • UX health → WS reconnects, long-poll timeouts, save conflicts (412s).
  • Session correlation → x-correlation-id propagated to API and CloudEvents.

Performance Targets

  • Diff compute ≤ 300 ms for 5k-key sets (WebAssembly json-patch optional).
  • Validation roundtrip ≤ 500 ms p95.
  • Page initial load < 2.5 s on 3G fast, TTI < 4 s.

Acceptance Criteria (engineering hand-off)

  • Routes and components scaffolded with guards per role.
  • Items editor with schema-aware linting, secret reference input, ETag display.
  • Policy editor (visual + code) + test bench with recorded contexts.
  • Approvals workflow with SoD, risk scoring, change windows, and scheduling.
  • Audit timeline with hash-chain indicator and export.
  • End-to-end tests for: validate→approve→publish→deploy; rollback with approvals; out-of-window scheduling.

Solution Architect Notes

  • Prefer server-driven UI metadata (schemas, overlays, limits) to keep Studio thin and edition-aware without redeploys.
  • Introduce “Safe Preview” mode: materialize resolved config for a sandbox service to validate integration before publish.
  • Consider conflict-free replicated drafts (CRDT) if real-time multi-editor collaboration becomes a requirement.

Policy & Governance — schema validation, rules engine (edition/tenant/env), approvals, SoD, change windows

Objectives

Define the authoritative policy system that governs how configuration is authored, validated, approved, and published across tenants, editions, and environments. Provide implementable contracts for schema validation, rules evaluation, approvals/SOD, and change windows, with deterministic enforcement points and full auditability.


Architecture Overview

flowchart LR
  Studio[Config Studio UI]
  GW[API Gateway / Envoy]
  Policy["Policy Engine (PDP)"]
  Registry[Config Registry]
  Approvals[Approvals Service]
  Bus[(Event Bus)]

  Studio -- validate/diff --> Policy
  GW -- ext_authz --> Policy
  Registry -- pre-publish validate --> Policy
  Policy -- obligations --> GW
  Policy -- CloudEvents --> Bus
  Studio --> Approvals
  Approvals --> Registry

Principles

  • Policy-as-code with signed, versioned bundles.
  • Deterministic evaluation order and idempotent outcomes.
  • Zero-trust: every sensitive operation checked at the edge or server via PDP.
  • Observability-first: decisions are explainable (inputs, rules, obligations).

Policy Artifacts

| Artifact | Purpose | Format | Storage |
|---|---|---|---|
| JSON Schema | Structural validation of config items/sets | JSON Schema 2020-12 | Registry (per namespace), cached |
| Rule bundles | Declarative allow/deny/obligate decisions | YAML/JSON (ECS DSL) or Rego (OPA mode) | Policy repo → Policy Engine |
| Edition Overlays | Feature gates, limits per plan | YAML/JSON | Registry/Policy |
| Approval Policies | Required approvers, SoD, thresholds | YAML | Policy Engine |
| Change Windows | Allowed publish/deploy windows | cron + TZ (IANA) | Policy Engine |
| Risk Scoring | Weighted findings → required gates | YAML | Policy Engine |

The engine supports native ECS DSL (recommended) and OPA/Rego as a compatibility mode. Only one mode enabled per deployment.


Evaluation Model

Inputs (decision context)

  • tenantId, editionId, environment, appId, namespace
  • actor { sub, roles[], groups[], mfa, isBreakGlass }
  • operation (e.g., config.publish, config.modify, policy.update)
  • change (diff summary: breaking/additive/neutral; counts)
  • metadata (labels, tags, size, affected services)
  • time (UTC instant + tenant-local TZ)

Order

  1. Schema Validation (blocking errors)
  2. Policy Rules (allow/deny + obligations)
  3. Edition Overlays (transform/limits)
  4. Approvals & SoD (may issue obligation: requires_approvals=n)
  5. Change Windows (may convert allow → scheduled)
  6. Rate & Quota Checks (edition quotas; ties to Gateway)

Outcomes

  • ALLOW | DENY | ALLOW_WITH_OBLIGATIONS
  • Obligations: { approvals: {required: n, roles: [...]}, scheduleAfter: <ISO8601>, requireCanary: true, maxBlastRadius: N }
  • Explanation: list of matched rules & reasons.

Contracts — Policy Engine (PDP)

REST (gRPC analogs provided)

  • POST /v1/validate → { violations[] } (JSON Schema + custom keywords)
  • POST /v1/decide → { effect, obligations, explanation[] }
  • POST /v1/risk → { score, findings[] }
  • GET /v1/bundles/:id → policy artifact materialization (signed)

Decision request (excerpt)

{
  "tenantId": "t-123",
  "editionId": "pro",
  "environment": "prod",
  "operation": "config.publish",
  "actor": { "sub": "u-9", "roles": ["developer"], "groups": ["team-billing"], "mfa": true },
  "change": { "breaking": 1, "additive": 12, "neutral": 5, "affectedServices": 4 },
  "time": "2025-08-25T10:00:00Z",
  "tz": "Europe/Berlin",
  "metadata": { "labels": ["payments","blue"] }
}

Decision response (excerpt)

{
  "effect": "ALLOW_WITH_OBLIGATIONS",
  "obligations": {
    "approvals": { "required": 2, "roles": ["approver","tenant-admin"], "sod": true },
    "requireCanary": true,
    "scheduleAfter": null
  },
  "explanation": [
    "rule:deny-breaking-without-2-approvals matched",
    "rule:prod-requires-change-window matched (window active)"
  ]
}

Schema Validation

  • JSON Schema 2020-12 per namespace/path with $ref composition.
  • Custom keywords:
    • x-ecs-secretRef: true → value must be kvref://...
    • x-ecs-maxBlastRadius: <int> → used in risk scoring
    • x-ecs-deprecated: "message" → warning surface only
  • Where enforced: Studio save, CI import, Registry publish, Adapter sync.
  • Performance target: ≤ 200 ms p95 for 5k-key set validation.

Example (snippet)

{
  "$id": "https://schemas.connectsoft.io/ecs/db.json",
  "type": "object",
  "properties": {
    "connectionString": { "type": "string", "x-ecs-secretRef": true },
    "poolSize": { "type": "integer", "minimum": 1, "maximum": 200 }
  },
  "required": ["connectionString"]
}
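
A sketch of enforcing the x-ecs-secretRef keyword with a standard validator (Ajv, 2020-12 dialect); the Registry's actual validator may differ:

import Ajv2020 from "ajv/dist/2020";

const ajv = new Ajv2020({ strict: false });

// Fields flagged x-ecs-secretRef must contain kvref:// references, never plaintext secrets.
ajv.addKeyword({
  keyword: "x-ecs-secretRef",
  type: "string",
  validate: (schemaValue: boolean, data: string) => !schemaValue || data.startsWith("kvref://"),
});

const validate = ajv.compile({
  type: "object",
  properties: {
    connectionString: { type: "string", "x-ecs-secretRef": true },
    poolSize: { type: "integer", minimum: 1, maximum: 200 },
  },
  required: ["connectionString"],
});

console.log(validate({ connectionString: "kvref://vault/billing/db-conn#v3" })); // true
console.log(validate({ connectionString: "Server=db;Password=..." }));           // false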

Rules Engine (ECS DSL)

Structure

apiVersion: ecs.policy/v1
kind: RuleSet
metadata: { name: tenant-prod-guardrails, tenantId: t-123 }
spec:
  rules:
    - name: deny-plaintext-secrets
      when: op == "config.publish" && env == "prod" && any(change.paths, . endsWith "connectionString" && !isSecretRef(.value))
      effect: DENY
      message: "Secrets must be kvref:// references"

    - name: breaking-needs-2-approvals
      when: op == "config.publish" && env == "prod" && change.breaking > 0
      effect: ALLOW_WITH_OBLIGATIONS
      obligations:
        approvals.required: 2
        approvals.roles: ["approver","tenant-admin"]
        approvals.sod: true
        requireCanary: true

    - name: pro-plan-limits
      when: edition == "pro"
      effect: ALLOW_WITH_OBLIGATIONS
      obligations:
        quotas.maxConfigKeys: 10000
        quotas.maxRpsRead: 600

    - name: prod-change-window
      when: op in ["config.publish","deploy.start"] && env == "prod" && !withinWindow("sat-22:00..sun-04:00", tz)
      effect: ALLOW_WITH_OBLIGATIONS
      obligations:
        scheduleAfter: nextWindow("sat-22:00..sun-04:00", tz)

Operators

  • Logical: && || !
  • Comparators: == != > >= < <= in
  • Helpers: any()/all(), withinWindow(), nextWindow(), isSecretRef()
  • Context vars: op, env, edition, tenantId, labels[], change.*

Edition/Tenant/Environment Overlays

Precedence global < edition < tenant < environment < namespace

  • Edition gates: feature toggles, quotas (RPS/keys/TTL), security posture (mTLS required).
  • Tenant customizations: additional controls (SoD strictness, approver roles).
  • Environment overrides: windows, rollout strategies.
  • Namespace: schema variants and local guardrails.

Bundles compiled into a single effective ruleset per (tenant, environment) at evaluation time; cached by PDP.


Approvals & Segregation of Duties (SoD)

Model

  • Approval objects created when obligations require it.
  • Approver pools resolved from roles/groups; SoD enforces author ≠ approver; if riskScore ≥ threshold, approvers must come from distinct groups.
  • Escalation: time-boxed delegation possible (delegatedTo, expiresAt, ticket).
  • Evidence: risk report, diff patch, validation output embedded.

Approval Matrix (default)

| Condition | Required Approvals | SoD | Notes |
|---|---|---|---|
| Non-prod, non-breaking | 0 | — | Immediate publish |
| Prod, additive only | 1 | — | Approver role |
| Prod, breaking | 2 | ✓✓ | Approver + Tenant Admin; canary required |
| High blast radius (> N services) | 2 | ✓✓ | Or schedule inside window |
| Emergency (break-glass) | 1 (Platform Admin) | — | Auto-create post-incident review task |

Change Windows

  • Defined per tenant/environment using CRON+TZ or range expressions.
  • PDP helper withinWindow(window, tz) respects IANA time zones and DST.
  • Enforcement: outside window, decisions add scheduleAfter obligation; Platform Admin can override with isBreakGlass=true (audited).

Examples

windows:
  prod:
    - name: weekend-window
      cron: "0 22 * * SAT"    # start
      duration: "6h"
      tz: "Europe/Berlin"
    - name: freeze
      range: "2025-12-20T00:00..2026-01-05T23:59"
      allow: false
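
A simplified TypeScript sketch of the withinWindow helper for range expressions such as "sat-22:00..sun-04:00", using Intl for IANA time zones; the production evaluator also covers cron forms and DST edge cases:

const DAYS = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"];

// Convert "sat-22:00" into minutes since Sunday 00:00 of the week.
function weekMinutes(expr: string): number {
  const [day, time] = expr.split("-");
  const [h, m] = time.split(":").map(Number);
  return DAYS.indexOf(day.toLowerCase()) * 1440 + h * 60 + m;
}

// Current week-minute in the given IANA time zone.
function nowWeekMinutes(tz: string, at: Date): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone: tz, weekday: "short", hour: "2-digit", minute: "2-digit", hourCycle: "h23",
  }).formatToParts(at);
  const get = (type: string) => parts.find((p) => p.type === type)?.value ?? "";
  const day = DAYS.indexOf(get("weekday").toLowerCase().slice(0, 3));
  return day * 1440 + Number(get("hour")) * 60 + Number(get("minute"));
}

// True if "at" falls inside a window like "sat-22:00..sun-04:00" (wrap across the week supported).
function withinWindow(window: string, tz: string, at = new Date()): boolean {
  const [startExpr, endExpr] = window.split("..");
  const start = weekMinutes(startExpr);
  const end = weekMinutes(endExpr);
  const now = nowWeekMinutes(tz, at);
  return start <= end ? now >= start && now < end : now >= start || now < end;
}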

Risk Scoring

Findings → score (weights configurable)

  • Schema violations: blocking (no score)
  • Breaking changes: +30
  • Secrets change count: +5 each (cap 30)
  • Blast radius (services affected): +1 each (cap 20)
  • Out-of-window: +10
  • Author role (developer vs admin): +0/−5 (experience credit)
  • Test coverage signal (from CI): −10 if integration tests pass

Thresholds

  • <20 → no approval
  • 20–49 → 1 approval
  • ≥50 → 2 approvals + canary
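
A minimal sketch of the scoring and gating logic using the default weights above (weights and caps are configurable; breaking changes are treated here as a flat +30):

type Findings = {
  breakingChanges: number;
  secretChanges: number;
  affectedServices: number;
  outOfWindow: boolean;
  authorIsAdmin: boolean;
  integrationTestsPassed: boolean;
};

function riskScore(f: Findings): number {
  let score = 0;
  score += f.breakingChanges > 0 ? 30 : 0;        // breaking changes
  score += Math.min(f.secretChanges * 5, 30);     // secrets changed, capped at 30
  score += Math.min(f.affectedServices, 20);      // blast radius, capped at 20
  score += f.outOfWindow ? 10 : 0;                // out-of-window publish
  score += f.authorIsAdmin ? -5 : 0;              // experience credit
  score += f.integrationTestsPassed ? -10 : 0;    // CI signal
  return Math.max(score, 0);
}

function requiredGates(score: number): { approvals: number; canary: boolean } {
  if (score >= 50) return { approvals: 2, canary: true };
  if (score >= 20) return { approvals: 1, canary: false };
  return { approvals: 0, canary: false };
}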

Enforcement Points

| Point | What is enforced | How |
|---|---|---|
| Gateway (ext_authz) | AuthZ decision for admin APIs; edition quotas | POST /v1/decide with operation mapping (config.modify, config.delete) |
| Registry (pre-publish) | Schema + rules for publish/rollback | validate + decide with operation=config.publish |
| Studio (UX) | Early validation, approver hints | validate + risk for live feedback |
| Approvals Service | Satisfy obligations | Enforce SoD, windows, delegations |
| Adapter Hub | Provider sync guardrails | decide(operation=adapter.sync) for tenants with extra controls |

Storage & Versioning

  • PolicyBundle(id, version, hash, signedBy, createdAt, scope={tenant|global})
  • EffectivePolicy(tenantId, env, bundleId, compiledHash, etag) — hot cache in PDP
  • Audit every decision, including inputs, effect, obligations, explanation, duration (ms)
  • Signing with Sigstore or KMS-backed JWS; PDP rejects unsigned or stale bundles.

Observability & Audit

  • Metrics: pdp.decisions_total{effect}, pdp.duration_ms (hist), pdp.cache_hit_ratio, approvals.pending, windows.scheduled_ops
  • Logs: structured with decisionId, tenantId, operation, rulesMatched[]
  • CloudEvents:
    • ecs.policy.v1.PolicyUpdated
    • ecs.policy.v1.DecisionRendered (optional sampling)
    • ecs.policy.v1.ApprovalRequired/Granted/Rejected
    • ecs.policy.v1.WindowScheduleCreated

Performance & Availability Targets

  • PDP decision p95 < 5 ms (cached), p99 < 20 ms (cold compile)
  • Validate+decide in publish flow < 150 ms p95 for 5k-key diff
  • PDP HA: 3 replicas/region, sticky by (tenantId, env); warm compiled cache on deploy
  • Cache TTL: bundle cache 60s; revalidate on PolicyUpdated event

Test Matrix (conformance)

| Area | Scenario | Expectation |
|---|---|---|
| Schema | invalid secret ref | validate fails, code SCHEMA.SECRET_REF |
| Rules | breaking change in prod | ALLOW_WITH_OBLIGATIONS, approvals=2, canary=true |
| Windows | publish outside window | obligation scheduleAfter != null |
| SoD | author approves own change | rejection SOD.VIOLATION |
| Edition | pro tier quotas exceeded | DENY, code QUOTA.EXCEEDED |
| Risk | high blast radius | score ≥ 50 → 2 approvals |

Acceptance Criteria (engineering hand-off)

  • PDP service with /validate, /decide, /risk implemented (REST + gRPC), OTEL enabled.
  • DSL parser & evaluator with helpers (withinWindow, isSecretRef, nextWindow).
  • Bundle signer/loader + hot-reload on PolicyUpdated.
  • Gateway ext_authz integration mapping route → operation.
  • Approvals service honoring obligations with SoD and scheduling; Studio UI surfaces requirements.
  • End-to-end tests for publish with approvals, out-of-window scheduling, and rollback governance.

Solution Architect Notes

  • Choose ECS DSL as primary for readability; keep OPA compatibility flag for enterprises with existing Rego policies.
  • Define a policy pack per edition and allow tenant overrides only by additive (stricter) rules, never weakening base controls.
  • Add simulation mode in Studio: run decide against a proposed change to preview approvals/windows before requesting them.

Security Architecture — data protection, secrets, KMS, key rotation, multi-tenant isolation, threat model

Objectives

Establish a defense-in-depth security design for ECS aligned to ConnectSoft’s Security-First, Compliance-by-Design approach. This section operationalizes data protection, secrets/KMS, key rotation, multi-tenant isolation, and a threat model that maps to concrete controls, runbooks, and acceptance criteria.


Trust Boundaries & Security Plan

flowchart LR
  subgraph Internet
    User[Studio Users]
    SDKs[SDKs/Services]
  end

  WAF[WAF/DoS Shield]
  Envoy["API Gateway (JWT/JWKS, RLS, ext_authz)"]
  YARP["YARP (BFF/Internal Gateway)"]

  subgraph Core[ECS Services]
    REG[Config Registry]
    POL[Policy Engine (PDP)]
    ORC[Refresh Orchestrator]
    HUB[Provider Adapter Hub]
  end

  CRDB[(CockroachDB)]
  REDIS[(Redis Cluster)]
  BUS[(Event Bus)]
  KMS[(Cloud KMS/HSM)]
  VAULT[(Secret Stores: KeyVault/SecretsManager)]
  LOGS[(Audit/Telemetry Store)]

  User-->WAF-->Envoy-->YARP-->REG & POL & ORC
  SDKs-->WAF-->Envoy
  REG--mTLS-->CRDB
  REG--mTLS-->REDIS
  ORC--mTLS-->BUS
  HUB--mTLS-->Providers[(Azure/AWS/Consul/Redis/SQL)]
  REG--sign/enc-->LOGS
  Services--KEK/DEK ops-->KMS
  Services--secret refs-->VAULT

Controls by boundary

  • Edge: WAF + IP reputation, geo & residency gate, TLS 1.2+, JWT validation (aud/iss/exp/nbf), rate limits.
  • Service mesh: mTLS (SPIFFE/SPIRE or workload identity), least-privilege network policies.
  • Data: encryption at rest, envelope encryption for sensitive blobs, append-only audits, tamper-evident chains.

Data Classification & Protection

| Class | Examples | Storage | At Rest | In Transit | Additional |
|---|---|---|---|---|---|
| Public | API docs, OpenAPI | Object store/CDN | Provider default | TLS | CSP/SRI |
| Internal | Metrics, non-PII logs | LOGS | Disk enc. + log signing (optional) | TLS | PII redaction |
| Confidential | Config values (non-secret), diffs | CRDB JSONB | TDE/cluster enc | mTLS | ETag/ETag-hash only |
| Restricted | Secret refs, tokens, credentials | Vault/KMS | KMS-backed | TLS/mTLS | No plaintext in DB |
| Audit-critical | Approvals, policy decisions | CRDB (append-only) | Enc + hash-chain | TLS/mTLS | WORM export optional |

Policy: No plaintext secrets are persisted in Snapshots; only kvref URIs (see below).


Secrets Architecture

Secret References (kvref://)

  • Format: kvref://{provider}/{path}#version?opts
    • Example: kvref://vault/billing/db-conn#v3
  • Resolved at read time by Resolver/SDK with caller’s auth context; never materialized into snapshots.
  • Secret providers: Azure Key Vault, AWS Secrets Manager, optional HashiCorp Vault.

Resolution flow

  1. Registry materializes canonical config with secret refs.
  2. SDK/Resolver sees kvref://… keys and fetches from provider using short-lived credentials (workload identity or token exchange).
  3. Values cached in SDK memory only (no persistent secret caching). TTL from provider.
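
A minimal sketch of the first half of step 2: parsing a kvref URI into provider, path, and version before dispatching to the matching secret-provider client (the provider calls themselves are out of scope here):

type KvRef = { provider: string; path: string; version?: string; opts: URLSearchParams };

// kvref://{provider}/{path}#version?opts — e.g. kvref://vault/billing/db-conn#v3
function parseKvRef(uri: string): KvRef {
  const url = new URL(uri);
  if (url.protocol !== "kvref:") throw new Error(`not a kvref URI: ${uri}`);
  // Options may trail the fragment (…#v3?ttl=60) or appear as a regular query string.
  const [version, fragOpts] = url.hash ? url.hash.slice(1).split("?") : [undefined, undefined];
  return {
    provider: url.host,
    path: url.pathname.replace(/^\//, ""),
    version: version || undefined,
    opts: new URLSearchParams(fragOpts ?? url.search),
  };
}

// parseKvRef("kvref://vault/billing/db-conn#v3")
// → { provider: "vault", path: "billing/db-conn", version: "v3", opts: (empty) }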

Guardrails

  • Schema keyword x-ecs-secretRef: true required for fields like connectionString.
  • Studio blocks plaintext entry; provides secret picker UI.

Key Management (KMS/HSM) & Crypto Posture

Key Hierarchy (Envelope Encryption)

graph TD
  CMK["Root Customer Managed Key (per region)"] -->|wrap/unwrap| TMK["Tenant Master Key (per tenant)"]
  TMK -->|wrap| DEK["Data Encryption Keys (resource-scoped)"]
  DEK -->|encrypt| SENS["Sensitive blobs (e.g., idempotency records with PII hints, optional)"]
  • CMK: Regional KMS/HSM (e.g., Azure Key Vault Managed HSM / AWS KMS CMK).
  • TMK: Derived per tenant; rotatable without re-encrypting data (envelope rewrap).
  • DEK: Per table/feature or per artifact class; rotated automatically on cadence.
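
A minimal sketch of the DEK layer using AES-256-GCM; wrapping the DEK with the tenant master key is a KMS/HSM operation and is only indicated in comments here:

import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypt a sensitive blob under a fresh data encryption key (DEK).
function encryptWithDek(plaintext: Buffer) {
  const dek = randomBytes(32);  // per-artifact DEK
  const iv = randomBytes(12);   // GCM nonce
  const cipher = createCipheriv("aes-256-gcm", dek, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Production: wrap `dek` with the TMK via KMS and persist only the wrapped form;
  // rotation then rewraps DEKs without re-encrypting the data itself.
  return { ciphertext, iv, tag, dek };
}

function decryptWithDek(ciphertext: Buffer, dek: Buffer, iv: Buffer, tag: Buffer): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", dek, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}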

Cryptographic Controls

  • TLS 1.2/1.3, strong ciphers; OCSP stapling at edge.
  • Hashing: SHA-256 for ETags and lineage hashes; HMAC-SHA256 for value integrity in Redis.
  • Optional: JWS signing of Snapshots/Audit exports (future toggle).

Rotation & Credential Hygiene

| Asset | Rotation | Method | Blast Radius Control |
|---|---|---|---|
| JWKS (IdP keys) | 90 days | Add new kid, overlap 24h, retire old | Edge monitors jwt_authn_failed |
| API Keys (webhooks) | 90 days | Dual-key period; HMAC header sha256= | Per-tenant scope |
| KMS CMK | Yearly | Automatic rotation | Rewrap TMKs opportunistically |
| TMK (per tenant) | 180–365 days | Rewrap DEKs in background | Tenant-scoped jobs |
| Service creds (bus/redis/sql) | 90 days | Workload identity preferred; otherwise secret rotation + pod bounce | Per namespace |
| Redis ACLs | 90 days | Dual ACL users; rotate, then drop | Per region |
| Adapter bindings | 90 days | Lease-based CredentialEnvelope.not_after | Auto-rebind |

Runbook excerpts

  • Key rollover: publish new JWKS; canary Envoy; monitor; retire old.
  • TMK rewrap: schedule low-traffic window; checkpoint progress; pause tenant writes if needed (rare).

Multi-Tenant Isolation

Data Plane

  • CRDB: REGIONAL BY ROW with tenant_id in PK; repository guards enforce tenant scoping.
  • Redis: key prefixes ecs:{tenant}:{env}:…; ACLs restrict prefixes per tenant for dedicated caches.
  • Events: CloudEvents carry tenantid extension; per-tenant topics/partitions to avoid cross-talk.

Control Plane

  • JWT claims: tenant_id, scopes; ext_authz checks at Envoy; PDP obligations enforce edition limits.
  • Studio: SoD & role checks; no cross-tenant UI elements.

Compute & Network

  • K8s namespaces per environment; NetworkPolicies deny east-west by default; adapters run out-of-process with mTLS.
  • SPIFFE identities: spiffe://ecs/{service}/{env}; RBAC by SPIFFE ID for inter-service calls.

Isolation Tests (every release)

  • Negative tests: attempt cross-tenant read/write via forged headers → expect 403.
  • Timing analysis: ensure response timing doesn’t leak tenant existence.
  • Event leakage: subscribe to foreign tenant topic → no messages.

Threat Model (STRIDE→Controls)

| Threat | Example | Control (Design) | Detective |
|---|---|---|---|
| **S**poofing | Forged JWT | Envoy JWT verify (aud/iss/kid); mTLS internal; token exchange with act chain | AuthN failure metrics, anomaly IP alerts |
| **T**ampering | Modify snapshot content | Append-only tables; ETag content hash; optional JWS signatures; change requires new version | Hash verification on read; audit hash-chain check |
| **R**epudiation | “I didn’t approve” | Signed decisions; SoD; immutable audit with actor sub@iss; IP/device fingerprint | Audit dashboards, integrity checks |
| **I**nformation disclosure | Secret leakage in logs | Secret refs only; structured logging with redaction; no secrets persisted | PII/secret scanners in CI and runtime |
| **D**oS | Resolve flood or publish storm | RLS per tenant; SDK backoff; long-poll waiter caps; cache SWR; circuit breakers | 429/latency SLO alerts, waiter pool saturation alarms |
| **E**levation of privilege | Dev acts as admin | Role->scope mapping; PDP policies; break-glass with time-box + extra audit | Privileged action audit, approval anomalies |
| Supply chain | Image tamper | SBOM + cosign signing; admission policy; CVE scanning; base images pinned | Image signature verification, scanner alerts |
| SSRF via Adapters | Malicious provider URL | Allowlist provider endpoints; egress policies; input validation & timeouts | Egress deny logs |
| Replay | Replayed webhook/API calls | HMAC with timestamp; Idempotency-Key for POST; short clock skew | Replay detector metrics |
| Data residency | EU data in US | Regional routing; crdb_region pin; residency policy at PDP | Residency audit reports |

Application-Level Security Controls

  • Input validation: JSON Schema enforcement server-side; size caps; content-type allowlist.
  • Output encoding/CORS: locked origins for Studio; headers X-Content-Type-Options, X-Frame-Options, Referrer-Policy.
  • Least-privilege IAM: service principals scoped to exact resources; adapter credentials per-binding.
  • Secure defaults: TLS required; HTTP downgraded traffic rejected; same-site cookies (Lax or Strict) for Studio.

Observability & Evidence

  • OTEL everywhere; security spans tag: tenantId, actor, decisionId, effect.
  • Security metrics: authz_denied_total, jwt_authn_failed_total, rate_limited_total, policy_decision_ms, secret_ref_resolution_ms.
  • Audit exports: signed/hashed; scheduled to WORM storage if enabled.

Incident Response (IR) & Playbooks

Key compromise (tenant TMK)

  1. Quarantine tenant (PDP deny writes; read-only).
  2. Rotate TMK; rewrap DEKs; re-issue creds.
  3. Force SDK token renewal; invalidate Redis keys; emit tenant-wide RefreshRequested.
  4. Forensics: export audit; notify tenant.

Web token abuse

  1. Revoke client app; rotate JWKS if issuer compromised.
  2. Invalidate refresh tokens; raise auth challenge requirements (MFA).
  3. Search logs for sub anomalies; notify tenant.

Data exfil suspicion

  • Activate access freeze policy; export and verify audit hash chain.
  • Run residency check; compare CRDB region distribution.

Compliance Hooks

  • Access reviews: quarterly role attestations; report who has tenant.admin/platform-admin.
  • Data subject requests (DSAR): audit index supports by-tenant export.
  • Pen-testing cadence: semi-annual + post-major release; findings tracked in risk register.
  • Third-party adapters: vendor security questionnaire + sandbox tenancy.

Acceptance Criteria (engineering hand-off)

  • mTLS enabled service-to-service (SPIFFE) with policy-enforced identities.
  • Secret refs (kvref://) enforced by schema; resolver libraries for .NET/JS/Mobile implemented.
  • KMS envelope encryption library with CMK/TMK/DEK operations + rewrap CLI.
  • Rotation jobs & runbooks codified (JWKS, TMK, service creds); canary + rollback procedures.
  • Tenant isolation tests automated in CI; event leakage tests in staging.
  • WAF, RLS, ext_authz configured; security dashboards and alerts live.
  • SBOM generation and image signing required in CI; admission policy verifies signatures.

Solution Architect Notes

  • Evaluate confidential computing (TEE) for server-side resolution of secrets for high-assurance tenants.
  • Consider per-tenant Redis in Enterprise for stronger isolation at cost increase.
  • Add JWS snapshot signing behind a feature flag to strengthen non-repudiation.
  • Plan red team exercise focused on adapter SSRF and multi-tenant boundary bypass.

Observability — OTEL spans/metrics/logs, trace taxonomy, dashboards, SLOs/SLIs/alerts

Objectives

Deliver an observability-first design that provides end-to-end traces, actionable metrics, and structured logs across ECS services (Gateway, YARP, Registry, Policy Engine, Refresh Orchestrator, Adapter Hub, SDKs). Define trace taxonomy, metrics catalogs, dashboards, and SLOs/SLIs/alerts with multi-tenant visibility and low-cardinality guardrails.


Architecture & Dataflow

flowchart LR
  subgraph Client
    SDKs[SDKs / Studio]
  end

  WAF[WAF/DoS]
  Envoy["Envoy (Edge)"]
  YARP["YARP (BFF)"]
  REG[Config Registry]
  PDP["Policy Engine (PDP)"]
  ORC[Refresh Orchestrator]
  HUB[Provider Adapter Hub]
  CRDB[(CockroachDB)]
  REDIS[(Redis)]
  BUS[(Event Bus)]

  COL[OTEL Collector]
  TS[(Time-series DB)]
  LS[(Log Store/Index)]
  TMS[(Tracing DB)]
  APM[Dashboards/Alerting]

  SDKs-->WAF-->Envoy-->YARP-->REG
  REG-->PDP
  REG-->REDIS
  REG-->CRDB
  REG-->BUS
  ORC-->BUS
  HUB-->Providers[(External Providers)]
  HUB-->BUS

  Envoy--OTLP-->COL
  YARP--OTLP-->COL
  REG--OTLP-->COL
  PDP--OTLP-->COL
  ORC--OTLP-->COL
  HUB--OTLP-->COL
  COL-->TS
  COL-->LS
  COL-->TMS
  APM---TS
  APM---TMS
  APM---LS

Collector topology

  • DaemonSet + Gateway collectors; tail-based sampling at gateway.
  • Pipelines: traces (OTLP → sampler → TMS), metrics (OTLP → TS), logs (OTLP/OTLP-logs → LS).
  • Exemplars: latency histograms carry trace IDs to deep-link from metrics to traces.

Trace Taxonomy & Propagation

Naming (verb-noun, dot-scoped)

| Layer | Span name | Kind |
|---|---|---|
| Edge | ecs.http.server (Envoy) | SERVER |
| BFF | ecs.http.proxy (YARP routeId) | SERVER/CLIENT |
| Registry | ecs.config.resolve, ecs.config.publish, ecs.config.diff | SERVER |
| PDP | ecs.policy.decide, ecs.policy.validate | SERVER |
| Orchestrator | ecs.refresh.fanout, ecs.refresh.invalidate | INTERNAL |
| Adapter Hub | ecs.adapter.put, ecs.adapter.watch | CLIENT/SERVER |
| DB | db.sql.query (CRDB) | CLIENT |
| Redis | cache.get, cache.set, cache.lock | CLIENT |
| Bus | messaging.publish, messaging.process | PRODUCER/CONSUMER |
| SDK | ecs.sdk.resolve, ecs.sdk.subscribe | CLIENT |

Required attributes (consistent keys)

  • tenant.id, env.name, app.id, edition.id, plan.tier
  • route, http.method, http.status_code, grpc.code
  • user.sub (hashed), actor.type (user|machine)
  • config.path or config.set_id, etag, version
  • policy.effect (allow|deny|obligate), policy.rules_matched
  • cache.hit (true|false|swr), cache.layer (memory|redis)
  • db.statement_sanitized (true)
  • rls.result (ok|throttled)
  • error.type, error.code

Context propagation

  • W3C Trace Context: traceparent + tracestate
  • Baggage (limited): tenant.id, env.name, app.id (≤3 keys)
  • CloudEvents: include traceparent extension; correlationid mirrors trace ID for non-OTLP consumers.
  • Headers surfaced to services: x-correlation-id (stable), traceparent (for logs).

Sampling

  • Head sampling default 10% for healthy traffic (dynamic).
  • Tail-based rules (max 100%):
    • http.status >= 500 or grpc.code != OK
    • Latency above p95 per route
    • policy.effect != allow
    • route in [/deployments, /snapshots, /refresh]
    • Tenant allowlist (troubleshooting sessions)

Metrics Catalog (SLI-ready, low-cardinality)

Dimensions used widely: tenant.id (hashed), env.name, region, route|rpc, result. Avoid key/path labels.

Edge & Gateway

  • http_requests_total{route,code,tenant} (counter)
  • http_request_duration_ms{route} (histogram with exemplars)
  • ratelimit_decisions_total{tenant,result} (counter)

Registry / Resolve path

  • resolve_requests_total{result} (counter: hit|miss|304|200)
  • resolve_duration_ms{result} (histogram)
  • cache_hits_total{layer} (counter)
  • waiter_pool_size (gauge), coalesce_ratio (gauge)
  • etag_mismatch_total (counter)

Publish / Versioning

  • publish_total{result} (counter)
  • publish_duration_ms (histogram)
  • diff_duration_ms (histogram), snapshot_size_bytes (histogram)
  • policy_block_total{reason} (counter)

PDP

  • pdp_decisions_total{effect} (counter)
  • pdp_duration_ms (histogram)
  • pdp_cache_hit_ratio (gauge)

Refresh Orchestrator & Eventing

  • refresh_events_total{type} (counter)
  • propagation_lag_ms (histogram: publish→first 200 resolve)
  • ws_connections{tenant} (gauge), ws_reconnects_total (counter)
  • dlq_messages{queue} (gauge), replay_total{result} (counter)

Adapters

  • adapter_ops_total{provider,op,result} (counter)
  • adapter_op_duration_ms{provider,op} (histogram)
  • adapter_throttle_total{provider} (counter)
  • adapter_watch_gaps_total{provider} (counter)

Data stores

  • CRDB: sql_qps, txn_restarts_total, replica_leaseholders_ratio, kv_raft_commit_latency_ms
  • Redis: hits, misses, latency_ms, evictions, blocked_clients

Logs (structured, privacy-safe)

Envelope (JSON)

{
  "ts":"2025-08-25T10:21:33.410Z",
  "sev":"INFO",
  "svc":"registry",
  "msg":"Resolve completed",
  "trace_id":"d1f0...",
  "span_id":"a21e...",
  "corr_id":"c-77f2...",
  "tenant.id":"t-***5e",
  "env.name":"prod",
  "route":"/api/v1/resolve",
  "duration_ms":42,
  "cache.hit":true,
  "etag":"9oM1hQ...",
  "user.sub_hash":"u-***9a",
  "code":"OK",
  "extra":{"coalesced":3}
}

Redaction & hygiene

  • Never log config values or secrets; keys hashed when necessary.
  • PII hashed/salted; toggle dev sampling only in non-prod.
  • Enforce log schema via collector processors; drop fields not in schema.

Dashboards (role-focused)

1) SRE — “ECS Golden Signals”

  • Edge: request rate, error %, p95 latency by route, 429 rate.
  • Resolve: hit ratio (memory/redis), resolve p95, waiter size/coalesce ratio.
  • Propagation: publish→first resolve lag p95/p99, WS connections, DLQ size.
  • DB/Cache: CRDB restarts & raft latency, Redis latency & evictions.
  • Burn-rate tiles for SLOs (see below) with auto-links to exemplar traces.

2) Product/Platform — “Tenant Health”

  • Per-tenant overview: usage vs quota, top endpoints, rate-limit hits, policy denials.
  • Drill-down: deployment success rate, change lead time (draft→publish).

3) Security — “Policy & Access”

  • PDP decision mix, denied reasons, break-glass actions, secret-ref validation failures.
  • Geo & residency conformance (data access by region).

4) Adapters & Eventing

  • Provider op success/latency by adapter, throttle events, watch gaps.
  • Event bus throughput, consumer lag, DLQ trends, replay outcomes.

SLOs, SLIs & Alerts

Service SLOs (initial targets)

| Service/Path | SLI | SLO (rolling 30d) | Error budget |
|---|---|---|---|
| Resolve (SDK) | Availability = 1−(5xx+network errors)/all | 99.95% | 21m/mo |
| Resolve (SDK) | Latency p95 (in-region) | ≤ 250 ms | N/A |
| Publish (Studio/API) | Success rate for publish operations | 99.5% | 3.6h/mo |
| Publish (Studio/API) | Publish→first client refresh p95 | ≤ 5 s (Pro/Ent), ≤ 15 s (Starter) | N/A |
| PDP | Decision p95 | ≤ 5 ms | N/A |
| WS Bridge | Connection stability (drop rate per hour) | ≤ 1% | N/A |

Definitions

  • Propagation lag = time from ConfigPublished accepted to first successful client Resolve(200/304) with new ETag per tenant/env.

Multi-window burn-rate alerts (error-budget based)

  • Page: BR ≥ 14 over 5m and ≥ 7 over 1h
  • Warn: BR ≥ 2 over 6h and ≥ 1 over 24h
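
Burn rate is the observed error rate divided by the rate the SLO budget allows. A minimal sketch of the paging condition above against the 99.95% Resolve SLO:

// errorRatio: failed/total over the window; slo: e.g. 0.9995 for 99.95%
const burnRate = (errorRatio: number, slo: number) => errorRatio / (1 - slo);

function shouldPage(errorRatio5m: number, errorRatio1h: number, slo = 0.9995): boolean {
  return burnRate(errorRatio5m, slo) >= 14 && burnRate(errorRatio1h, slo) >= 7;
}

// shouldPage(0.01, 0.005) → burn rates 20 and 10 → both thresholds met → page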

Symptom alerts (key signals)

  • Resolve p95 > 300 ms (10m)
  • Rate-limit 429% > 2% (5m) for any tenant
  • Propagation lag p95 > 8 s (15m) or DLQ size > 100 for 10m
  • PDP decision p99 > 20 ms (10m)
  • Redis evictions > 0 (5m) sustained

All alerts annotate current incidents with links to exemplar traces and last deployments.


Collector & Policy (reference snippets)

Tail-based sampling (OTEL Collector)

receivers:
  otlp: { protocols: { http: {}, grpc: {} } }

processors:
  attributes:
    actions:
      - key: tenant.id
        action: hash
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: long-latency
        type: latency
        latency: {threshold_ms: 250}
      - name: important-routes
        type: string_attribute
        string_attribute: {key: http.target, values: ["/deployments", "/snapshots", "/refresh"]}
exporters:
  otlphttp/traces: { endpoint: "https://trace.example/otlp" }
  prometheus: { endpoint: "0.0.0.0:9464" }
  loki: { endpoint: "https://logs.example/loki/api/v1/push" }

service:
  pipelines:
    traces: { receivers: [otlp], processors: [attributes, tail_sampling], exporters: [otlphttp/traces] }
    metrics: { receivers: [otlp], processors: [], exporters: [prometheus] }
    logs:    { receivers: [otlp], processors: [attributes], exporters: [loki] }

Correlation & Troubleshooting Playbooks

  • Start from SLO tiles → click exemplar to open failing trace.
  • From trace: check policy.decide span (effect & rules), cache layer tags, db/sql timings.
  • If 429s spike: inspect ratelimit_decisions_total by tenant; apply temporary override or identify hot route.
  • Propagation issues: check DLQ and refresh.fanout spans; verify Redis latency; perform targeted replay.
  • Regional anomalies: compare leaseholder distribution and raft commit latency.

Cardinality & Cost Guardrails

  • Disallow high-cardinality labels: raw key/path, user IDs, stack traces in metrics.
  • Hash tenant IDs and user subjects; limit unique operation names.
  • Log retention: 14d (non-audit) default; 30–365d for audit in dedicated store.
  • Metrics retention: 13 months downsampled; traces 7–14d with error traces kept longer.

Acceptance Criteria (engineering hand-off)

  • OTEL SDKs enabled in all services and client SDKs (.NET/JS/Mobile) with required attributes.
  • Collector pipelines deployed (daemonset + gateway) with tail-based sampling and attribute hashing.
  • Metrics emitted per catalog; histograms with exemplars enabled.
  • Structured JSON log schema enforced; PII/secret redaction verified.
  • Dashboards delivered for SRE, Tenant Health, Security, Adapters/Eventing.
  • SLOs encoded in monitoring system with burn-rate alert policies and on-call rotations.
  • Runbooks published for 429 spikes, propagation lag, DLQ growth, PDP degradation.

Solution Architect Notes

  • Consider exemplars+RUM in Studio to tie user actions to backend traces.
  • Add feature-flagged debug sampling per tenant (temporary, auto-expires) to diagnose issues without global sampling changes.
  • Revisit metrics cardinality quarterly; adopt RED/USE dashboards for each microservice as standard templates.

Performance & Capacity — read-heavy scaling plan, Redis sizing, p99 targets, load profiles, KEDA/HPA

Objectives

Design a read-heavy, latency-bounded capacity plan that meets ECS SLOs under bursty, multi-tenant loads. This section specifies p99 targets, traffic models, Redis sizing, and autoscaling policies (KEDA/HPA) for Registry/Resolve APIs, Refresh Orchestrator, WS/Long-Poll bridges, and Adapter Hub.


Targets & Guardrails

| Path | Region | p95 | p99 | Notes |
|---|---|---|---|---|
| Resolve (cache hit) | in-region | ≤ 50 ms | ≤ 150 ms | CPU-bound; Redis round-trip avoided |
| Resolve (cache miss) | in-region | ≤ 200 ms | ≤ 400 ms | Redis hit + CRDB read-through |
| Long-poll wake → 200/304 | in-region | ≤ 1.5 s | ≤ 3 s | From publish to client response |
| WS push → client revalidate | in-region | ≤ 2.5 s | ≤ 5 s | Aligns with propagation SLOs |
| Publish (snapshot → accepted) | in-region | ≤ 1.0 s | ≤ 2.0 s | Excludes canaries/approvals |

Error budgets inherit from the Observability section: Resolve availability ≥ 99.95%.


Traffic Modeling (read-heavy)

Let:

  • N_t = tenants, A = apps/tenant, E = envs/app, S = sets/env, K = keys/set
  • H = cache hit ratio at SDK (L1) → target ≥ 0.85
  • R_sdk = average client read rate per instance (req/s) when polling/long-poll refreshes
  • C = active client instances

Resolve QPS to Gateway/Registry

  • Periodic pull: QPS = C × R_sdk × (1 − H)
  • Long-poll (recommended): steady-state QPS ≈ C × (timeout⁻¹) × (1 − H), with held connections ≈ C. Example: C=20k, timeout=30 s, H=0.9 → QPS ≈ 20k × (1/30) × 0.1 ≈ 66.7 RPS.
  • WS push + conditional fetch: QPS ≈ events × fanout_selectivity × (1 − H_etag); typical << long-poll.
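
A minimal calculator for the long-poll formula, reproducing the worked example (inputs: active clients, long-poll timeout, SDK hit ratio):

// Steady-state resolve QPS reaching Gateway/Registry under long-poll refresh.
function longPollQps(clients: number, timeoutSeconds: number, hitRatio: number): number {
  return clients * (1 / timeoutSeconds) * (1 - hitRatio);
}

console.log(longPollQps(20_000, 30, 0.9).toFixed(1)); // "66.7" RPS, matching the example above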

Memory for Long-Poll Waiters

  • Per waiter ~ 1–3 KB (request ctx + selector). With C=20k waiters → 20–60 MB of heap headroom per region (well within a small pool of pods).

Redis Sizing & Topology

Key formulas

  • Hot value size B (compressed on wire): target 8–32 KB (avg take 16 KB).
  • Cache entry ≈ B × overhead_factor, where overhead (key + object + allocator) ≈ 1.3–1.5; use 1.4.
  • Working set per region WS ≈ active_keys × B × 1.4.

Shard plan

  • Aim per shard ≤ 25 GB resident (for 64 GB nodes → memory headroom & AOF/RDB).
  • Shards per region = ceil(WS / 25 GB), then ×2 for replicas.

Example (Pro/Ent region)

  • Tenants actively reading: 1,000
  • Active keys/tenant hot: 800
  • active_keys = 1,000 × 800 = 800,000
  • WS ≈ 800,000 × 16 KB × 1.4 ≈ 17.9 GB
  • Shards needed: 1 (primary) + 1 (replica) → deploy 3 shards to allow headroom and growth, then rebalance.
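
A minimal sketch of the shard math, using decimal GB and the 1.4 overhead factor from the worked example; headroom shards beyond the computed count are added per the note above:

const GB = 1e9;

// Working set and primary shard count per region (25 GB resident cap per shard).
function redisShards(activeKeys: number, avgValueBytes: number, overhead = 1.4, shardCapGB = 25) {
  const workingSetBytes = activeKeys * avgValueBytes * overhead;
  const primaries = Math.ceil(workingSetBytes / (shardCapGB * GB));
  return { workingSetGB: +(workingSetBytes / GB).toFixed(1), primaries, withReplicas: primaries * 2 };
}

console.log(redisShards(800_000, 16_000)); // { workingSetGB: 17.9, primaries: 1, withReplicas: 2 }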

Throughput

  • Single shard (modern VM, network-optimized) sustains 80–120 k ops/s p99 < 2 ms for GET/SET under pipeline.
  • With hit ratio ≥ 0.9, Registry avoids CRDB on ≥ 90% of reads.

Policies

  • Eviction: volatile-lru (Enterprise) / allkeys-lru (Starter/Pro).
  • SWR ≤ 2 s to bound staleness and prevent stampedes.
  • Hash-tagging keys by {tenant} to isolate hotspots and ease resharding.

Pod Right-Sizing & Concurrency

| Component | CPU req/lim | Mem req/lim | Concurrency (target) | Notes |
|---|---|---|---|---|
| Registry/Resolve (.NET) | 250m / 1.5c | 512 Mi / 1.5 Gi | 400–600 in-flight | Kestrel + async IO, thread-pool min tuned |
| WS Bridge | 200m / 1c | 256 Mi / 1 Gi | 5k conns/pod | uWSGI/Kestrel; heartbeat @ 30 s |
| Refresh Orchestrator | 200m / 1c | 256 Mi / 1 Gi | 200 msgs/s | CPU-light, IO-heavy |
| Adapter Hub | 300m / 1.5c | 512 Mi / 1.5 Gi | 1k ops/s | Batch writes; backpressure aware |

Autoscaling (HPA v2 + KEDA)

Registry/Resolve — HPA with custom metrics

  • Primary: in-flight request gauge http_server_active_requests (target ≤ 400/pod)
  • Secondary: CPU target 60%
  • Tertiary: p95 latency guardrail (scale-out if > 200 ms for 5 m)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: ecs-registry }
spec:
  minReplicas: 4
  maxReplicas: 48
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_server_active_requests
        target:
          type: AverageValue
          averageValue: "400"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60

WS Bridge — HPA on connection count

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: ecs-ws-bridge }
spec:
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric: { name: ws_active_connections }
        target: { type: AverageValue, averageValue: "4000" }  # keep <4k/pod

Refresh Orchestrator — KEDA on bus depth/throttle

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: ecs-refresh-orchestrator }
spec:
  scaleTargetRef: { name: ecs-refresh-orchestrator }
  pollingInterval: 5
  cooldownPeriod: 60
  minReplicaCount: 2
  maxReplicaCount: 40
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: ecs.refresh.events
        namespace: sb-ecs
        messageCount: "500"          # 1 replica per 500 pending
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: propagation_lag_ms
        # Query is illustrative; point it at the propagation-lag histogram actually emitted.
        query: histogram_quantile(0.95, sum(rate(propagation_lag_ms_bucket[5m])) by (le))
        threshold: "3000"            # scale if lag > 3s

Adapter Hub — KEDA on ASB/Rabbit and provider throttles

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: ecs-adapter-hub }
spec:
  scaleTargetRef: { name: ecs-adapter-hub }
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: adapter_throttle_total
        # Illustrative query: any throttling observed in the last 5 minutes scales the Hub up.
        query: sum(increase(adapter_throttle_total[5m]))
        threshold: "0"               # scale up if throttling detected
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: adapter_ops_backlog
        query: sum(rate(adapter_ops_queued_total[1m]))
        threshold: "200"

Load Profiles & Playbooks

LP-1 Baseline Steady State

  • Long-poll connections equal to active clients; low QPS due to conditional 304s.
  • SLO focus: latency p99 and WS stability.
  • Action: monitor waiter pool size, Redis hit ratio ≥ 0.9.

LP-2 Publish Wave (canary→regional)

  • Burst of RefreshEvents, clients revalidate.
  • Action: Orchestrator coalesces events; Registry scales on in-flight requests; Redis lock singleflight enabled.

LP-3 Tenant Hotspot

  • One tenant updates many keys; selectivity high.
  • Action: Per-tenant rate limit clamps; Redis hash-tag isolates shard; temporary override if approved.

LP-4 Cold Start / Region Failover

  • Rapid connection re-establish; cache cold.
  • Action: Pre-warm Redis with last heads; disable aggressive scale-down; surge HPA caps increased for 30 min.

LP-5 Adapter Backpressure

  • Provider throttles (Azure/AWS).
  • Action: KEDA scales Hub, adjusts batch sizes; DLQ monitored; replay after throttle clears.

Capacity Examples (per region)

Scenario A — 20k clients, long-poll 30 s, H=0.9

  • Waiters: 20k; Gateway pods with 4k conns/pod → 5 pods minimum.
  • Resolve QPS: ~67 RPS; 4 Registry pods (400 in-flight each) → ample headroom.
  • Redis: working set ≈ 10–20 GB; 3-shard cluster (1 primary, 1 replica each) for headroom.

Scenario B — 100k clients, mixed WS/poll

  • 70% WS, 30% long-poll; H=0.85.
  • Long-poll QPS ≈ 30k × (1/30) × 0.15 ≈ 150 RPS.
  • WS Bridge: 100k conns total → 25 pods (4k/pod).
  • Registry: 8–12 pods depending on p95.
  • Redis working set ≈ 40–60 GB; 3–4 primaries + replicas.

Scale numbers must be validated in perf tests; start with headroom over 30-day peak.
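A what-if sketch for the capacity workbook mentioned in the acceptance criteria, using the same assumptions as the scenarios above (≈4k connections per gateway/WS pod, ≈400 in-flight resolves per Registry pod). Latency headroom is not modeled, so treat results as lower bounds to validate in perf tests.

# Capacity what-if sketch (assumptions mirror this section: ~4k connections
# per gateway/WS pod, ~400 in-flight resolves per Registry pod; latency
# headroom is NOT modeled, so treat results as lower bounds).
from math import ceil

def plan(clients: int, poll_interval_s: float = 30.0,
         hit_ratio: float = 0.9, long_poll_share: float = 1.0,
         conns_per_pod: int = 4000, inflight_per_registry: int = 400) -> dict:
    # Only long-poll revalidations that miss the cache reach the Registry.
    resolve_qps = clients * long_poll_share / poll_interval_s * (1 - hit_ratio)
    return {
        "resolve_qps": round(resolve_qps, 1),
        "connection_pods": ceil(clients / conns_per_pod),
        "registry_pods_min": max(4, ceil(resolve_qps / inflight_per_registry)),
    }

if __name__ == "__main__":
    print(plan(20_000, hit_ratio=0.90))                        # ≈ Scenario A: ~67 RPS, 5 connection pods
    print(plan(100_000, hit_ratio=0.85, long_poll_share=0.3))  # ≈ Scenario B: ~150 RPS, 25 connection pods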


CRDB Considerations (read-through misses)

  • Keep leaseholders local to region (REGIONAL BY ROW).
  • Target miss rate ≤ 10%; CRDB nodes sized for 2–5k QPS aggregated reads with p99 < 20 ms KV.
  • Monitor txn_restarts_total; if > 2%, adjust limits and slow publishers.

Performance Testing Plan

| Tool | Purpose | Profile |
|------|---------|---------|
| k6 / Locust | Resolve & long-poll | VUs ramp to 100k conns; p95/p99 latency |
| Vegeta | Publish bursts | 10–50 RPS publish; measure propagation lag |
| Custom WS harness | Push & reconnect storms | 100k connections, 1% churn/min |
| Redis benchmark | Cache ops | pipelined GET/SET @ 128B–64KB values |

Exit criteria

  • Sustained p99 Resolve within targets at 1.5× projected peak.
  • No DLQ growth during publish wave; propagation p95 ≤ 5 s.
  • HPA/KEDA converges within < 90 s of surge onset.

Cost & Efficiency Levers

  • Prefer long-poll default; enable WS for high-value tenants only.
  • SWR and TTL jitter reduce cache churn by 20–40% under waves.
  • Coalescing ratio target ≥ 3× (one backend fetch serves ≥ 3 waiters; see the sketch after this list).
  • Scale-to-zero for low-traffic Adapters via KEDA cooldown.
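
To illustrate the coalescing lever, a minimal asyncio singleflight sketch in which concurrent resolves for the same key share one backend fetch; fetch_from_backend is a hypothetical stand-in for the Registry/CRDB read path.

# Singleflight / request-coalescing sketch: concurrent resolves for the same
# cache key share one backend fetch, which is what drives the coalescing ratio.
import asyncio

class SingleFlight:
    def __init__(self):
        self._inflight: dict[str, asyncio.Future] = {}

    async def do(self, key: str, fetch):
        if key in self._inflight:                  # another waiter already fetching
            return await self._inflight[key]
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await fetch(key)
            fut.set_result(result)
            return result
        except Exception as exc:
            fut.set_exception(exc)
            raise
        finally:
            del self._inflight[key]                # next miss triggers a fresh fetch

async def fetch_from_backend(key: str) -> str:     # hypothetical backend call
    await asyncio.sleep(0.05)                      # simulate Registry/CRDB latency
    return f"value-for-{key}"

async def main():
    sf = SingleFlight()
    results = await asyncio.gather(*[sf.do("tenant-a/billing", fetch_from_backend) for _ in range(5)])
    print(results)   # five waiters, one backend fetch -> coalescing ratio 5x

if __name__ == "__main__":
    asyncio.run(main())

The coalescing ratio is simply waiters served divided by backend fetches performed over a window.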

Runbooks (snippets)

Latency p99 regression

  1. Check coalesce_ratio and waiter_pool_size; if low, enable singleflight tuning.
  2. Inspect Redis latency_ms and evictions; add shard if > 3 ms p95 or any evictions.
  3. Verify HPA not pinned; temporarily raise maxReplicas.

Propagation lag > 8 s

  1. Validate bus consumer lag; KEDA scale orchestrator.
  2. Check WS connections; if drops > 1%/h, investigate bridge GC/ephemeral port exhaustion.
  3. Redis lock contention > 10 ms → increase lock TTL and reduce batch size.

Acceptance Criteria (engineering hand-off)

  • HPA manifests for Registry and WS Bridge; KEDA ScaledObjects for Orchestrator and Adapter Hub committed.
  • Redis cluster charts with shard math and hash-tag strategy documented; alerts for evictions>0.
  • Perf harnesses & CI jobs executing weekly; publish p95/p99 & coalescing ratio to dashboards.
  • Capacity workbook with what-if calculators (tenants, clients, TTL, H) shared with SRE.
  • Runbooks for failover, surge, hot tenant, adapter throttle scenarios.

Solution Architect Notes

  • Maintain two autoscaling lanes: fast (in-flight, latency) and slow (CPU). Tune stabilization to avoid oscillation.
  • Revisit connection per pod limits after GC tuning; consider SO_REUSEPORT listeners for WS scale.
  • For very large tenants, consider enterprise per-tenant Redis or keyspace quotas to cap blast radius.
  • Evaluate async Resolve (server push of materialized blobs to edge caches) if hit ratio < 0.8 at scale.

Resiliency & Chaos — timeouts/retries/backoff, bulkheads, circuit breakers, chaos experiments & runbooks

Objectives

Engineer ECS to degrade gracefully under dependency faults, traffic spikes, and regional impairments. This section codifies timeouts/retries/backoff, bulkheads, circuit breakers, and a chaos program with experiments and runbooks. Defaults are opinionated yet overridable via per-service policy.


Resiliency Profiles (standardized)

# Resiliency profiles are shipped as config + code defaults; services load on boot.
profiles:
  standard:
    timeouts:
      http_connect_ms: 2000
      http_overall_ms: 4000
      grpc_unary_deadline_ms: 250          # Resolve path (in-region)
      grpc_batch_deadline_ms: 2000
      grpc_stream_idle_min: 60
      redis_ms: 200
      sql_read_ms: 500
      sql_write_ms: 1500
      bus_send_ms: 2000
    retries:
      strategy: "decorrelated_jitter"      # FullJitter/EqualJitter allowed
      max_attempts: 3
      base_delay_ms: 100
      max_delay_ms: 800
      retry_on: ["UNAVAILABLE","DEADLINE_EXCEEDED","RESOURCE_EXHAUSTED","5xx"]
      idempotency_required: true           # enforced by client for POST/PUT/DELETE
    circuit_breaker:
      sliding_window: "rolling"
      window_size_sec: 30
      failure_rate_threshold_pct: 50
      min_throughput: 20
      consecutive_failures_to_open: 5
      open_state_duration_sec: 30
      half_open_max_concurrent: 5
    bulkheads:
      max_inflight_per_pod: 600            # Registry
      per_tenant_max_inflight: 80
      redis_pool_size: 256
      sql_pool_size: 64
      http_pool_per_host: 128
    fallback:
      serve_stale_while_revalidate_ms: 2000
      long_poll_timeout_s: 30
      downgrade_to_poll_on_ws_fail: true
  conservative:                             # for cross-region or under incident
    timeouts:
      grpc_unary_deadline_ms: 400
      redis_ms: 300
    retries:
      max_attempts: 2
      base_delay_ms: 200
      max_delay_ms: 1200
    circuit_breaker:
      open_state_duration_sec: 60

Decorrelated jitter backoff (pseudo):

sleep = min(max_delay, random(b, sleep * 3))   # b = base_delay; start with sleep=b
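
The same policy as a runnable sketch using the standard-profile budgets (3 attempts, 100 ms base, 800 ms cap); op and the transient predicate are placeholders supplied by the caller.

# Decorrelated-jitter retry sketch matching the pseudo-code above:
# sleep = min(max_delay, random(base, previous_sleep * 3)), starting at base.
import random
import time

def retry_with_decorrelated_jitter(op, *, max_attempts=3,
                                   base_delay_ms=100, max_delay_ms=800,
                                   transient=lambda exc: True):
    sleep_ms = base_delay_ms
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception as exc:
            if attempt == max_attempts or not transient(exc):
                raise                              # budget spent or non-retryable error
            sleep_ms = min(max_delay_ms, random.uniform(base_delay_ms, sleep_ms * 3))
            time.sleep(sleep_ms / 1000.0)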

Timeouts, Retries & Backoff — per dependency

| Caller → Callee | Timeout | Retries (strategy) | Preconditions |
|-----------------|---------|--------------------|---------------|
| SDK → Registry (Resolve unary) | 250–400 ms deadline | 2–3 (jitter) | Always send If-None-Match ETag; idempotent |
| SDK ↔ WS Bridge (push) | Heartbeat 15–30 s, close after 3 misses | Reconnect with backoff 0.5–5 s | Resume token supported |
| Registry → Redis | 200 ms | 2 (jitter 50–250 ms) | Use singleflight lock on fill |
| Registry → CRDB (read) | 500 ms | 1 (if RETRY_SERIALIZABLE) | Read-only; bounded scans |
| Registry → CRDB (write) | 1.5 s | 2 (txn restart only) | Idempotency key on publish |
| Orchestrator → Bus | 2 s | 5 (exp jitter up to 60 s) | Outbox pattern ensures atomicity |
| Hub → Adapters | 1–3 s per op | 3 (provider-aware backoff) | Idempotency key mandatory |
| Adapter → Provider (watch) | Stream idle 60 min | Auto-resume with bookmark | Backpressure aware |
| Gateway → PDP | 5–20 ms | 0 (timeout only) | Fallback deny on timeout (safe default) |

Rules

  • Retries only on transient errors. Never retry on 4xx (except 429 with Retry-After).
  • Mutations require Idempotency-Key; otherwise retries are disabled.
  • Budget: total attempt duration ≤ caller timeout; avoid retry storms.

Bulkheads & Isolation

Concurrency bulkheads

  • Per-pod: cap in-flight requests (Registry ≤ 600), queue excess briefly (≤ 50 ms) then 429.
  • Per-tenant: token bucket at Gateway; defaults by edition (Starter 50 RPS, Pro 200, Ent 1000; burst ×2).
  • Per-dependency pools: separated HttpClient pools per host; Redis/SQL dedicated pools with backpressure.

Resource bulkheads

  • Thread pools: pre-warm worker count; limit sync over async.
  • Queue isolation: separate DLQ/parking lots per consumer to avoid global blockage.
  • Waiter map (long-poll): bounded dictionary; spill to 304 on exhaustion.

Blast-radius controls

  • Hash-tag keys {tenant} in Redis; circuit break per tenant first, then globally only if necessary.
  • Rate-limit publish waves (canary → region) via Orchestrator.

Circuit Breakers

State machine (per dependency, per tenant when applicable)

stateDiagram-v2
  [*] --> Closed
  Closed --> Open: failure rate > threshold OR 5 consecutive failures
  Open --> HalfOpen: after open_duration
  HalfOpen --> Closed: probe successes (>=3/5)
  HalfOpen --> Open: any failure
Hold "Alt" / "Option" to enable pan & zoom

Telemetry & policy

  • Emit ecs.circuit.state gauge {service, dependency, tenant?, state}.
  • Closed: normal.
  • Open: SDKs and services fail fast with fallback (serve stale, downgrade to poll, deny writes).
  • Half-open: limit concurrent probes to ≤5.
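
A compact sketch of the breaker state machine above (consecutive-failure trip, timed open state, capped half-open probes); the rolling failure-rate window and per-tenant keying are omitted for brevity.

# Circuit-breaker sketch: Closed -> Open on consecutive failures,
# Open -> HalfOpen after the open duration, HalfOpen -> Closed on probe successes.
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open; callers should use their fallback."""

class CircuitBreaker:
    def __init__(self, failures_to_open=5, open_duration_s=30.0,
                 half_open_max_concurrent=5, probe_successes_to_close=3):
        self.failures_to_open = failures_to_open
        self.open_duration_s = open_duration_s
        self.half_open_max = half_open_max_concurrent
        self.probes_needed = probe_successes_to_close
        self.state = "closed"
        self.consecutive_failures = 0
        self.probe_successes = 0
        self.half_open_inflight = 0
        self.opened_at = 0.0

    def call(self, op):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.open_duration_s:
                raise CircuitOpenError("fail fast; serve stale / downgrade to poll")
            self.state, self.probe_successes = "half_open", 0      # start probing
        probing = self.state == "half_open"
        if probing:
            if self.half_open_inflight >= self.half_open_max:
                raise CircuitOpenError("probe concurrency cap reached")
            self.half_open_inflight += 1
        try:
            result = op()
        except Exception:
            self.consecutive_failures += 1
            if probing or self.consecutive_failures >= self.failures_to_open:
                self.state, self.opened_at = "open", time.monotonic()   # trip
            raise
        else:
            if probing:
                self.probe_successes += 1
                if self.probe_successes >= self.probes_needed:
                    self.state = "closed"                               # recovered
            self.consecutive_failures = 0
            return result
        finally:
            if probing:
                self.half_open_inflight -= 1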

Degrade Modes & Fallbacks

| Fault | Degrade Mode | Fallback |
|-------|--------------|----------|
| WS Bridge down | Switch clients to long-poll; disable optional features (live diff) | Increase poll timeout to 45 s, jitter |
| Redis unavailable / high latency | Bypass Redis; SWR serve stale ≤ 2 s; coalesce requests | Tighten per-tenant concurrency; warm cache after recovery |
| CRDB read slowness | Raise Resolve deadlines to 400 ms; prefer cache; shed low-priority traffic | 429 on hot tenants; backpressure publishes |
| Bus throttling | Slow fan-out, batch invalidations | Replay after throttle; keep SDK long-poll working |
| Adapter throttling | Reduce batch size; backoff; serialize writes | Queue to outbox; dry-run validation only |
| PDP degraded | Gateway ext_authz TTL uses last good (≤ 60 s) | Fail-closed on admin ops; show banner in Studio |

Chaos Engineering Program

Steady-state hypothesis: Under defined faults, SLOs hold, or the system degrades predictably (bounded latency, clear errors), and recovers automatically without manual intervention.

Experiments Matrix

| ID | Fault Injection | Scope | Hypothesis | Success Criteria |
|----|-----------------|-------|------------|------------------|
| C-1 | Add 200 ms latency to Redis | Single region, 15 min | p99 Resolve ≤ 400 ms; cache hit ratio dips < 10% | No SLO breach; circuits stay closed; SWR < 2 s |
| C-2 | Redis node kill/failover | 1 shard primary | Operations continue; brief p99 blip | No data loss; recovery < 30 s; 0 evictions |
| C-3 | CRDB leaseholder move | Hot range | Miss path p99 ≤ 500 ms | Txn restarts < 3%; no timeouts |
| C-4 | Bus 429/ServerBusy | Orchestrator consumers | Propagation p95 ≤ 8 s | DLQ stable; replay after clear |
| C-5 | WS Bridge crash loop | Region | Clients auto-downgrade to long-poll | Connections restore; no error budget burn |
| C-6 | DNS blackhole to Adapter provider | Single binding | Hub circuits open per binding only | Other tenants unaffected; queued ops replay |
| C-7 | Partial partition (drop 10% packets) | Gateway↔Registry | Retries/backoff prevent storms | No cascading failure; HPA scales sanely |
| C-8 | Token/JWKS rollover mid-traffic | Edge | Zero auth errors beyond overlap | Authn failure rate < 0.1% spike |
| C-9 | Time skew 2 min on 10% pods | Mixed services | Token validation & TTLs tolerate the skew | No systemic 401/304 anomalies |
| C-10 | Region failover drill | Full region | RTO within playbook; SLOs in surviving regions | No cross-tenant leakage; clear comms |

Tooling

  • Layer 4/7 faults: Envoy fault filters, ToxiProxy.
  • Platform: chaos mesh or k6 chaos; K8s PDBs validated.
  • Data: CRDB workload generator, leaseholder move via ALTER TABLE ... EXPERIMENTAL_RELOCATE.

Schedule

  • GameDays quarterly, rotating owners; pre-approved windows.
  • Results recorded with hypotheses, evidence, fixes.

Runbooks (actionable)

RB-1 Resolve p99 > target (sustained 10 min)

  1. Dashboards: check cache_hits_total, redis.latency_ms, coalesce_ratio, waiter_pool_size.
  2. If Redis latency > 3 ms p95 → add shard, enable client-side SWR (2 s), increase Redis pool.
  3. If coalesce_ratio < 2 → raise waiter debounce to 300–500 ms.
  4. HPA: verify Registry replicas not capped; raise maxReplicas temporarily.

RB-2 429 spikes for a tenant

  1. Confirm token bucket at Gateway; inspect tenant QPS.
  2. If legitimate spike, increase burst temporarily; otherwise enable cooldown and inform tenant.
  3. Enable feature flag to reduce poll frequency for that tenant.

RB-3 DLQ growth in refresh pipeline

  1. Inspect latest DLQ messages (reason); if transient → bulk replay.
  2. If schema/contract mismatch → parking lot and open incident; restrict publish to canary.
  3. Scale Orchestrator via KEDA; check bus quotas.

RB-4 Adapter throttling

  1. Reduce batch size 50%, increase backoff to max 60 s.
  2. Mark binding degraded; notify tenant (Studio banner).
  3. After clear, replay pending ops; compare provider vs desired.

RB-5 WS instability (drop rate > 1%/h)

  1. Check ephemeral ports, GC pauses on pods; rotate nodes if needed.
  2. Switch affected tenants to long-poll via feature flag.
  3. Audit heartbeats & missed count; tune keep-alive timeouts.

RB-6 CRDB restart spikes / txn restarts > 3%

  1. Inspect hot ranges; consider index tweak or split.
  2. Increase SQL pool temporarily; ensure queries are parameterized and short.
  3. Scale Registry reads; verify leaseholder locality.

Implementation Guidance (services & SDKs)

.NET services

  • Use Polly (or built-in resilience pipeline in .NET 8) for retry/circuit/bulkhead.
  • One HttpClient per dependency/host; enable HTTP/2 for gRPC.
  • Redis: use pipelining and cancellation tokens; set SyncTimeout=redis_ms.

JS/TS SDK

  • Backoff via fetch wrapper; AbortController for deadlines.
  • WS reconnect with decorrelated jitter; resumeAfter cursor.
  • Persistent cache guarded by ETag; SWR toggled via server hint.

Mobile

  • Network reachability gating; background fetch limit; throttle under battery saver.

Config knobs (server)

  • All limits under resiliency.* with tenant/edition overrides; hot-reload on config change.
  • Emit current effective policy into metrics (resiliency_profile{name="standard"} gauge = 1).

Testing & Verification

  • Unit: retry/circuit behavior with virtual time; idempotency invariants.
  • Contract: simulate 429/503/UNAVAILABLE/DEADLINE_EXCEEDED across clients.
  • Load: soak tests 24h with chaos toggles (C-1..C-7).
  • Failover drills: at least semi-annual region evacuation exercises.

Acceptance Criteria (engineering hand-off)

  • Resiliency profile library packaged; services load standard by default with overrides.
  • Polly/Resilience pipelines configured for Registry, Orchestrator, Hub, WS Bridge with metrics and logs.
  • Circuit breaker telemetry and alerts active (ecs.circuit.state, failure_rate).
  • Chaos runners and manifests for experiments C-1..C-7 in staging; results documented.
  • Runbooks RB-1..RB-6 linked in on-call docs; PagerDuty alerts mapped to owners.

Solution Architect Notes

  • Keep retry budgets tight; retries amplify load—prefer fail fast + fallback.
  • Favor per-tenant breakers to preserve global health during hot-tenant incidents.
  • Extend chaos to client side (SDKs) in sandbox apps to validate downgrade paths (WS→poll, stale serves).
  • Consider adaptive concurrency (AIMD) at Gateway if 429s recur across many tenants.

Migration & Import — bootstrap pathways, bulk import, diff reconcile, blue/green config cutover

Objectives

Provide safe, repeatable pathways to bring existing configurations into ECS, reconcile differences, and execute a zero-downtime cutover to ECS-managed configuration. This section defines bootstrap options, bulk import contracts, three-way diff & reconcile, and blue/green cutover patterns with runbooks and acceptance criteria.


Migration Personas & Starting Points

| Persona | Starting System | Typical Shape | Primary Path |
|---------|-----------------|---------------|--------------|
| Platform Admin | Fresh tenant | No prior config | Bootstrap Empty (starter templates) |
| SRE/DevOps | Azure AppConfig / AWS AppConfig | Hierarchical keys + labels | Provider-Sourced Import via Adapter Hub |
| Backend Dev | Files in Git (JSON/YAML) | Namespaced files per env | File-Based Import (CLI/Studio) |
| Operations | Consul/Redis/SQL | Flat/prefix keys | Provider-Sourced Import + Mapping Rules |

Bootstrap Pathways

1) Bootstrap Empty (Templates)

  • Create tenant, apps, envs, namespaces.
  • Seed starter Config Sets and policy packs (edition overlays).
  • Protect with schema validation out of the gate.
sequenceDiagram
  participant Admin
  participant Studio
  participant Registry
  Admin->>Studio: Create Tenant/Apps/Envs (wizard)
  Studio->>Registry: POST /tenants/... (idempotent)
  Registry-->>Studio: Seed templates + policies
Hold "Alt" / "Option" to enable pan & zoom

2) Provider-Sourced Import (Adapters)

  • Use Adapter Hub to read from source (Azure/AWS/Consul/Redis/SQL).
  • Produce intermediate snapshot in ECS format.
  • Run validate + diff against ECS baseline; reconcile, then publish.

3) File-Based Import (CLI/Studio)

  • Upload ZIP containing manifest.yaml + configs/*.json|yaml.
  • CLI validates schema locally, calculates hash, and performs idempotent batch import.

Import Data Model & Contracts

Canonical Import Manifest (YAML)

apiVersion: ecs.migration/v1
kind: ImportBundle
metadata:
  tenantId: t-123
  source: azure-appconfig://appcfg-prod?label=prod
  changeId: mig-2025-08-25T10:00Z  # idempotency key
spec:
  defaultEnvironment: prod
  mappings:
    - sourcePrefix: "apps/billing"
      targetNamespace: "billing"
      environment: "prod"
      keyTransform: "stripPrefix('apps/billing/')"  # helpers: stripPrefix, toKebab, replace
      contentTypeRules:
        - match: "**/*.json"
          contentType: "application/json"
  items:
    - key: "apps/billing/db/connectionString"
      valueRef: "kvref://vault/billing/db-conn#v3"   # secrets as refs
      meta: { labels: ["prod","blue"] }
    - key: "apps/billing/featureToggles/enableFoo"
      value: true
      meta: { contentType: "application/json" }

Rules

  • Secrets: only valueRef (kvref://…) allowed for sensitive fields; plaintext rejected by policy.
  • Idempotency: metadata.changeId required; server dedupes full bundle and each batch chunk.
  • Size limits: default 5k items/bundle (configurable); chunks of ≤500 items.

REST & CLI

POST /v1/imports (multipart/zip or application/json)
Headers:
  Idempotency-Key: mig-2025-08-25T10:00Z
Response: { importId, statusUrl }

GET  /v1/imports/{importId}/status

CLI

ecsctl import apply ./bundle.zip --tenant t-123 --dry-run
ecsctl import plan ./bundle.zip   # prints diff summary
ecsctl import approve <importId>  # kicks reconcile+publish with policy gates

Three-Way Diff & Reconcile

States

  • Desired: Import bundle (or provider snapshot post-mapping)
  • Current: ECS head (latest published Snapshot/Version)
  • Last-Applied: Previous import’s applied hash (if any) to avoid flip-flop
flowchart LR
  D[Desired] --- R{Reconcile Engine}
  C[Current] --- R
  L[Last-Applied] --- R
  R --> Patch[JsonPatch + Semantic diff]
  Patch --> Plan[Change Plan: upserts/deletes/moves]
Hold "Alt" / "Option" to enable pan & zoom

Algorithm

  1. Normalize keys & canonicalize JSON.
  2. Compute structural diff (RFC 6902) and semantic annotations (breaking/additive).
  3. Build Change Plan:
    • Group by namespace/env.
    • Respect ignore rules (e.g., provider metadata keys).
    • For conflicts (key renamed vs deleted), prefer rename if mapping rule indicates.
  4. Validate with JSON Schema + Policy PDP → may yield obligations (approvals, canary).

Outcomes

  • Dry-run report: counts (add/remove/replace), risk score, policy obligations.
  • Apply mode: create Draft, then Snapshot on success (idempotent by content hash).
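
A simplified, key-level sketch of the three-way reconcile (the real engine also emits RFC 6902 patches, semantic annotations, and policy obligations); desired, current, and last-applied are plain dicts here.

# Three-way reconcile sketch: Desired vs Current vs Last-Applied -> Change Plan.
# Keys present in Last-Applied but now absent from Desired are deletions we own;
# keys we never applied are left untouched (avoids clobbering out-of-band edits).
def reconcile(desired: dict, current: dict, last_applied: dict) -> dict:
    plan = {"upserts": {}, "deletes": [], "unchanged": []}
    for key, value in desired.items():
        if current.get(key) != value:
            plan["upserts"][key] = value           # add or replace
        else:
            plan["unchanged"].append(key)
    for key in last_applied:
        if key not in desired and key in current:
            plan["deletes"].append(key)            # previously ours, now removed
    return plan

if __name__ == "__main__":
    desired = {"billing/db/pool": 50, "billing/featureToggles/enableFoo": True}
    current = {"billing/db/pool": 20, "billing/legacy/key": "x"}
    last_applied = {"billing/db/pool": 20, "billing/legacy/key": "x"}
    print(reconcile(desired, current, last_applied))
    # {'upserts': {'billing/db/pool': 50, 'billing/featureToggles/enableFoo': True},
    #  'deletes': ['billing/legacy/key'], 'unchanged': []}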

Bulk Import from Providers (Adapters)

sequenceDiagram
  participant Admin
  participant Hub as Adapter Hub
  participant A as Provider Adapter
  participant Reg as Registry
  Admin->>Hub: StartImport(tenant/env/ns, source)
  Hub->>A: List(prefix, pageSize=500)
  A-->>Hub: Items(page) + etags
  loop until done
    Hub->>Reg: POST /imports:chunk (batch=≤500)
    Reg-->>Hub: ChunkAccepted (hash)
    Hub->>A: nextPage()
  end
  Hub->>Reg: FinalizeImport(changeId)
  Reg-->>Admin: Plan ready (diff + obligations)
Hold "Alt" / "Option" to enable pan & zoom

Resiliency

  • Each chunk has idempotency key: <changeId>-<pageNo>.
  • Backpressure: throttle to respect provider quotas; exponential backoff on 429/ServerBusy.
  • DLQ for malformed items with replay token.

Blue/Green Config Cutover

Goal: Switch consumers from Blue (current alias) to Green (new snapshot) without downtime and with instant rollback.

Mechanics

  • Publish new snapshot → tag semver (e.g., v1.9.0).
  • Pin environment alias: prod-next → v1.9.0 (Green).
  • Canary rollout (optional): restrict RefreshEvents to a slice of services.
  • Promote: flip prod-current alias from Blue → Green.
  • Rollback: re-point prod-current to prior Blue snapshot (previous alias pointer).
sequenceDiagram
  participant Studio
  participant Registry
  participant Orchestrator
  Studio->>Registry: Tag v1.9.0; Alias prod-next -> v1.9.0
  Orchestrator-->>Services: Refresh(selectors=canary)
  Note right of Services: validate metrics & errors
  Studio->>Registry: Alias prod-current -> v1.9.0 (cutover)
  Orchestrator-->>All Services: Refresh(all)
  Studio->>Registry: Rollback (if needed) -> prod-current -> v1.8.3
Hold "Alt" / "Option" to enable pan & zoom

Guarantees

  • Cutover changes only alias pointers (constant-time).
  • SDKs always revalidate via ETag; if unchanged, 304 path is cheap.
  • Rollback emits ConfigPublished for the restored alias, preserving at-least-once semantics.
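
A tiny sketch of the alias mechanics showing why cutover and rollback are constant-time pointer flips; the emit hook is a hypothetical stand-in for the ConfigPublished event.

# Blue/green alias sketch: cutover and rollback only re-point prod-current.
class AliasStore:
    def __init__(self):
        self.aliases: dict[str, str] = {}       # alias name -> snapshot version
        self.history: list[tuple[str, str]] = []

    def set_alias(self, alias: str, version: str, emit=print):
        previous = self.aliases.get(alias)
        self.aliases[alias] = version                        # O(1) pointer flip
        self.history.append((alias, version))
        emit({"type": "ConfigPublished", "alias": alias,     # hypothetical event hook
              "version": version, "previous": previous})

    def rollback(self, alias: str, emit=print):
        prior = [v for a, v in self.history[:-1] if a == alias]
        if prior:
            self.set_alias(alias, prior[-1], emit)           # re-point to prior snapshot

if __name__ == "__main__":
    store = AliasStore()
    store.set_alias("prod-next", "v1.9.0")      # canary target
    store.set_alias("prod-current", "v1.8.3")   # blue
    store.set_alias("prod-current", "v1.9.0")   # cutover (green)
    store.rollback("prod-current")              # instant rollback to v1.8.3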

Workflows

A) File-Based Import → Plan → Approve → Cutover

  1. Prepare bundle (ecsctl import plan).
  2. Dry-run in Studio: policy results + risk score; attach change window if prod.
  3. Approve (SoD, required approvers).
  4. Apply: create Draft → Snapshot; tag vX.Y.Z.
  5. Canary (optional): prod-next → vX.Y.Z; verify metrics.
  6. Cutover: prod-current → vX.Y.Z.
  7. Audit: export plan, approvals, result.

B) Provider-Sourced Live Sync (One-time Migration)

  1. Use Adapter to List source keys; apply mappings.
  2. Run Reconcile; fix schema violations (add secret refs).
  3. Freeze writes at source (change window).
  4. Apply plan; cutover aliases.
  5. Lock ECS as source of truth (Optional: disable reverse sync).

Mapping & Transformation Rules

| Capability | Options |
|------------|---------|
| Key transforms | stripPrefix, replace(pattern,repl), toKebab, toCamel, lowercase |
| Label/env mapping | Provider labels → ECS environment or meta.labels[] |
| Content types | Infer from extension or explicit contentTypeRules |
| Secret detection | Regex + schema hints → require kvref:// |
| Ignore sets | Drop provider control keys (e.g., _meta, __system/*) |

Validation

  • Mapping rules tested in Preview panel with sampled keys.
  • Rules stored with ImportBundle to ensure reproducibility.

Safety & Idempotency

  • Idempotency-Key on bundle + chunks; server returns 200/AlreadyApplied if identical.
  • Snapshot creation idempotent by content hash; repeat imports do not create duplicates.
  • Write guards: In prod, publish requires approvals & change window policy pass.
  • Quotas: import throughput rate-limited per tenant; defaults 200 items/s.

Observability

  • Spans: ecs.migration.import.plan, ecs.migration.import.apply, ecs.migration.reconcile, ecs.migration.cutover.
  • Metrics:
    • import_items_total{result=applied|skipped|invalid}
    • reconcile_conflicts_total{type=rename|delete|typeMismatch}
    • cutover_duration_ms
    • rollback_invocations_total
  • Logs: structured records with changeId, bundleHash, planHash; no values logged.

Failure Modes & Recovery

| Failure | Symptom | Action |
|---------|---------|--------|
| Schema violation | Plan shows blocking errors | Fix mapping or schema; re-plan |
| Secret plaintext detected | Policy DENY | Convert to kvref://; re-plan |
| Provider throttling | Slow import | Hub backs off; resumes; no data loss |
| Partial apply due to timeout | Some chunks pending | Re-POST with same changeId; idempotent |
| Bad cutover | Error spikes | Flip alias back to Blue; open incident; analyze diff |

Runbooks

RB-M1 Plan & Dry-Run

  1. ecsctl import plan → review counts & policy summary.
  2. If risk ≥ threshold, escalate to Approver/Tenant Admin.

RB-M2 Approval & Apply

  1. Ensure change window active for prod.
  2. Approve in Studio; monitor pdp_decisions_total for obligations.
  3. Apply; verify publish_total{result="success"}.

RB-M3 Canary & Cutover

  1. Point prod-next to new version; watch propagation_lag_ms and service error %.
  2. If stable for N minutes, flip prod-current.
  3. If regression, rollback immediately (alias revert).

RB-M4 Provider Freeze & Final Sync

  1. Freeze writes on source; take final snapshot via adapter.
  2. Re-plan; apply minimal delta.
  3. Mark ECS authoritative; decommission source path.

Acceptance Criteria (engineering hand-off)

  • Import API + CLI support ZIP & JSON; enforces Idempotency-Key and size limits.
  • Reconcile engine implements three-way diff with semantic annotations and ignore sets.
  • Adapter Hub path supports paged list, chunked import, backoff, DLQ & replay.
  • Studio provides Plan view (diff + policy), Approval wiring, Cutover and Rollback buttons with audit.
  • Aliasing supports prod-next/prod-current conventions; cutover & rollback are O(1) pointer flips with events emitted.
  • End-to-end tests: file import, provider import, conflict resolution, blue/green cutover, rollback.

Solution Architect Notes

  • Prefer one-time import then lock ECS as source of truth to avoid dual-write drift; if bi-directional is unavoidable, enforce adapter watch + reconcile with clear owner.
  • Keep mapping rules versioned with import artifacts; they are part of compliance evidence.
  • For very large tenants, stage import by namespace and use canary cutovers per namespace to reduce risk.
  • Consider a read-only preview environment wired to prod-next for smoke testing with synthetic traffic before global cutover.

Compliance & Auditability — audit schema, retention, export APIs, SOC2/ISO hooks, PII posture

Objectives

Establish a tamper-evident, privacy-aware audit layer with clear retention policies, export/eDiscovery APIs, and baked-in hooks for SOC 2 and ISO 27001 evidence. Guarantee multi-tenant isolation, cryptographic integrity, and least-PII practices across all audit data.


Audit Model & Tamper Evidence

Event domains

  • Config Lifecycle: draft edits, validations, snapshots, tags/aliases, deployments, rollbacks.
  • Policy & Governance: PDP decisions, risk scores, obligations, approval requests/grants/rejects, change windows.
  • Access & Security: logins (Studio), token failures, role/permission changes, break-glass usage.
  • Adapters & Refresh: provider sync start/finish, drift detected, cache invalidations, replay actions.
  • Administrative: retention changes, exports, legal holds, backup/restore actions.

Canonical event (JSON Schema 2020-12 excerpt)

{
  "$id": "https://schemas.connectsoft.io/ecs/audit-event.json",
  "type": "object",
  "required": ["eventId","tenantId","time","actor","action","resource","result","prevHash","hash"],
  "properties": {
    "eventId": { "type": "string", "description": "UUIDv7" },
    "tenantId": { "type": "string" },
    "time": { "type": "string", "format": "date-time" },
    "actor": {
      "type": "object",
      "properties": {
        "subHash": { "type": "string" },          // salted hash, not raw subject
        "iss": { "type": "string" },
        "type": { "enum": ["user","service","admin","platform-admin"] },
        "ip": { "type": "string" }                // truncated / anonymized per policy
      }
    },
    "action": { "type": "string", "enum": [
      "Config.DraftEdited","Config.SnapshotCreated","Config.TagUpdated","Config.AliasUpdated",
      "Config.Published","Config.RolledBack","Policy.Decision","Policy.Updated",
      "Approval.Requested","Approval.Granted","Approval.Rejected",
      "Access.RoleChanged","Access.BreakGlass","Adapter.SyncCompleted","Refresh.Invalidate",
      "Export.Started","Export.Completed","Retention.Updated","Backup.Restore"
    ]},
    "resource": {
      "type": "object",
      "properties": {
        "type": { "enum": ["ConfigSet","Snapshot","Policy","Approval","Role","Adapter","Export","Tenant"] },
        "id": { "type": "string" },
        "path": { "type": "string" }             // never secret values; path hashed if sensitive
      }
    },
    "result": { "type": "string", "enum": ["success","denied","error"] },
    "diffSummary": { "type": "object", "properties": { "breaking": {"type":"integer"}, "additive":{"type":"integer"}, "neutral":{"type":"integer"} } },
    "policy": { "type":"object", "properties": { "effect":{"enum":["allow","deny","obligate"]}, "rulesMatched":{"type":"array","items":{"type":"string"}} } },
    "approvals": { "type":"array", "items": { "type":"object", "properties": { "bySubHash":{"type":"string"}, "at":{"type":"string","format":"date-time"}, "result":{"enum":["granted","rejected"]} } } },
    "etag": { "type": "string" },
    "version": { "type": "string" },
    "prevHash": { "type": "string" },
    "hash": { "type": "string" },                // SHA-256(prevHash || canonicalBody)
    "signature": { "type": "string" }            // optional JWS for daily manifests
  }
}

Hash chain & manifests

  • Per-tenant, per-day hash chain: every event stores prevHash and hash=SHA-256(prevHash||canonicalBody).
  • Daily manifest per tenant: { day, firstEventId, lastEventId, rootHash, count }, signed (JWS) with KMS key (Enterprise).
  • Verification: ecsctl audit verify --tenant t --from 2025-08-01 --to 2025-08-31 reconstructs chains and validates signatures.
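
A minimal sketch of the chain computation and verification; sorted-key JSON stands in for the production canonical form, and JWS manifest signing is out of scope.

# Audit hash-chain sketch: hash = SHA-256(prevHash || canonicalBody), per tenant/day.
import hashlib
import json

def canonical(body: dict) -> bytes:
    # Stand-in canonical form: compact JSON with sorted keys.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()

def chain(events: list[dict], genesis: str = "0" * 64) -> list[dict]:
    prev, out = genesis, []
    for body in events:
        h = hashlib.sha256(prev.encode() + canonical(body)).hexdigest()
        out.append({**body, "prevHash": prev, "hash": h})
        prev = h
    return out

def verify(chained: list[dict], genesis: str = "0" * 64) -> bool:
    prev = genesis
    for ev in chained:
        body = {k: v for k, v in ev.items() if k not in ("prevHash", "hash")}
        if ev["prevHash"] != prev or \
           ev["hash"] != hashlib.sha256(prev.encode() + canonical(body)).hexdigest():
            return False
        prev = ev["hash"]
    return True   # the final hash is the candidate rootHash for the daily manifest

events = [{"action": "Config.Published", "tenantId": "t-123"},
          {"action": "Config.RolledBack", "tenantId": "t-123"}]
assert verify(chain(events))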

Storage Tiers & Retention

| Tier | Store | Purpose | Default Retention (Starter / Pro / Enterprise) |
|------|-------|---------|-----------------------------------------------|
| Hot | CockroachDB (event_audit) | Recent queries, Studio timelines | 90d / 180d / 365d |
| Warm | Object store (Parquet, partitioned by tenantId/day) | eDiscovery, analytics | — / 12m / 36m |
| Cold/Archive | Object archive (WORM optional) | Long-term compliance | — / — / 7y |

Mechanics

  • Hot tier uses row-level TTL (ttl_expires_at) with hourly jobs.
  • Nightly compaction/export → Parquet (snappy), plus signed manifest & rootHash.
  • Legal hold flag on tenant prevents TTL purge and export deletion; records linked ticket id & actor.

Export & eDiscovery APIs

Filters & Query DSL

  • Filterable fields: time range, tenantId, environment, action, resource.type/id, actor.type, result, correlationId, policy.effect.
  • Simple DSL (AND by default, OR with |): time >= 2025-08-01 AND action:Config.Published AND result:success AND env:prod
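
An illustrative parse of the AND-only DSL path into filter tuples (the OR operator and value typing are left out):

# Minimal audit-query DSL sketch: split on AND, accept "field OP value" or "field:value".
import re

_TERM = re.compile(r"^\s*(\S+)\s*(>=|<=|>|<|=|:)\s*(.+?)\s*$")

def parse_query(dsl: str) -> list[tuple[str, str, str]]:
    filters = []
    for term in re.split(r"\s+AND\s+", dsl.strip()):
        m = _TERM.match(term)
        if not m:
            raise ValueError(f"unparseable term: {term!r}")
        field, op, value = m.groups()
        filters.append((field, "=" if op == ":" else op, value))
    return filters

print(parse_query("time >= 2025-08-01 AND action:Config.Published AND result:success AND env:prod"))
# [('time', '>=', '2025-08-01'), ('action', '=', 'Config.Published'),
#  ('result', '=', 'success'), ('env', '=', 'prod')]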

Endpoints

POST /v1/audit/exports
Body:
{
  "tenantId": "t-123",
  "query": "time >= 2025-08-01 AND action:Config.Published",
  "format": "parquet|csv|ndjson",
  "redaction": { "hashSubjects": true, "truncateIp": true, "dropFields": ["resource.path"] },
  "sign": true,                      // Enterprise: attach JWS over manifest
  "encryption": "kms://key-ref",     // optional server-side encryption for export files
  "notify": ["mailto:secops@tenant.com"]
}

GET  /v1/audit/exports/{exportId}/status
GET  /v1/audit/exports/{exportId}/download   // presigned URL, time-limited
DELETE /v1/audit/exports/{exportId}          // marks for deletion if not on legal hold

Streaming/listing

GET /v1/audit/events?from=...&to=...&action=...&pageSize=1000&pageToken=...

Export manifest (signed)

{
  "exportId":"e-7c..",
  "tenantId":"t-123",
  "range":{"from":"2025-08-01T00:00:00Z","to":"2025-08-15T23:59:59Z"},
  "count": 12876,
  "files":[{"path":"s3://.../part-0001.snappy.parquet","sha256":"..."}],
  "rootHash":"...", "signature":"eyJhbGciOiJQUzI..."  // optional
}

Audit Access Model

| Role | Capabilities |
|------|--------------|
| Viewer | Read hot timeline for own tenant; no export |
| Security Auditor | Create/download exports with redaction presets; verify chains |
| Tenant Admin | Manage retention (within policy bounds), legal holds, export keys |
| Platform Admin | Cross-tenant export (break-glass only, ticket required) |

Segregation of Duties: authors of changes cannot approve their own changes and cannot delete/alter audit data. All audit access is itself audited.


SOC 2 / ISO 27001 Hooks

Control coverage map (examples)

| Domain | Control Objective | Evidence Source |
|--------|-------------------|-----------------|
| Change Management | Approvals required; rollback procedures | Audit events Approval.*, Config.Published, manifests |
| Logical Access | Least privilege enforced | Role/permission exports, PDP Decision samples |
| Logging & Monitoring | Audit logging with integrity | Hash chains, signed manifests, collector configs |
| Encryption | Data at rest/in transit | KMS key inventory, TLS config exports |
| Backup & Recovery | Regular backups, restores tested | Backup job logs, restore runbooks & attestations |
| Vendor Management | Adapter access controls | Adapter binding audits, credential rotations |

Evidence automation

  • Weekly Evidence Pack job (per tenant, Enterprise): ZIP with
    • Signed audit manifest for the week
    • Access/role assignment CSV
    • Policy bundle digest + PDP cache etag
    • SLO dashboards (PDF exports)
    • Backup status & last restore drill summary
  • Webhooks: ecs.compliance.v1.EvidencePackReady for GRC integration.

PII Posture & Privacy

Principles

  • Minimize: audit stores metadata, never raw config values or secrets.
  • Pseudonymize: user identifiers stored as salted hashes (actor.subHash).
  • Masking: IPs truncated or anonymized per tenant policy; paths containing known PII patterns hashed.
  • Tagging: audit schema includes dataClass labels; exporters drop high-risk fields by default.

Data subject requests (DSAR)

  • DSAR search runs against audit metadata only; personal data is pseudonymized—responses include existence proofs without revealing sensitive content.
  • Erase: where law requires, erase user identifiers by rotating salt/mapper, preserving event integrity. Chain integrity is retained by erasing only derived PII fields, not event core.

Residency

  • Audit data written to region of tenant’s home region; exports enforce regional buckets.

Operational Schema & DDL (illustrative)

CREATE TABLE ecs.event_audit (
  tenant_id UUID NOT NULL,
  event_id UUID NOT NULL DEFAULT gen_random_uuid(),
  time TIMESTAMPTZ NOT NULL DEFAULT now(),
  actor_sub_hash STRING NOT NULL,
  actor_iss STRING NOT NULL,
  actor_type STRING NOT NULL,
  action STRING NOT NULL,
  resource_type STRING NOT NULL,
  resource_id STRING NULL,
  resource_path_hash STRING NULL,
  result STRING NOT NULL,
  diff_summary JSONB NULL,
  policy JSONB NULL,
  etag STRING NULL,
  version STRING NULL,
  prev_hash STRING NOT NULL,
  hash STRING NOT NULL,
  ttl_expires_at TIMESTAMPTZ NULL,
  correlation_id STRING NULL,
  crdb_region crdb_internal_region NOT NULL DEFAULT default_to_database_primary_region(gateway_region()),
  CONSTRAINT pk_event_audit PRIMARY KEY (tenant_id, time, event_id)
) LOCALITY REGIONAL BY ROW
  WITH (ttl = 'on',
        ttl_expiration_expression = 'ttl_expires_at',
        ttl_job_cron = '@hourly');

CREATE INDEX ix_audit_action ON ecs.event_audit (tenant_id, action, time DESC);
CREATE INDEX ix_audit_resource ON ecs.event_audit (tenant_id, resource_type, resource_id, time DESC);

Observability of the Audit Pipeline

  • Metrics: audit_events_total{result}, audit_chain_verify_failures_total, audit_export_jobs_running, audit_export_duration_ms, legal_holds_total.
  • Traces: ecs.audit.record, ecs.audit.export, ecs.audit.verify (linked to x-correlation-id).
  • Alerts:
    • Chain verification failures > 0 in 5m
    • Export job failures > 0 in 15m
    • Hot-tier backlog (TTL job lag) > 30m

Runbooks

RB-C1: Verify chain integrity for a tenant/day

  1. ecsctl audit verify --tenant t-123 --date 2025-08-20
  2. If fail: fetch daily manifest; recompute locally; compare rootHash.
  3. If mismatch persists: open SEV-2, freeze retention, snapshot hot partition, start forensics.

RB-C2: Respond to auditor request (SOC 2)

  1. Create Evidence Pack for period → POST /v1/audit/exports with sign=true.
  2. Attach role/permission export and change management approvals.
  3. Provide verification steps & public cert/JWKS link.

RB-C3: Legal hold

  1. Set tenant legalHold=true with ticket ID.
  2. Confirm TTL job skips held partitions; replicate warm data to hold bucket.
  3. Audit all actions with LegalHoldSet event.

Acceptance Criteria (engineering hand-off)

  • Audit events emitted for all enumerated domains with hash chain fields populated.
  • Row-level TTL and nightly Parquet exports with signed manifests (Ent).
  • Export/eDiscovery API implemented with query DSL, redaction presets, KMS encryption, and presigned downloads.
  • Role-gated access with SoD; all audit access audited.
  • Evidence Pack automation and webhook delivered.
  • ecsctl commands for verify, export, manifest show.
  • DSAR procedures documented; salt rotation mechanism implemented for pseudonymized identifiers.

Solution Architect Notes

  • Keep manifest signing feature-flagged for Pro; enforce for Enterprise tenants handling regulated workloads.
  • Consider Merkle tree roots per hour to enable partial range verification at scale.
  • Align export schemas with common GRC tools to avoid ETL (field names stable, enums documented).
  • Schedule quarterly integrity drills that verify a random sample of tenants and produce a signed attestation.

Cost Model & FinOps — storage/cache costs per tenant, egress, adapter costs, throttling strategies

Objectives

Define a transparent, meter-driven cost model and FinOps practices so ECS can:

  • Attribute platform costs per tenant (showback/chargeback).
  • Forecast and steer spend for storage, cache, egress, adapters, and observability.
  • Enforce edition-aware throttles and autoscaling that protect SLOs and budgets.

Cost Architecture (what we meter & how we allocate)

flowchart LR
  subgraph Meters
    RQ[Resolve Calls]
    WS[WS/LP Connection Hours]
    RBH[Redis Bytes-Hours]
    SBH["Storage Bytes-Hours (CRDB)"]
    EGR[Egress Bytes]
    EVT[Refresh Events]
    ADP["Adapter Ops (read/write/watch)"]
    KMS[KMS/Secrets Ops]
    OBS["Telemetry (traces/metrics/logs)"]
  end

  Meters --> UL["Usage Ledger (per-tenant, per-day)"]
  UL --> CE["Cost Engine (unit price table by region/provider)"]
  CE --> SB[Showback/Chargeback]
  CE --> BDG[Budgets & Alerts]
  CE --> FP["Forecasts (q/q)"]
Hold "Alt" / "Option" to enable pan & zoom

Usage Ledger (authoritative)

Per tenant/day we persist:

  • resolve_calls, resolve_egress_bytes
  • ws_connection_hours, longpoll_connection_hours
  • redis_bytes_hours (avg resident bytes × hours)
  • crdb_storage_bytes_hours (table + indexes)
  • snapshots_created, snapshot_bytes_exported
  • refresh_events_published, dlq_messages
  • adapter_ops_{provider}.{get|put|list|watch}, adapter_egress_bytes
  • kms_ops, secret_resolutions
  • otel_spans_ingested, metrics_series, logs_gb

All meters are derived from production telemetry; aggregation jobs roll up hourly → daily to bound cardinality.


Unit Price Table (configurable per region/provider)

| Meter | Unit | Key drivers | Example field names (in config) |
|-------|------|-------------|---------------------------------|
| CRDB storage | GB-hour | data + indexes | price.crdb.gb_hour[region] |
| Redis cache | GB-hour | memory footprint | price.redis.gb_hour[region] |
| Internet egress | GB | API payloads, WS traffic | price.egress.gb[region] |
| Event bus | 1K ops | publish + consume | price.bus.kops[region] |
| Adapter API | 1K ops | provider calls | price.adapter.{provider}.kops[region] |
| KMS/Secrets | 1K ops | decrypt/get | price.kms.kops[cloud], price.secrets.kops[cloud] |
| Observability | GB / span | logs/metrics/traces | price.obs.logs.gb, price.obs.traces.kspans |

Do not hardcode cloud list prices in code; load them from ops-config, per account/contract.


Per-Tenant Cost Formulas (parametric)

Let P_* be unit prices, and U_* the usage meters for tenant T in period D.

Cost_CRDB(T,D)   = U_storage_gb_hours * P_crdb_gb_hour
Cost_Redis(T,D)  = U_redis_gb_hours   * P_redis_gb_hour
Cost_Egress(T,D) = U_egress_gb        * P_egress_gb
Cost_Bus(T,D)    = (U_bus_publish + U_bus_consume) / 1000 * P_bus_kops
Cost_Adapter(T,D)= Σ_provider (U_adapter_ops_provider / 1000 * P_adapter_provider_kops)
Cost_KMS(T,D)    = U_kms_ops / 1000 * P_kms_kops + U_secret_gets / 1000 * P_secrets_kops
Cost_Obs(T,D)    = U_logs_gb * P_logs_gb + U_spans_k * P_traces_kspans + U_metrics_series * P_metrics_series
Total(T,D)       = Σ all components
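
The formulas transcribed into a sketch; usage and prices are illustrative dicts that would be loaded from the Usage Ledger and ops-config price tables in practice (the observability component is omitted here).

# Per-tenant cost sketch: apply the unit-price table to the Usage Ledger meters.
def tenant_cost(usage: dict, price: dict) -> dict:
    cost = {
        "crdb":   usage["storage_gb_hours"] * price["crdb_gb_hour"],
        "redis":  usage["redis_gb_hours"]   * price["redis_gb_hour"],
        "egress": usage["egress_gb"]        * price["egress_gb"],
        "bus":    (usage["bus_publish"] + usage["bus_consume"]) / 1000 * price["bus_kops"],
        "adapter": sum(ops / 1000 * price["adapter_kops"][provider]
                       for provider, ops in usage["adapter_ops"].items()),
        "kms":    usage["kms_ops"] / 1000 * price["kms_kops"]
                  + usage["secret_gets"] / 1000 * price["secrets_kops"],
    }
    cost["total"] = round(sum(cost.values()), 2)
    return cost

# Pro-tenant example matching the scenario later in this section (observability omitted).
usage = {"storage_gb_hours": 5760, "redis_gb_hours": 2160, "egress_gb": 28.6,
         "bus_publish": 1_000_000, "bus_consume": 1_000_000,
         "adapter_ops": {"azure-appconfig": 500_000}, "kms_ops": 0, "secret_gets": 0}
price = {"crdb_gb_hour": 0.01, "redis_gb_hour": 0.02, "egress_gb": 0.08,
         "bus_kops": 0.15, "adapter_kops": {"azure-appconfig": 0.50},
         "kms_kops": 1.0, "secrets_kops": 1.0}
print(tenant_cost(usage, price))   # total ≈ 653.09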

Attribution rules

  • Redis & CRDB bytes-hours: weighted by tenant_id partition sizes; Redis estimated from on-host key scans sampled each minute × hours.
  • Egress: sum response sizes at Gateway by tenant (compressed bytes on wire).
  • Adapter ops: counted at Adapter Hub (SPI calls); include retries (we also report retry rate to optimize).
  • Observability: charge low cardinality metrics as platform; high-volume logs/spans proportional to tenant action rate.

Unit Economics (starter baselines)

Define internal pricebook for showback and external chargeback (if applicable):

| Edition | Included monthly quota (guardrails) | Overage rate examples |
|---------|-------------------------------------|-----------------------|
| Starter | 5M resolve_calls, 5 GB egress, 1 GB-mo storage, 1 GB-mo Redis | per 1M resolves, per GB egress, per GB-mo storage/cache |
| Pro | 50M resolves, 50 GB egress, 10 GB-mo storage, 5 GB-mo Redis, 5M bus ops | tiered overage discounts |
| Enterprise | Custom commit, premium WS | contract-specific |

Quotas power shaping & alerts; they are not hard limits unless policy enforces.


Cost Dashboards & Budgets

Tenant Cost Overview

  • Cost by component (CRDB, Redis, Egress, Bus, Adapter, KMS, Obs)
  • Unit drivers (resolves, events, bytes) + efficiency KPIs:
    • cache_hit_ratio, avg_payload_bytes, retry_rate, propagation_lag
  • Budget progress bars and anomaly score (see below)

Platform FinOps

  • Cost per region, per edition, per service
  • Redis GB-hours vs hit ratio; CRDB GB-hours vs version growth
  • Egress by route; adapters ops by provider; top 10 hot tenants

Budgets & Alerts

  • Soft budget: 80% warn, 100% alert, 120% cap recommendation
  • Anomaly detection (day-over-day z-score) on: egress, adapter ops, logs GB
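
A minimal day-over-day z-score check of the kind referenced above; the 3-sigma threshold and 14-day window are assumptions.

# Cost-anomaly sketch: flag today's meter value if it is > 3 sigma from the
# trailing-window mean (egress GB, adapter ops, logs GB all use the same check).
from statistics import mean, pstdev

def is_anomalous(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return today != mu           # any deviation from a flat series is suspicious
    return abs(today - mu) / sigma > z_threshold

egress_gb_last_14_days = [27.1, 28.4, 26.9, 27.8, 28.0, 27.5, 28.2,
                          27.9, 28.6, 27.3, 28.1, 27.7, 28.3, 27.6]
print(is_anomalous(egress_gb_last_14_days, today=55.0))   # True -> raise budget alert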

Throttling & Cost-Shaping Strategies (edition-aware)

| Cost Driver | Strategy | Control Point |
|-------------|----------|---------------|
| Egress | ETag everywhere, gzip, gRPC compression, field projection (exclude unchanged) | Gateway/Resolve |
| Resolve QPS | Default long-poll; WS only for eligible tenants; increase poll interval under load | SDK feature flags |
| Redis GB-hours | Cap value size (≤128 KB), SWR ≤ 2 s, LRU policy, hash-tag by tenant | Registry/Redis |
| CRDB growth | TTLs on audit/diffs, archive exports to object store, semantic diffs to reduce payload | Registry/Jobs |
| Adapter ops | Batch list/put, backoff on 429, delta cursors, import windows | Adapter Hub |
| Observability | Sample traces (tail-based), logs sampling on INFO, keep structured metrics only | OTEL Collector |
| Bus ops | Coalesce refresh events per ETag & scope, suppress duplicates within 250 ms | Orchestrator |
| WS costs | Cap connections/tenant, idle timeouts, downgrade to long-poll when idle | WS Bridge |

Policy hooks

  • PDP can obligate tenants to reduce cost (e.g., raise poll interval) when over budget.

Example Tenant Scenarios (illustrative, not a quote)

Assume (for math only): P_redis_gb_hour=$0.02, P_crdb_gb_hour=$0.01, P_egress_gb=$0.08, P_bus_kops=$0.15, P_adapter_azapp_kops=$0.50.

Pro Tenant (monthly)

  • 30M resolves, avg payload 10 KB compressed, cache hit 90% → egress ≈ 3M * 10KB ≈ 28.6 GB
  • Redis avg resident 3 GB → 3 GB * 720 h = 2160 GB-h
  • CRDB storage avg 8 GB → 5760 GB-h
  • Bus ops 2M (publish+consume)
  • Adapter AzureAppConfig ops 500k

Costs

  • Redis: 2160 * 0.02 = $43.20
  • CRDB: 5760 * 0.01 = $57.60
  • Egress: 28.6 * 0.08 ≈ $2.29
  • Bus: 2,000k/1,000 * 0.15 = $300
  • Adapter: 500k/1,000 * 0.50 = $250
  • Subtotal ≈ $653 (+ observability overhead if tenant-heavy)

Actionables to reduce: raise hit ratio to 92%, coalesce refreshes, batch adapter writes.


Optimization Playbook (FinOps levers)

  1. Reduce payload size: prune keys, compress, project; target ≤12 KB p50.
  2. Increase cache hits: ensure SDKs default long-poll/WS + ETag; fix chatty clients.
  3. Right-size Redis: measure redis_bytes_hours; add shards only if latency p95 > 3 ms or evictions > 0.
  4. Trim storage: enforce TTLs on diffs/audit (per edition), export old versions to Parquet.
  5. Adapter efficiency: switch to delta sync; widen polling intervals; schedule imports in off-peak windows.
  6. Observability costs: cap log verbosity, keep histograms with exemplars; drop high-cardinality labels.
  7. Network: prefer in-region access (avoid x-region egress); push WS only for active namespaces.

Governance: Showback/Chargeback

  • Showback: monthly PDF/CSV per tenant with:
    • Usage meters, effective unit prices, total by component, trend chart, anomaly notes.
  • Chargeback (optional): map to SKU:
    • CFG-READ (per 1M resolves), CFG-STOR (per GB-mo), CFG-CACHE (per GB-mo), CFG-EGR (per GB), CFG-ADP-{provider} (per 1K ops).
  • Contract hooks: Enterprise tenants can pre-commit capacity (discounted) with alerts if sustained > 120% for 3 days.

Automation & Data Flow

  • Tagging/Labels: all resources include cs:tenant, cs:service, cs:env, cs:edition.
  • Export jobs push Usage Ledger to the cost platform (per cloud): CUR/BigQuery/Billing export + our enrichment.
  • Forecasting: Holt-Winters on resolves/egress to project 30/90-day spend; include seasonality from release cadence.

Cost-Anomaly Runbooks

RB-F1 Egress Spike

  1. Dashboard → per-route egress; find tenant & path.
  2. Check 304 rate; if low, investigate missing ETags or payload bloat.
  3. Temporarily raise client poll interval; enable field projection.

RB-F2 Adapter Op Surge

  1. Identify provider & binding; check throttle events.
  2. Increase batch size if under cap; otherwise backoff and schedule window.
  3. Notify tenant; lock dual-write if drift source identified.

RB-F3 Redis GB-hours Up

  1. Inspect top key sizes & TTL; enforce value size cap.
  2. Increase SWR window to 2s; review pinning policies.
  3. If still high, move cold namespaces to document aggregation (fewer large values, fewer keys).

RB-F4 Observability Overrun

  1. Lower log sampling on INFO; ensure tail-based tracing active.
  2. Drop unused metrics; dedupe labels; compress logs.
  3. Add budget guard at Collector; alert if exceeded.

HPA/KEDA & FinOps Coupling

  • Scale on business metrics (in-flight requests, lag) not raw CPU only.
  • Scale-down protection when budgets are healthy and latency SLOs are green; avoid oscillation that increases cost via cache churn.
  • Scheduled scale for known peaks (releases) to reduce reactive over-scale.

Acceptance Criteria (engineering hand-off)

  • Usage Ledger schema implemented; daily rollups persisted and queryable per tenant.
  • Cost Engine reads regional price tables; produces per-tenant daily cost with component breakdown.
  • Dashboards: Tenant Cost, Platform FinOps, Anomalies; budgets & alerts wired.
  • Edition quotas and PDP obligations to shape high-cost behaviors (poll interval, WS eligibility).
  • Showback export (CSV/PDF) and API endpoints for /billing/usage.
  • Runbooks RB-F1..RB-F4 published; anomaly jobs (z-score) live.
  • Unit tests validate meters vs traces/metrics; backfills for late-arriving telemetry.

Solution Architect Notes

  • Keep unit prices externalized; treat FinOps math as configuration, not code.
  • Revisit pricebook SKUs quarterly; align with actual cloud invoices.
  • Consider per-tenant Redis only for Enterprise with extreme isolation needs; otherwise hash-tag + ACLs suffice.
  • Evaluate request-level compression dictionaries for large JSON sections if payloads dominate egress.
  • Add cost SLOs (e.g., $/1M resolves target) to drive continuous efficiency without harming latency SLOs.

Deployment Topology — AKS clusters, namespaces, regions, multi-AZ, blue/green & canary patterns

Objectives

Specify how ECS is deployed on Azure Kubernetes Service (AKS) with multi-region, multi-AZ resilience; enforce clean isolation via namespaces; and standardize progressive delivery (blue/green & canary) for services and the Studio UI, aligned to data residency and tenant tiers.


High-Level Topology

flowchart LR
  AFD[Azure Front Door + WAF] --> EG["Envoy Gateway (AKS)"]
  EG --> YARP[YARP BFF]
  EG --> REG[Config Registry]
  EG --> PDP[Policy Engine]
  EG --> ORC[Refresh Orchestrator]
  EG --> WS[WS/LP Bridge]
  EG --> HUB[Provider Adapter Hub]
  REG <--> CRDB[(CockroachDB Multi-Region)]
  REG <--> REDIS[(Redis Cluster)]
  ORC <--> ASB[(Azure Service Bus)]
  HUB <--> Providers[(Azure/AWS/Consul/Redis/SQL)]
  subgraph "AKS Cluster (per region)"
    EG
    YARP
    REG
    PDP
    ORC
    WS
    HUB
  end
Hold "Alt" / "Option" to enable pan & zoom

Edge & DNS

  • Azure Front Door (AFD) + WAF terminates TLS, performs geo-routing, and health probes.
  • Envoy Gateway in AKS handles per-route authn/authz, RLS, and traffic splitting for rollouts.

Environments, Clusters & Namespaces

| Environment | Cluster Strategy | Namespaces (examples) | Notes |
|-------------|------------------|-----------------------|-------|
| dev | 1 AKS per region (cost-optimized) | dev-system, dev-ecs, dev-ops | Lower SLOs; spot where safe |
| staging | 1 AKS per region | stg-ecs, stg-ops | Mirrors prod; synthetic traffic |
| prod | 2–3 AKS clusters per geo (one per region) | prod-ecs, prod-ops, prod-ecs-blue, prod-ecs-green | Blue/green via namespace swap |

Namespace policy

  • NetworkPolicies deny-all by default; allow only service-to-service with SPIFFE IDs.
  • Separate ops namespace for collectors, Argo CD/Rollouts, KEDA, Prometheus.

Regions & Data Residency

  • Minimum 2 regions per geo (e.g., westeurope + northeurope) in active-active for stateless services; CockroachDB spans both with REGIONAL BY ROW (tenant home region pinned).
  • Tenants tagged with homeRegion; AFD routes to nearest allowed region (residency enforced).

Multi-AZ & Scheduling Policy

AKS nodepools

  • np-system (DSv5) for control & gateways (taints: system=true:NoSchedule).
  • np-services (DSv5) for app pods (balanced across zones 1/2/3).
  • np-memq (Memory-optimized) for Redis if self-managed (recommend Azure Cache for Redis).
  • np-bg for blue/green surges (autoscaled on demand).

Workload policies

  • topologySpreadConstraints: zones 1/2/3, max skew 1.
  • PodDisruptionBudget: minAvailable 70% for stateless, 60% for WS Bridge.
  • nodeAffinity: pin CRDB/Redis clients to low-latency pools.
  • zone-aware readiness: Envoy routes only to pods Ready in the same zone by default.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector: { matchLabels: { app: ecs-registry } }
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector: { matchLabels: { app: ecs-registry } }
          topologyKey: kubernetes.io/hostname

Data Plane Deployments

CockroachDB (managed/self-hosted)

  • Multi-region cluster; leaseholders localized to tenant home region; REGIONAL BY ROW tables.
  • Zone redundancy: node groups across AZs 1/2/3 per region; PDB + podAntiAffinity.
  • Backups to region-local blob storage; cross-region async copies nightly.

Redis

  • Prefer Azure Cache for Redis (Premium/Enterprise) with zone redundancy.
  • For self-managed: Redis Cluster with 3 primaries × 3 replicas per region; AOF persistence with appendfsync everysec.
  • Key hash-tagging {tenant} to preserve per-tenant isolation.

Azure Service Bus

  • Premium namespace per geo; consumer groups per service; ForwardTo DLQ and parking-lot queues.

Delivery Patterns — Blue/Green & Canary

Service Blue/Green (namespace-based)

  • Each prod cluster hosts both prod-ecs-blue and prod-ecs-green.
  • AFD → Envoy splits traffic by header/cookie or percentage to Services in the target namespace.
  • Promote by flipping Envoy HTTPRoute/TrafficSplit to Green and preserving Blue for instant rollback.
# Envoy Gateway (Gateway API) HTTPRoute snippet
kind: HTTPRoute
spec:
  rules:
    - matches: [{ path: { type: PathPrefix, value: /api/ }}]
      filters:
        - type: RequestHeaderModifier # inject tenant headers if needed
      backendRefs:
        - name: ecs-registry-svc-green
          weight: 20
        - name: ecs-registry-svc-blue
          weight: 80

Canary (Argo Rollouts)

  • Strategy: 5% → 25% → 50% → 100% with Analysis between steps using Prometheus queries:
    • http_request_duration_ms{route="/resolve",quantile="0.99"} < 400
    • http_requests_error_ratio{route="/resolve"} < 0.5%
    • pdp_decision_ms_p99 < 20
  • Rollbacks on analysis failure; Blue remains untouched.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 3m }
        - analysis: { templates: [{ templateName: ecs-slo-checks }] }
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100

Studio & Static Assets

  • Blue/green buckets (cdn-blue, cdn-green) behind AFD; swap origin on promote; immutable asset hashes prevent cache poison.

Configuration Cutover (Runtime)

  • Use alias flip (prod-current → vX.Y.Z) independent of service rollout (see Migration section).
  • Coordinate service canary with config canary for risky changes.

Autoscaling & Surge Capacity

  • HPA for services (in-flight requests, CPU) with zone-balanced scale out.
  • KEDA for event consumers (Service Bus queue depth; propagation lag).
  • Maintain surge nodepool (np-bg) with cluster autoscaler max surge to absorb blue/green double capacity.

GitOps & Promotions

  • Argo CD per cluster watches environment repos: apps/dev, apps/stg, apps/prod.
  • Promotions are PR-driven:
    1. Build → sign image (cosign) → update Helm chart values in stg.
    2. Run smoke + e2e; Argo Rollouts canary passes gates.
    3. Promote to prod-ecs-green; automated canary.
    4. Flip Envoy weights to 100% Green; archive Blue after bake.

Security & Secrets in Topology

  • Azure Workload Identity for AKS ↔ AAD; pod-level identities fetch secrets via AKV CSI Driver.
  • mTLS mesh (SPIFFE IDs) between services; Envoy ext_authz to PDP.
  • Per-namespace NetworkPolicies and Azure NSGs block lateral movement.

DR & Failover

| Scenario | Action | RTO/RPO |
|----------|--------|---------|
| Single AZ loss | Zone spread + PDB continue service | 0 / 0 |
| Single region impairment | AFD shifts to healthy region; CRDB serves local rows; WS/long-poll reconnect | RTO ≤ 5 min / RPO 0 |
| CRDB region outage | Surviving region serves all read requests; writes for tenants pinned to failed region are throttled unless policy allows re-pin | RTO ≤ 15 min / RPO ≤ 5 min (if re-pin) |

Runbook hooks

  • Toggle read-only per tenant if their home region is down; optional temporary re-pin with audit.

Reference Manifests (snippets)

PodDisruptionBudget — Registry

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: pdb-ecs-registry, namespace: prod-ecs-green }
spec:
  minAvailable: 70%
  selector: { matchLabels: { app: ecs-registry } }

HPA — WS Bridge (connections)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: hpa-ws-bridge, namespace: prod-ecs-green }
spec:
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: ws_active_connections
        target:
          type: AverageValue
          averageValue: "4000"

NetworkPolicy — namespace default deny

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: prod-ecs-green }
spec:
  podSelector: {}
  policyTypes: ["Ingress","Egress"]

Observability for Rollouts

  • Rollout dashboards: error ratio, p95/p99 latency, WS reconnects, policy denials; burn-rate tiles during canary.
  • Exemplars link from latency histograms to canary pod traces.
  • AFD health + per-region SLIs visible to on-call; alerts wired to promotion pipeline.

Acceptance Criteria (engineering hand-off)

  • AKS clusters provisioned in 2+ regions, 3 AZs each; nodepools & taints in place.
  • Namespaces created with deny-all NetworkPolicies, SPIFFE/mTLS configured.
  • Envoy Gateway and Gateway API configured with traffic split; Argo Rollouts installed and integrated with Prometheus SLO checks.
  • Blue/green namespaces (prod-ecs-blue, prod-ecs-green) workable; promotion playbooks done.
  • HPA/KEDA policies committed for Registry, WS Bridge, Orchestrator, Adapter Hub.
  • AFD/WAF routes implemented with region & residency rules; health probes tied to Envoy readiness.
  • DR drills documented: AZ loss, region failover, CRDB re-pin; measured RTO/RPO recorded.

Solution Architect Notes

  • Prefer managed Redis and managed CockroachDB where available to simplify AZ operations.
  • Keep AFD origin groups per region to avoid cross-region hairpin under partial failures.
  • For high-value tenants, offer per-tenant canary labels (AFD header) to scope early traffic safely.
  • Consider Gateway API + Envoy advanced LB for zone-local routing to reduce cross-AZ latency and cost.

CI/CD & IaC — repo layout, pipelines, artifact signing, Helm/Bicep/Pulumi, env promotion policies

Objectives

Provide a secure-by-default, GitOps-first delivery system for ECS that:

  • Standardizes repo layout across services, SDKs, adapters, and infra.
  • Ships reproducible builds with SBOM, signatures, and provenance.
  • Uses Helm (K8s), Bicep (Azure), and optional Pulumi (multi-cloud/app infra).
  • Enforces environment promotion policies (SoD, approvals, change windows) integrated with the Policy Engine.

Repository Strategy

Repos (hybrid)

  • ecs-platform (monorepo): services (Registry, Policy, Orchestrator, WS Bridge, Adapter Hub), Studio BFF/UI, libraries.
  • ecs-charts: Helm charts & shared chart library.
  • ecs-env: GitOps environment manifests (Argo CD app-of-apps), per region/env.
  • ecs-infra: Azure infra (Bicep modules), optional Pulumi program for cross-cloud.
  • ecs-sdks: .NET / JS / Mobile SDKs.
  • ecs-adapters: provider adapters (out-of-process plugins), conformance tests.

Rationale: app code evolves rapidly (monorepo aids refactors); environment state and cloud infra are separated with stricter review controls.

Monorepo layout (ecs-platform)

/services
  /registry
  /policy
  /orchestrator
  /ws-bridge
  /adapter-hub
/libs
  /common (logging, OTEL, resiliency, auth)
  /contracts (gRPC/proto, OpenAPI)
/studio
  /bff
  /ui
/tools
  /build (pipelines templates, scripts)
  /dev (local compose, kind)
/.woodpecker | /.github | /azure-pipelines (CI templates)
/VERSION (semantic source of truth)

Build & Release Pipelines (templatized)

Pipeline stages (per service)

  1. Prepare: detect changes (path filters), restore caches (.NET/Node).
  2. Build: compile, unit tests, lint, license scanning.
  3. Security: SAST, dep scan (Dependabot/Snyk), container scan (Trivy/Grype).
  4. Package: container build (BuildKit), SBOM (Syft), provenance (SLSA Level 3+).
  5. Sign: cosign keyless (OIDC) or KMS-backed; attach attestations.
  6. Push: ACR (or GHCR) by immutable digest only.
  7. Deploy dev: update ecs-env/dev via PR (Argo CD sync).
  8. E2E/Contracts: run k6/gRPC contracts in dev namespace.
  9. Promote: open PRs to staging and then prod with gates (below).
flowchart LR
  A[Commit/PR]-->B[CI Build+Test]
  B-->C[Security Scans]
  C-->D[Image+SBOM+Provenance]
  D-->E[Cosign Sign]
  E-->F[Push to ACR (by digest)]
  F-->G[PR to ecs-env/dev]
  G-->H[ArgoCD Sync + E2E]
  H-->I{Gates Pass?}
  I--Yes-->J[PR to ecs-env/staging -> canary]
  J-->K[PR to ecs-env/prod -> blue/green]
  I--No-->X[Fail + Rollback]
Hold "Alt" / "Option" to enable pan & zoom

Example: GitHub Actions template (service)

name: ci-service
on:
  push: { paths: ["services/registry/**", ".github/workflows/ci-service.yml"] }
  pull_request: { paths: ["services/registry/**"] }
env:
  IMAGE: ghcr.io/connectsoft/ecs/registry
jobs:
  build:
    runs-on: ubuntu-22.04
    permissions:
      id-token: write     # OIDC for cosign keyless
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with: { dotnet-version: "8.0.x" }
      - uses: actions/cache@v4
        with:
          path: ~/.nuget/packages
          key: nuget-${{ runner.os }}-${{ hashFiles('**/*.csproj') }}
      - run: dotnet build services/registry -c Release
      - run: dotnet test services/registry -c Release --collect:"XPlat Code Coverage"
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with: { registry: ghcr.io, username: "${{ github.actor }}", password: "${{ secrets.GITHUB_TOKEN }}" }
      - uses: sigstore/cosign-installer@v3
      - name: Build & push image (digest is authoritative)
        run: |
          docker buildx build -t $IMAGE:$(git rev-parse --short HEAD) \
            --build-arg VERSION=${{ github.ref_name }} \
            -f services/registry/Dockerfile --provenance=true --sbom=true \
            --push --metadata-file build-meta.json .
          echo "DIGEST=$IMAGE@$(jq -r '."containerimage.digest"' build-meta.json)" >> "$GITHUB_ENV"
      - name: SBOM (Syft)
        uses: anchore/sbom-action@v0
        with:
          image: ${{ env.DIGEST }}
          registry-username: ${{ github.actor }}
          registry-password: ${{ secrets.GITHUB_TOKEN }}
          format: spdx-json
          output-file: sbom.json
      - name: Sign & attest (cosign keyless)
        run: |
          # buildx --provenance already attached SLSA provenance; attest the SBOM explicitly
          cosign sign --yes "$DIGEST"
          cosign attest --yes --type spdxjson --predicate sbom.json "$DIGEST"
      - name: Open PR to ecs-env/dev
        run: ./tools/build/update-image.sh ecs-env dev registry "$DIGEST"

Tagging scheme

  • Source: vX.Y.Z (SemVer) in /VERSION
  • Image: digest authoritative; tags for traceability: vX.Y.Z, sha-<7>
  • Chart: appVersion: vX.Y.Z, version: vX.Y.Z+build.<sha7>

Artifact Integrity: SBOM, Signing, Admission

  • SBOM: SPDX JSON produced at build; attached to image and stored in release assets.
  • Signing: cosign keyless (Fulcio) by default; Enterprise can use KMS-backed keys.
  • Provenance: SLSA level 3/3+ with GitHub OIDC or Azure Pipeline OIDC attestations.
  • Cluster admission: Kyverno/Gatekeeper policy requires:
    • image pulled by digest,
    • valid cosign signature from trusted issuer,
    • SBOM label present (org.opencontainers.image.sbom),
    • non-root, read-only FS, signed Helm chart (optional, helm provenance).

Example Kyverno snippet

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: verify-signed-images }
spec:
  rules:
    - name: check-cosign
      match: { resources: { kinds: ["Pod"] } }
      verifyImages:
        - imageReferences: ["ghcr.io/connectsoft/ecs/*"]
          attestors:
            - entries:
                - keyless:
                    issuer: "https://token.actions.githubusercontent.com"
                    subject: "repo:connectsoft/ecs-platform:*"
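
A companion rule (sketch) could enforce the digest-pinning requirement from the list above; it would sit under spec.rules of the same ClusterPolicy:

    - name: require-digest
      match: { resources: { kinds: ["Pod"] } }
      validate:
        message: "Images must be referenced by digest."
        pattern:
          spec:
            containers:
              - image: "*@sha256:*"   # every container image must carry a digest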

Helm Delivery Model

  • ecs-charts contains:
    • charts/ecs-svc (library chart with HPA, PDB, ServiceMonitor, NetworkPolicy, PodSecurity)
    • charts/registry, charts/policy, charts/orchestrator, etc. (wrap library chart)
  • Values layering:
    • values.yaml (defaults)
    • values.dev.yaml / values.stg.yaml / values.prod.yaml (env)
    • values.region.yaml (region overrides)
  • Argo CD renders Helm with env/region values from ecs-env repo; blue/green namespaces are separate Application objects.

Chart values (excerpt)

image:
  repository: ghcr.io/connectsoft/ecs/registry
  digest: "sha256:abcd..."        # immutable
resources:
  requests: { cpu: "250m", memory: "512Mi" }
  limits:   { cpu: "1500m", memory: "1.5Gi" }
hpa:
  enabled: true
  targetInFlight: 400
pdb:
  minAvailable: "70%"
otel:
  enabled: true
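
Argo CD Application (sketch): one possible shape for wiring the layering above as a multi-source Application, where the chart comes from ecs-charts and env/region values come from ecs-env. Repo hosting, project, paths, and branch names are assumptions:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata: { name: registry-prod-green-weu, namespace: argocd }
spec:
  project: ecs
  sources:
    - repoURL: https://github.com/connectsoft/ecs-charts
      path: charts/registry
      targetRevision: main
      helm:
        valueFiles:
          - values.yaml
          - $env/prod/weu/registry/values.prod.yaml      # env overlay from ecs-env
          - $env/prod/weu/registry/values.region.yaml    # region overlay from ecs-env
    - repoURL: https://github.com/connectsoft/ecs-env
      targetRevision: main
      ref: env
  destination:
    server: https://kubernetes.default.svc
    namespace: prod-ecs-green
  syncPolicy:
    automated: { prune: true, selfHeal: true }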

Azure IaC with Bicep (ecs-infra)

  • Modules per domain: network, aks, acr, afd, asb, redis, kv, monitor.
  • Stacks per environment/region: prod-weu, prod-neu, stg-weu, etc.

Bicep structure

/bicep
  /modules
    aks.bicep
    acr.bicep
    afd.bicep
    asb.bicep
    redis.bicep
    kv.bicep
  /stacks
    prod-weu.bicep
    prod-neu.bicep
    stg-weu.bicep

Snippet: AKS with Workload Identity + CSI KeyVault

module aks 'modules/aks.bicep' = {
  name: 'aks-weu'
  params: {
    clusterName: 'ecs-aks-weu'
    location: 'westeurope'
    workloadIdentity: true
    nodePools: [
      { name: 'system', vmSize: 'Standard_D4s_v5', mode: 'System', zones: [1,2,3] }
      { name: 'services', vmSize: 'Standard_D8s_v5', count: 6, zones: [1,2,3] }
    ]
    addons: { omsAgent: true, keyVaultCsi: true }
  }
}

Policy as Code

  • Bicep lint + OPA/Conftest gate: deny public IPs on private services, enforce HTTPS, AKS RBAC, diagnostic settings, encryption.

Drift & Cost

  • Nightly what-if with approval gates.
  • Infracost checks on PR for FinOps visibility.
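
Sketch of the nightly what-if job, assuming a resource-group-scoped stack and OIDC federated credentials configured for the repo; the resource group name is an assumption:

name: infra-drift-check
on:
  schedule:
    - cron: "0 2 * * *"     # nightly
  workflow_dispatch: {}
jobs:
  what-if:
    runs-on: ubuntu-22.04
    permissions: { id-token: write, contents: read }
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - name: What-if on prod-weu stack
        run: |
          az deployment group what-if \
            --resource-group rg-ecs-prod-weu \
            --template-file bicep/stacks/prod-weu.bicep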

Pulumi Option (app/edge infra or multi-cloud)

When multi-cloud or richer composition is needed, Pulumi TS/Go program can orchestrate:

  • AFD routes, DNS, CDN origins,
  • Argo CD apps (via Kubernetes provider),
  • Cross-cloud Secrets and KMS bindings.

Pulumi TS excerpt

import * as k8s from "@pulumi/kubernetes";
const appNs = new k8s.core.v1.Namespace("prod-ecs-green", {
  metadata: { name: "prod-ecs-green" },
});
// helm.v3.Release resolves OCI charts directly from the registry path
new k8s.helm.v3.Release("registry", {
  chart: "oci://ghcr.io/connectsoft/ecs-charts/registry",
  namespace: appNs.metadata.name,
  values: { image: { digest: process.env.IMAGE_DIGEST } },
});

Choice: Bicep for Azure account-level infra; Pulumi for higher-level orchestration (optional).


Environment Promotion Policies (gates)

Policy Integration

  • Stg→Prod promotions require PDP decide(operation=deploy.promote):
    • SoD enforced (author ≠ approver),
    • Risk score below threshold or 2 approvals,
    • Change window active for prod,
    • Evidence: test results, SLO canary checks green.
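
For illustration only, the promotion gate's PDP call might carry a payload shaped like the following; field names and placeholders are assumptions, and the actual contract is owned by the Policy Engine:

# Hypothetical decide(operation=deploy.promote) exchange
request:
  operation: deploy.promote
  subject: { user: release-bot, onBehalfOf: "<author>" }
  resource: { service: registry, from: staging, to: prod, digest: "sha256:abcd..." }
  context:
    author: "<author>"
    approvers: ["<approver>"]          # SoD: must differ from author
    riskScore: 23
    changeWindowActive: true
    evidence: { e2e: pass, sloCanary: pass }
response:
  decision: permit
  obligations:
    - requireSecondApproval: false     # would flip to true above the risk threshold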

Automation

  • PR from bot updates ecs-env staging → triggers Argo Rollouts canary.
  • Upon success, bot opens prod PR with:
    • Helm value digest pin,
    • Rollout strategy set (weights, analysis templates),
    • Policy check job posts PDP outcome in PR status.

Required checks on prod PR

  • E2E suite ✅
  • SLO canary analysis ✅
  • Policy PDP decision ✅
  • Security attestations (cosign verify, SBOM present) ✅
  • Manual approval (Approver role) ✅

Change freeze

  • prod branch protected by freeze label; PDP enforces schedule; override requires break-glass with incident ticket.

Preview Environments

  • On PR, create ephemeral namespace pr-<nr> with limited quotas:
    • Deploy changed services + Studio UI, seeded with masked sample data.
    • TTL controller cleans after merge/close.
    • URL: https://pr-<nr>.dev.ecs.connectsoft.io.
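
Sketch of the ephemeral namespace scaffold; the quota figures, label, and TTL annotation key are assumptions (the TTL controller would read the annotation):

apiVersion: v1
kind: Namespace
metadata:
  name: pr-<nr>                                    # placeholder filled in by the PR automation
  labels: { ecs.connectsoft.io/preview: "true" }
  annotations: { ecs.connectsoft.io/ttl: "72h" }
---
apiVersion: v1
kind: ResourceQuota
metadata: { name: preview-quota, namespace: pr-<nr> }
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    pods: "30"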

Rollback Strategy

  • Service rollback: revert env manifest to previous digest; Argo CD sync; keep blue namespace warm.
  • Config rollback: flip alias to previous snapshot (constant-time).
  • Automated: if canary SLOs fail, Rollouts auto-rollback and block prod PR.
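
A compact Argo Rollouts sketch of the canary-with-analysis behavior described above; the Prometheus address, query, and success condition are assumptions:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: slo-error-ratio, namespace: prod-ecs-green }
spec:
  metrics:
    - name: error-ratio
      interval: 1m
      failureLimit: 1
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="ecs-registry",code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{app="ecs-registry"}[5m]))
---
# Rollout canary strategy excerpt referencing the template
strategy:
  canary:
    steps:
      - setWeight: 5
      - analysis: { templates: [{ templateName: slo-error-ratio }] }
      - setWeight: 25
      - pause: { duration: 10m }
      - setWeight: 50
      - pause: { duration: 10m }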

Secrets & Credentials in CI/CD

  • CI uses OIDC workload identity to:
    • obtain short-lived push token for ACR,
    • request cosign keyless cert,
    • fetch non-production secrets from KV via federated identity.
  • No long-lived secrets in CI; repo has secret scanning enforced.

Observability of Delivery

  • Pipelines emit OTEL spans: ecs.ci.build, ecs.ci.scan, ecs.cd.sync, ecs.cd.promote.
  • Metrics: build duration, success rate, MTTR for rollback, mean lead time.
  • Dashboards: Delivery (DORA) + Supply Chain (signature verification %, SBOM coverage).

Acceptance Criteria (engineering hand-off)

  • Repos created with scaffolds & templates; path-filtered CI in place.
  • CI templates produce signed images, SBOM, SLSA provenance, push by digest.
  • Kyverno/Gatekeeper admission verifying signatures & digest pinning.
  • ecs-charts library and per-service charts published; ecs-env Argo apps wired for dev/stg/prod (blue/green).
  • Bicep modules and environment stacks committed; what-if + Conftest on PR; Infracost enabled.
  • Promotion gates integrated with PDP decisions, SLO canary checks, and manual approvals.
  • Preview environments auto-spawn for PRs; TTL cleanup automated.
  • Runbooks for failed canary, admission rejection, rollback, infra drift.

Solution Architect Notes

  • Keep digest-only deploys non-negotiable; tags for humans, digests for machines.
  • Prefer keyless signing to reduce key mgmt overhead; provide KMS fallback for regulated tenants.
  • Centralize pipeline templates; services import them with minimal YAML.
  • Consider Argo CD Image Updater only for lower environments; prod should remain PR-driven with explicit digests.
  • Reassess scan noise quarterly; failing the build on high CVSS after a grace period keeps the supply chain healthy.

Operational Runbooks — on-call, incident playbooks, hotfix flow, config rollback drill, RTO/RPO

On-Call Model

Coverage & Roles

  • 24×7 follow-the-sun rotations per geo (AMER / EMEA / APAC) with a global Incident Commander (IC) pool.
  • Roles per incident:
    • Incident Commander (IC) — owns timeline/decisions, delegates.
    • Ops Lead — drives technical mitigation (AKS, Envoy, Redis, CRDB).
    • Service SME — Registry/PDP/Orchestrator/Adapters/Studio.
    • Comms Lead — customer/internal comms; status page.
    • Scribe — live notes, artifacts, evidence pack.
    • Security On-Call — joins if policy/access/secrets involved.

Handoff Checklists

Start of Shift

  • Open NOC dashboard, confirm SLO tiles green for Resolve/Publish/PDP.
  • Verify Pager healthy; run “test page” to self (silent).
  • Confirm last handoff notes + open actions.
  • Review scheduled change windows and release plans.

End of Shift

  • Update handoff doc with:
    • Open incidents, mitigations, remaining risk.
    • Any temporary overrides (rate limits, feature flags).
    • Pending hotfix/canary states.

Severity & Response Targets

| SEV | Definition | Examples | Target TTA | Target TTR |
| --- | --- | --- | --- | --- |
| SEV-1 | Major outage / critical SLO breach | Resolve 5xx > 5%, region down, data integrity risk | ≤ 5m | ≤ 60m |
| SEV-2 | Partial degradation / single-tenant severe | 429 rate for top tenant > 10%, propagation lag p95 > 15s | ≤ 10m | ≤ 4h |
| SEV-3 | Minor impact / at risk | Canary failing, rising error trend, DLQ growth | ≤ 30m | ≤ 24h |

IC may escalate/downgrade; Security events can be SEV-1 regardless of scope.


Standard Incident Playbook (applies to all)

  1. Declare severity, assign roles, open incident channel & ticket.
  2. Stabilize (protect SLOs): enable rate-limits, switch to long-poll, open breakers per-tenant before global, serve stale data if allowed.
  3. Diagnose via run-of-show:
    • Check golden signals dashboards (error %, p95/p99 latency, throughput).
    • Jump to exemplar traces from SLO tiles.
    • Inspect recent deploys (Argo/PR timestamps), feature flags, policy changes.
  4. Mitigate using scenario runbooks below.
  5. Communicate:
    • Internal: every 15m or on change of state.
    • External: initial notice ≤ 15m for SEV-1/2; updates every 30–60m.
  6. Recover to steady state; back out temporary overrides.
  7. Close with preliminary impact, customer list, follow-ups.
  8. Post-Incident Review within 3–5 business days (template at end).

Scenario Runbooks (diagnostics ➜ actions ➜ exit)

RB-S1 Resolve Latency / Error Spike

Trigger: p99 Resolve > 400 ms or error% > 1% (5m).

Diagnostics

  • Dashboards: Resolve, Redis, CRDB, Gateway.
  • Check cache_hit_ratio, coalesce_ratio, waiter_pool_size.
  • Compare last deploy & policy changes.

Actions

  • Raise in-flight HPA target temporarily (+25–50% pods).
  • If Redis p95>3 ms or evictions>0: scale shards; enable SWR 2s.
  • If hot tenant: lower tenant RPS; increase poll interval via policy obligation.
  • If WS unstable: downgrade to long-poll.

Exit

  • p99 ≤ 250–300 ms for ≥ 10m; error% ≤ 0.2%; roll back any temporary throttles gradually.
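
As a sketch, the RB-S1 trigger can be encoded as Prometheus alert rules wired to paging; metric names and label conventions are assumptions and may differ from the real SLO recording rules:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: ecs-resolve-slo, namespace: monitoring }
spec:
  groups:
    - name: ecs-resolve
      rules:
        - alert: ResolveLatencyP99High
          expr: histogram_quantile(0.99, sum(rate(ecs_resolve_duration_seconds_bucket[5m])) by (le)) > 0.4
          for: 5m
          labels: { severity: page, runbook: RB-S1 }
          annotations: { summary: "Resolve p99 above 400 ms for 5m" }
        - alert: ResolveErrorRateHigh
          expr: sum(rate(ecs_resolve_errors_total[5m])) / sum(rate(ecs_resolve_requests_total[5m])) > 0.01
          for: 5m
          labels: { severity: page, runbook: RB-S1 }
          annotations: { summary: "Resolve error ratio above 1% for 5m" }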

RB-S2 Propagation Lag / DLQ Growth

Trigger: propagation p95>8 s (15m) or DLQ>100 (10m).

Diagnostics

  • Orchestrator consumer lag, bus quotas, adapter_throttle_total.

Actions

  • Scale KEDA consumers; increase batch sizes.
  • Replay DLQ after confirming transient errors.
  • If adapter throttled (Azure/AWS): lower batch per binding, increase backoff to 60 s.
  • Limit publish waves (canary ➜ regional).

Exit

  • Lag p95 ≤ 5 s; DLQ drained to steady baseline; no re-enqueue loops.
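
For reference, the consumer scaling mentioned above could be expressed as a KEDA ScaledObject; the queue name, thresholds, and TriggerAuthentication name are assumptions:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: orchestrator-consumer, namespace: prod-ecs-green }
spec:
  scaleTargetRef: { name: ecs-orchestrator }
  minReplicaCount: 2
  maxReplicaCount: 40
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: config-changed
        messageCount: "500"          # scale out when backlog per replica exceeds this
      authenticationRef: { name: asb-workload-identity }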

RB-S3 PDP / AuthZ Degradation

Trigger: PDP decision p99>20 ms, timeouts at edge, spike in denied decisions.

Diagnostics

  • PDP latency and cache hits, policy bundle updates, gateway ext_authz errors.

Actions

  • Scale PDP; warm cache by preloading effective policies.
  • If PDP unreachable: edge uses last-good TTL ≤ 60 s; deny admin ops.
  • Roll back recent policy bundle if regression suspected.

Exit

  • p95 ≤ 5 ms; no authn/z timeouts; revert temp TTLs.

RB-S4 Region Impairment

Trigger: AFD probes failing for region; rising cross-zone errors.

Actions

  • Fail traffic to healthy region via AFD; pause deploys.
  • Switch affected tenants to read-only if their CRDB home region is down.
  • Consider temporary re-pin of tenant data (follow CRDB playbook).

Exit

  • Region restored; backroute to local; resume deploys after smoke tests.

RB-S5 Adapter / Provider Throttling

Trigger: adapter_throttle_total>0, provider 429s.

Actions

  • Reduce per-binding concurrency & batch size; exponential backoff (up to 60 s).
  • Mark binding degraded; notify tenant; queue ops for replay post-clear.

Exit

  • No throttles for 30m; backlog cleared; re-enable normal concurrency.

RB-S6 Security Signal (Break-Glass / Suspicious Access)

Trigger: break-glass token used; abnormal policy denials; spike in auth failures.

Actions

  • IC includes Security On-Call; elevate to SEV-1 if needed.
  • Freeze non-essential changes; enable stricter PDP deny-by-default for admin ops.
  • Export evidence pack; verify audit hash chain; rotate credentials if necessary.

Exit

  • Root cause identified, evidence packaged, access reviewed & restored.

Hotfix Flow (code defects)

When to hotfix: Reproducible bug in latest release that materially impacts SLOs and cannot be mitigated by config/policy.

Steps

  1. Branch from last prod tag: release/vX.Y.Z-hotfix-N.
  2. Minimal, targeted change + unit tests; bump patch version.
  3. CI: build ➜ SBOM ➜ sign ➜ push by digest.
  4. Stage to staging with canary (5% ➜ 25% ➜ 50%); auto SLO analysis gates.
  5. Change window: If closed, IC invokes PDP override (audited).
  6. Promote to prod-green via Argo Rollouts canary; bake 10–15m.
  7. Flip traffic to green; keep blue warm for immediate rollback.
  8. Open follow-up PR to main with the same patch; prevent drift.

Backout

  • Rollouts auto-rollback on SLO gate failure.
  • Manual: set weights to 0/100 (green/blue), revert env manifest to previous digest.

Config Rollback Drill (blue/green alias)

Prereqs

  • Current alias prod-current ➜ Blue snapshot v1.8.3.
  • Candidate Green v1.9.0 already validated.

Drill (quarterly, scripted)

  1. Announce start in #ops; open mock incident ticket.
  2. Flip alias: ecsctl alias set --tenant t-123 --env prod --alias prod-current --version v1.9.0
  3. Verify:
    • Registry emits ConfigPublished; Orchestrator fan-out started.
    • Sample SDK resolves show new ETag; propagation p95 ≤ 5 s.
  4. Synthetic checks: canary services run health probes against critical keys.
  5. Rollback:
    • ecsctl alias set --tenant t-123 --env prod --alias prod-current --version v1.8.3
    • Confirm re-propagation; no errors.
  6. Record RTO for both directions; attach evidence (events, traces, timings).
  7. Reset to production state; close ticket with drill metrics.

Success criteria

  • Cutover and rollback each ≤ 2 minutes end-to-end; zero 5xx spikes; no policy denials.

RTO / RPO Objectives

| Scenario | Service/Data | RTO | RPO | Notes |
| --- | --- | --- | --- | --- |
| Single pod/node loss | All stateless | < 2m | 0 | HPA + PDB recover |
| AZ loss | All | < 5m | 0 | Multi-AZ; zone spread |
| Region impairment | Resolve (read) | < 5m | 0 | AFD failover; long-poll reconnect |
| Region impairment | Writes (tenant home region pinned) | < 15m | ≤ 5m* | *If temporary re-pin of tenant rows |
| Redis shard failover | Cache | < 30s | 0 | Serve stale (SWR), bypass Redis |
| PDP degraded | Decisions | < 5m | 0 | Last-good cache ≤ 60s |
| CRDB restore (logical) | Snapshots/audit | ≤ 60m | ≤ 5m | Point-in-time + hourly export |
| Object store loss (warm tier) | Audit exports | ≤ 4h | ≤ 24h | Rebuild from hot + backups |

Communications Templates

Initial External (SEV-1/2)

We’re investigating degraded performance impacting configuration reads for some tenants starting at HH:MM UTC. Mitigations are in progress. Next update in 30 minutes. Reference: INC-####.

Update

Mitigation active (traffic shifted to Region X). Metrics improving; monitoring continues. ETA to resolution N minutes.

Resolved

Incident INC-#### resolved at HH:MM UTC. Root cause: <one-line summary>. We’ll publish a full review within 5 business days.


Post-Incident Review (PIR) Template

  • Summary: what/when/who/impact scope.
  • Customer impact: symptoms, duration, tenants affected.
  • Timeline: detection ➜ declare ➜ mitigation ➜ recovery.
  • Root cause analysis: technical + organizational.
  • What worked / didn’t: detection, runbooks, tooling.
  • Action items: owners, due dates (prevent/mitigate/detect).
  • Evidence pack: dashboards, traces, logs, audit export.
  • Policy updates: SLO/SLA, change windows, guardrails.

Toolbelt (quick refs)

  • Rollouts & traffic
    • kubectl argo rollouts get rollout registry -n prod-ecs-green
    • kubectl argo rollouts promote registry -n prod-ecs-green
  • Envoy/Gateway
    • kubectl get httproute -A | grep registry
    • Adjust weights via Helm values PR or emergency patch (IC approval required).
  • ECSCTL
    • ecsctl alias set … (cutover/rollback)
    • ecsctl refresh broadcast --tenant t-123 --env prod --prefix features/
    • ecsctl audit export --tenant t-123 --from … --to …
  • KEDA & Bus
    • kubectl get scaledobject -A
    • az servicebus queue show … --query messageCount

All emergency patches must be reconciled back to Git within the incident window.


Drills & Readiness

  • Monthly: Config rollback drill (per major tenant).
  • Quarterly: Region failover gameday; Redis shard failover; adapter throttling simulation.
  • Semi-annual: Full disaster recovery restore test (CRDB PITR + audit verify).
  • Track drill RTO/RPO and MTTR trends on Ops dashboard.

Acceptance Criteria (engineering hand-off)

  • On-call rota, playbooks, and comms templates published in the Ops runbook repo.
  • Pager integration wired to SLO burn-rate and key symptom alerts.
  • “Big Red Button” actions scripted: alias flip, WS➜poll downgrade, tenant rate-limit override.
  • Drill automation scripts (ecsctl, Helm helpers) committed and documented.
  • PIR template enforced; incidents cannot be closed without action items and owners.
  • RTO/RPO objectives encoded in DR test plans with last measured values.

Solution Architect Notes

  • Favor per-tenant containment (rate limits, breakers, bindings) to preserve global SLOs.
  • Keep mitigation > diagnosis bias in the first 10 minutes; restore service, then dig deep.
  • Continue enriching runbooks with direct links to dashboards and ready-to-run commands for your platform.
  • Measure runbook MTTA (time-to-action) during drills; shorten with automation and safe defaults.

Business Continuity & DR — geo-replication, failover orchestration, drills, compliance evidence

Objectives

Design and prove a business-continuity strategy that keeps ECS available through AZ/region failures while preserving data integrity, tenant isolation, and regulatory evidence. Define geo-replication, orchestrated failover/failback, regular drills, and auditable proof of meeting RTO/RPO.


Continuity Posture at a Glance

| Layer | Strategy | RTO | RPO | Notes |
| --- | --- | --- | --- | --- |
| API/Gateway | Active-active across ≥2 regions (AFD + Envoy) | ≤ 5 min | 0 | Health-based routing; sticky to nearest allowed region |
| Config Registry data (CRDB) | Multi-region cluster, REGIONAL BY ROW (tenant home region) + PITR | ≤ 15 min (with re-pin) | ≤ 5 min* | *0 if home region up; ≤ 5 min on temporary re-pin |
| Cache (Redis) | Regional clusters, cache as disposable (Ent: geo-replication) | ≤ 30 s | 0 | Serve stale (SWR) on shard fail; warm on failover |
| Events (Service Bus) | Premium namespaces per geo + DR alias pairing | ≤ 10 min | ≤ 1 min | Alias flip; DLQ preserved |
| Studio/Static | Multi-origin CDN (blue/green buckets) | ≤ 5 min | 0 | Immutable assets |
| Audit & Exports | Hot in CRDB + nightly Parquet to regional object store | ≤ 60 min | ≤ 24 h (cold) | Hot data meets app RPO; archive meets compliance |
| Policy Bundles | Multi-region PDP cache + signed bundles | ≤ 5 min | 0 | Edge “last-good” TTL ≤ 60s |

Geo-Replication Topology

flowchart LR
  AFD[Azure Front Door + WAF] --> RG1[Envoy/Gateway - Region A]
  AFD --> RG2[Envoy/Gateway - Region B]

  subgraph Region A
    REG1[Registry]-->CRDB[(CockroachDB)]
    PDP1[PDP]
    ORC1[Orchestrator]-->ASB1[(Service Bus A)]
    WS1[WS/LP Bridge]
    REDIS1[(Redis A)]
  end

  subgraph Region B
    REG2[Registry]-->CRDB
    PDP2[PDP]
    ORC2[Orchestrator]-->ASB2[(Service Bus B)]
    WS2[WS/LP Bridge]
    REDIS2[(Redis B)]
  end

  ASB1 <--DR alias--> ASB2
  CRDB ---|Multi-Region Replication| CRDB
Hold "Alt" / "Option" to enable pan & zoom

Key design points

  • Stateless services are active in all regions; sessionless by design.
  • CRDB hosts a single logical cluster spanning regions; tenant rows are homed to a region for write locality; reads are global.
  • Redis is regional. On failover, caches are rebuilt; no cross-region consistency needed.
  • Service Bus uses Geo-DR alias; producers/consumers bind via alias, not direct namespace name.

Failover Orchestration (Region)

Decision Ladder

  1. Detect: AFD health probes red OR SLO burn-rate breach OR operator declares.
  2. Decide: IC invokes RegionFailoverPlan (automated policy: partial or full).
  3. Act: Route, Data, Events steps (below) in order.
  4. Verify: SLOs green, data path healthy, backlog drained.
  5. Communicate: status page + tenant comms.
  6. Recover/Failback: after root cause fixed and consistency verified.

Orchestration Steps (automated runbook)

sequenceDiagram
  participant Mon as Monitor/SLO
  participant IC as Incident Cmd
  participant AFD as Azure Front Door
  participant GW as Envoy/Gateway
  participant BUS as Service Bus DR
  participant REG as Registry
  participant PDP as Policy
  participant CRDB as CockroachDB

  Mon-->>IC: Region A unhealthy / SLO burn
  IC->>AFD: Disable Region A origins; 100% to Region B
  IC->>BUS: Flip Geo-DR alias to Namespace B
  IC->>REG: Set tenant mode = read-only for homeRegion=A
  alt Extended outage > X min (policy)
    IC->>CRDB: Re-pin affected tenants to Region B (scripted)
    REG-->>PDP: Emit TenantRehomed obligations
  end
  IC->>REG: Trigger warm-up (keyheads) + Refresh broadcast
  Mon-->>IC: SLOs recovered
  IC->>Comms: Resolved update; start failback plan (separate window)
Hold "Alt" / "Option" to enable pan & zoom

Controls

  • Read-only mode: protects writes for tenants whose home region is down; SDKs continue reads via long-poll.
  • Tenant re-pin (optional): migrate leaseholders/zone configs for the tenant partitions to Region B; audited and reversible.

Data Protection & Recovery

  • Backups: CRDB full weekly + incremental hourly; PITR ≥ 7 days.
  • Restore drills: Quarterly logical restores into staging namespace; verify checksums and lineage hashes.
  • Audit chain: hash-chained audit events verified post-restore to prove integrity.
  • Object store: nightly Parquet exports (signed manifest for Ent); regional buckets aligned to residency.

Eventing Continuity (Service Bus)

  • Producer/consumer endpoints use DR alias name.
  • Failover flips alias from Namespace A → B; consumers reconnect automatically.
  • Ordering: not guaranteed across flip; idempotency keys protect replays.
  • DLQ in active namespace is exported pre-flip and re-enqueued post-flip.

Redis Strategy

  • Treat as ephemeral: no cross-region state transfer required.
  • On failover:
    • Serve stale for ≤ 2 s (SWR) during shard promotions.
    • Fire pre-warm job: touch common keyheads (per tenant/app).
    • Observe hit ratio; scale shards if p95 latency > 3 ms or evictions > 0.

Failback (Return to Steady State)

  1. Health gate: region green ≥ 60 min; root cause fixed.
  2. Data: if tenants re-pinned, choose stay or migrate back (off-hours).
  3. Traffic: gradually re-enable AFD weight to original distribution.
  4. Events: DR alias back to primary namespace; drain backlog; compare counts.
  5. Post-ops: remove temporary overrides; finalize incident & PIR.

DR Exercises & Drill Catalog

| Drill | Scope | Frequency | Success Criteria |
| --- | --- | --- | --- |
| AZ Evacuation | Evict a zone; validate PDB, no SLO breach | Quarterly | p95 latency steady; 0 error spikes |
| Region Failover | Full traffic shift + Bus alias flip | Quarterly | RTO ≤ 5 min; propagation p95 ≤ 5 s |
| Tenant Re-Pin | Move sample tenants’ home region | Semi-annual | RPO ≤ 5 min; no cross-tenant impact |
| CRDB PITR Restore | Point-in-time restore of config set | Semi-annual | Data matches lineage hash; audit chain verifies |
| Redis Shard Loss | Kill primary; observe recovery | Quarterly | RTO ≤ 30 s; SWR served |
| Policy/PDP Degrade | Simulate outage | Quarterly | Edge uses last-good; admin ops denied |

Automation

  • ecsctl dr plan <region> prints executable plan with guardrails.
  • Chaos Mesh/Envoy faults wired to rehearsal scripts.
  • Drill evidence exported automatically (metrics, traces, manifests).
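
Illustrative only: a hypothetical ecsctl dr plan weu output, mirroring the failover steps in RB-DR1 below; the plan format itself is an assumption:

plan: region-failover
from: weu
to: neu
guardrails: { dryRun: true, twoPersonRule: true }
steps:
  - afd:     { action: disable-origin-group, region: weu }
  - bus:     { action: dr-flip, alias: ecs-events, to: neu }
  - tenants: { action: set-mode, homeRegion: weu, mode: read-only }
  - repin:   { action: repin-tenants, condition: "outage exceeds policy threshold" }
  - cache:   { action: warmup, region: neu, tenantSample: 100 }
  - verify:  { checks: [resolve-p95, error-rate, propagation-lag, pdp-latency] }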

Runbooks (excerpts)

RB-DR1 Region Failover (operator-driven)

  1. Declare SEV-1, assign roles.
  2. afdctl region disable --name weu (disables the AFD origin group), then ecsctl bus dr-flip --alias ecs-events --to neu
  3. ecsctl tenant set-mode --home weu --mode read-only
  4. If outage > X min: ecsctl tenant repin --tenant t-*
  5. ecsctl cache warmup --region neu --tenant-sample 100
  6. Verify: Resolve p95, error%, propagation, PDP latency.
  7. Comms: status page update; every 30m until resolved.

RB-DR2 Failback

  1. Confirm region healthy 60m; run smoke tests.
  2. If re-pinned: schedule change window; repin back or keep.
  3. Flip AFD back; bus alias flip; validate metrics.
  4. Close incident; attach evidence pack.

Compliance Evidence (SOC 2 / ISO 27001)

Artifacts produced automatically per drill or real event

  • DR Runbook Execution Log: timestamped steps, operator IDs, commands, and results.
  • SLO Evidence: before/after p95/p99, error %, burn-rate graphs with exemplars.
  • Data Integrity Proof: audit hash-chain verification report; CRDB backup manifest & PITR point.
  • Change Records: AFD config deltas, Bus alias flips, tenant mode changes (all audited).
  • Post-Incident Review: root cause, corrective actions, owners & due dates.

Controls mapping

  • A.17 (ISO 27001) / CC7.4 (SOC 2): BC/DR plan, tested with evidence.
  • A.12 / CC3.x: Backups & restorations validated and logged.
  • A.5 / CC6.x: Roles & responsibilities during incidents clearly defined.

Observability for BC/DR

  • Spans: ecs.dr.failover.plan, ecs.dr.failover.execute, ecs.dr.failback.execute.
  • Metrics: dr_failover_duration_ms, dr_events_replayed_total, tenants_rehomed_total, afd_origin_healthy.
  • Alerts:
    • Region health red and p95 > 300 ms → Page.
    • DR alias mismatch vs intended → Warn.
    • audit_chain_verify_failures_total > 0 after restore → Page.

Guardrails & Safety Checks

  • Dry-run for all DR commands (diff/preview).
  • Two-person rule for tenant re-pin and bus alias flips.
  • Change windows enforced for failback; failover exempt under SEV-1.
  • All actions idempotent and audited with correlation IDs.

Acceptance Criteria (engineering hand-off)

  • AFD, Envoy, Service Bus aliasing, and CRDB multi-region configured per topology.
  • ecsctl DR subcommands implemented (disable/enable region, bus DR flip, tenant mode, re-pin, cache warmup) with dry-run.
  • Runbooks RB-DR1/RB-DR2 published; drill automation in CI (staging).
  • Quarterly region failover drill executed with recorded RTO/RPO; evidence pack generated and stored.
  • CRDB PITR enabled; restore procedure validated and documented.
  • Alerts and dashboards for DR state live; incident templates linked to BC/DR plan.

Solution Architect Notes

  • Keep tenant re-pin rare and audited; prefer read-only until primary region stabilizes.
  • Treat Redis as rebuildable; don’t pay cross-region cache tax unless a clear SLA needs it.
  • For high-regulation tenants, offer enhanced continuity (dual-home with stricter quorum) as an Enterprise add-on.
  • Rehearse communications as much as technology—clear, timely updates reduce incident impact for customers.

Readiness & Handover — quality gates, checklists, PRR, cutover plan, training notes for Eng/DevOps/Support

Objectives

Ensure ECS can be safely operated in production by codifying quality gates, pre-flight checklists, a Production Readiness Review (PRR), the cutover run-of-show, and role-specific training for Engineering, DevOps/SRE, and Support. Outputs are actionable, auditable, and aligned to ConnectSoft standards (Security-First, Observability-First, Clean Architecture, SaaS multi-tenant).


Quality Gates (build → deploy → operate)

| Gate | Purpose | Evidence / Automation | Blocker if Fails |
| --- | --- | --- | --- |
| Supply Chain | Provenance & integrity | Cosign signatures verified in admission; SBOM attached; image pulled by digest | ✅ |
| Security | Hardening & secrets | SAST/dep scan clean or accepted; container scan: no High/Critical; mTLS on; secret refs enforced | ✅ |
| Contracts | API/gRPC stability | Backward-compat tests; schema compatibility; policy DSL parser tests | ✅ |
| Performance | Meet p95/p99 targets | Resolve p99 ≤ targets; publish→refresh p95 ≤ SLO; capacity headroom ≥ 2× | ✅ |
| Resiliency | Degrade gracefully | Retry/circuit/bulkhead tests; chaos suite C-1…C-7 green | ✅ |
| Observability | Trace/metrics/logs ready | OTEL spans present; dashboards published; SLO alerts wired; exemplars linkable | ✅ |
| Compliance | Audit, PII, SoD | Audit hash chains; SoD enforced; retention set; evidence pack job runs | ✅ |
| FinOps | Cost guardrails | Usage meters emitting; budgets configured; anomaly jobs active | ⚠ (warn) |
| Ops Runbooks | Operability | Incident & DR runbooks in repo; drills executed within last 90d | ✅ |
| Docs & Support | Handover complete | Knowledge base articles; FAQ; escalation tree | ✅ |

All ✅ gates are mandatory for promotion from staging → production.


Production Readiness Review (PRR)

Owner: Solution Architect (chair) + SRE + Security + Product. Format: single session with artifact walkthrough; outcomes recorded.

PRR Checklist (submit 48h before meeting)

  • Architecture & Risks
    • Service diagrams current; ADRs for key decisions
    • Threat model updated; mitigations tracked
  • Security & Compliance
    • SBOM, signatures, provenance attached to images
    • Pen-test/DAST findings triaged (no High open)
    • Audit pipeline verified; daily manifest signing (if Ent)
    • PII posture reviewed; DSAR playbook linked
  • Performance & Scale
    • Load test report (p99, throughput, coalescing ratio)
    • HPA/KEDA policies reviewed; surge plan validated
    • Redis sizing workbook & eviction alerts validated
  • Resiliency & DR
    • Chaos results (C-1..C-7) with pass/fail & action items
    • Region failover drill RTO/RPO evidence
  • Observability
    • SLOs encoded; burn-rate alerts live
    • Dashboards: SRE, Tenant Health, Security, Adapters
    • Log redaction tests ✅
  • Operations
    • On-call rota & paging configured; escalation tree
    • Runbooks: latency spike, DLQ growth, authZ degrade
    • Hotfix flow rehearsed; rollback scripts (ecsctl) present
  • Change Management
    • Policy packs reviewed (change windows, approvals)
    • CAB sign-off (if required)
  • Documentation & Support
    • Admin/tenant guides, SDK quickstarts, Studio user guide
    • Support KB: top 20 issues + macros
  • Go/No-Go
    • Go-criteria met (see table)
    • Backout plan validated

Go / No-Go Criteria

| Area | Go Criteria | Evidence |
| --- | --- | --- |
| Availability | Resolve availability ≥ 99.95% over the staging week | SLO metric export |
| Latency | Resolve p99 ≤ 150 ms (hit) / ≤ 400 ms (miss) | Perf report + dashboards |
| Security | 0 exploitable High/Critical vulns; mTLS enabled | Scan reports, config |
| DR | Region failover RTO ≤ 5 min; PITR demo ≤ 60 min | Drill logs + evidence pack |
| Obs | All golden dashboards present; alerts firing in test | Screenshots/links |
| Ops | On-call & comms templates finalized | Pager config + docs |
| Cost | Budgets & alerts active; forecast within plan | FinOps dashboard |

Pre-Flight Checklists (by domain)

Platform (once per region)

  • AFD origins healthy; WAF policies applied
  • Envoy routes with ext_authz, RLS, traffic splits
  • AKS nodepools spread across 3 AZs; PDBs applied
  • Argo CD and Rollouts healthy; image digests pinned
  • KEDA & HPA metrics sources ready (Prometheus)

Data & Storage

  • CockroachDB multi-region up; leaseholder locality by tenant
  • PITR window configured; backup jobs green
  • Redis cluster shards steady; no evictions; latency p95 < 3 ms

Eventing & Adapters

  • Service Bus DR alias validated (flip test in staging)
  • DLQs empty; replay tested
  • Adapter credentials & quotas verified; watch cursors in sync

Security

  • JWKS rotation tested; last rotation < 90d
  • Secret ref resolution (kvref://) e2e test passes
  • Admission policies (cosign verify, non-root) enforced

Observability

  • OTEL collector pipelines tail-sampling active
  • SLO burn-rate alerts mapped to PagerDuty
  • Audit chain verification job succeeded in last 24h

Cutover Plan (run-of-show)

Timeline (example anchors)

  • T-7d: Staging canary passes SLO gates; PRR complete; tenant comms drafted
  • T-1d: Freeze non-essential changes; final data validation; backup checkpoint
  • T-0: Production green namespace deployed; 5% → 25% → 50% canary with analysis
  • T+30–60m: Flip traffic 100% to green; keep blue hot for rollback (2–24h window)
  • T+24h: Post-cutover review; decommission blue after sign-off

Roles

  • IC (lead), Ops Lead, Service SME, Comms Lead, Scribe

Steps (T-0 detailed)

  1. Announce start; open bridge & incident room (change record)
  2. Verify prerequisites (see Pre-Flight) and go/no-go check
  3. Argo Rollouts canary start; watch SLO analysis; hold at 50% for 10–15m
  4. Promote to 100%; observe p95/p99, error %, propagation lag
  5. Trigger config canary on limited tenants (prod-next alias) if planned
  6. Flip alias to prod-current; validate refresh events; sample SDK resolve shows new ETag
  7. Hypercare window: monitor, keep blue ready; publish customer update
  8. Close with initial success metrics; schedule 24h review

Backout (any time)

  • Service: set weights to 0/100 (green/blue), sync prev digest
  • Config: revert alias pointer to last known good version
  • Comms: issue backout notice; continue root-cause work
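
The service weight flip can be pictured as a Gateway API HTTPRoute change; service and namespace names are assumptions, and cross-namespace backendRefs assume ReferenceGrants are in place:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata: { name: registry, namespace: ecs-gateway }
spec:
  parentRefs: [{ name: ecs-gateway }]
  rules:
    - backendRefs:
        - { name: registry, namespace: prod-ecs-green, port: 8080, weight: 0 }    # backout: drain green
        - { name: registry, namespace: prod-ecs-blue,  port: 8080, weight: 100 }  # restore blue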

Handover Package (what Engineering delivers to Ops/Support)

| Artifact | Description |
| --- | --- |
| Service Runbook | Start/stop, health checks, dependencies, common faults |
| Resiliency Profile | Timeouts, retries, backoff, bulkheads, circuits (effective values) |
| SLO/SLA Sheet | SLIs, targets, alert policies, escalation paths |
| Dashboards Index | Links to Golden Signals, Tenant Health, Security, Adapters |
| Config Reference | Default values, schema refs, policy overlays per edition |
| Playbooks | Incident scenarios, DR steps, hotfix flow |
| Release Notes | Known issues, feature flags, migration notes |
| Test Evidence | Perf, chaos, DR drills, security scans |
| Compliance Evidence | Audit manifest sample, retention policy, PII posture |
| Contact Matrix | SMEs by component, backup ICs, vendor contacts |

All artifacts live in ops/runbooks, ops/evidence, and tenant-safe copies in Support KB.


Training Notes (role-based)

Engineering

  • Goal: Own code in prod responsibly.
  • Modules
    • Architecture & domain model (2h)
    • Resiliency & chaos toolkit (90m)
    • Observability deep-dive: tracing taxonomies, SLOs (90m)
    • Secure coding + secret refs (60m)
    • Release & hotfix procedures (60m)
  • Labs
    • Break and fix: inject 200 ms Redis latency → verify degrade path
    • Canary failure drill with Argo Rollouts

DevOps / SRE

  • Goal: Operate at SLO; minimize MTTR.
  • Modules
    • DR & failover orchestration (2h)
    • Capacity & autoscaling (HPA/KEDA) (90m)
    • FinOps dashboards & budgets (60m)
    • Compliance evidence generation (45m)
  • Labs
    • Region failover simulation; measure RTO/RPO
    • DLQ growth → replay & recovery

Support / CSE

  • Goal: First-line diagnosis & clear comms.
  • Modules
    • Studio & API basics; common errors (90m)
    • Reading dashboards; when to escalate (60m)
    • Tenant cost & quota coaching (45m)
  • Macros & KB
    • 304 vs 200 ETag explainer, rate-limit guidance, policy denial decoding, rollback steps
  • Escalation
    • SEV thresholds, IC paging, evidence to capture (tenant, route, x-correlation-id)

Day-0 / Day-1 / Day-2 Operations

  • Day-0 (Cutover): execute plan; hypercare (24h); record metrics snapshot
  • Day-1 (Stabilize): finalize blue tear-down; adjust HPA/KEDA; clear temporary overrides
  • Day-2+ (Operate): weekly perf+cost review; monthly rollback drill; quarterly DR/chaos gameday

Checklists (copy-paste ready)

Go/No-Go (final 1h)

  • All SLO tiles green 30m
  • No High/Critical CVEs in diff since staging
  • DR alias & AFD health verified
  • Pager test fired to on-call
  • Backout commands pasted & dry-run output saved

Post-Cutover (T+60m)

  • Error% ≤ baseline; p95/p99 ≤ targets
  • Propagation p95 ≤ 5s; DLQ normal
  • Customer comms sent; status page updated
  • Blue kept hot; timer set for 24h review

Acceptance Criteria (engineering hand-off)

  • PRR completed with Go decision; artifacts archived.
  • All Quality Gates enforced in CI/CD and verified in admission.
  • Pre-flight checklists executed and recorded per region.
  • Cutover plan executed with evidence bundle (metrics, traces, audit exports).
  • Handover package delivered to Ops/Support; training sessions completed and tracked.
  • Backout tested; RTO/RPO measured and logged.

Solution Architect Notes

  • Keep runbooks short and command-first; link deep docs rather than embedding.
  • Treat training as a product: measure retention via quarterly drills and refreshers.
  • Automate evidence packs after PRR, cutover, and drills—compliance should be a by-product of good operations.