📄 External Configuration System - Solution Design Document

SDS-MVP Overview – External Configuration System (ECS)

🎯 Context

The External Configuration System (ECS) is ConnectSoft’s Configuration-as-a-Service (CaaS) solution. It provides a centralized, secure, and tenant-aware configuration backbone for SaaS and microservice ecosystems, built on ConnectSoft principles of Clean Architecture, DDD, cloud-native design, event-driven mindset, and observability-first execution.

ECS addresses the operational complexity of managing configuration at scale, offering:

  • Centralized registry with versioning and rollback.
  • Tenant and edition-aware overlays.
  • Refreshable and event-driven propagation.
  • Multi-provider adapters (Azure AppConfig, AWS AppConfig, Redis, SQL/CockroachDB, Consul).
  • REST/gRPC APIs, SDKs, and a Config Studio UI.

📦 Scope of MVP

The MVP will focus on delivering the core configuration lifecycle:

  1. Core Services

    • Config Registry: Authoritative store of configuration items, trees, and bundles.
    • Policy Service: Tenant, edition, and environment overlays with validation rules.
    • Refresh Service: Event-driven propagation to SDKs and client services.
    • Adapter Hub: Extensible connectors to external providers.
  2. Access Channels

    • APIs: REST + gRPC endpoints with OpenAPI/gRPC contracts.
    • SDKs: .NET, JS, Mobile libraries with caching and refresh subscription.
    • Config Studio: UI for admins, developers, and viewers with approval workflows.
  3. Foundational Layers

    • Security-first: OIDC/OAuth2, RBAC, edition-aware scoping, secret management.
    • Observability-first: OTEL spans, metrics, structured logs, trace IDs.
    • Cloud-native: Containerized, autoscaling, immutable deployments.
    • Event-driven: CloudEvents-based change propagation.

🧑‍🤝‍🧑 Responsibilities by Service

| Service | Responsibility |
|---|---|
| Config Registry | CRUD, versioning, history, rollback of configuration. |
| Policy Service | Edition/tenant overlays, RBAC enforcement, schema validation. |
| Refresh Service | Emits ConfigChanged events, ensures idempotent propagation. |
| Adapter Hub | Provider adapters for Azure AppConfig, AWS AppConfig, Redis, SQL, Consul. |
| Gateway/API | Auth, routing, rate limits, tenant scoping. |
| SDKs | Local caching, refresh subscription, fallback logic. |
| Config Studio | Admin UX, workflow approvals, auditing, policy management. |

🗺️ High-Level Service Map

flowchart TD
    ClientApp[Client Apps / SDKs]
    Studio[Config Studio UI]
    Gateway[API Gateway]

    Registry[Config Registry]
    Policy[Policy Service]
    Refresh[Refresh Service]
    Adapter[Adapter Hub]

    ClientApp -->|REST/gRPC| Gateway
    Studio -->|Admin APIs| Gateway
    Gateway --> Registry
    Gateway --> Policy
    Registry --> Refresh
    Refresh --> ClientApp
    Registry --> Adapter
    Adapter --> External[Azure/AWS/Redis/SQL/Consul]
Hold "Alt" / "Option" to enable pan & zoom

🏢 Tenant and Edition Model

  • Multi-Tenant Isolation: Every tenant has an isolated configuration namespace, scoped across DB, cache, and API.

  • Edition Overlays: Global → Edition → Tenant → Service → Instance inheritance chain. Editions (e.g., Free, Pro, Enterprise) overlay baseline config with restrictions and feature toggles.

  • Environments: Dev, Stage, Prod environments are first-class entities, enforced via approval workflows in Studio.

  • RBAC Roles:

    • Admin: Full control of config + secrets.
    • Developer: Can manage configs but not override edition-level rules.
    • Viewer: Read-only access.

📊 Quality Targets

| Area | Target |
|---|---|
| Availability | 99.9% SLA (single-region MVP), roadmap for multi-region HA. |
| Latency | < 50 ms config retrieval via API or SDK cache. |
| Scalability | 100 tenants, 10k config items per tenant in MVP scope. |
| Security | Zero Trust, tenant isolation, no cross-tenant leaks. |
| Observability | 100% trace coverage with traceId, tenantId, editionId. |
| Resilience | Local SDK caching, retry + at-least-once event delivery. |
| Extensibility | Modular provider adapters and schema-based validation. |

✅ Solution Architect Notes

  • MVP services must be scaffolded using ConnectSoft.MicroserviceTemplate for uniformity.
  • Registry persistence in CockroachDB with tenant-aware schemas.
  • Hot-path caching via Redis with per-tenant keyspaces.
  • API Gateway secured with OpenIddict / OAuth2 and workload identities.
  • Events propagated via MassTransit over Azure Service Bus (default) with DLQ and replay.
  • SDKs should include offline-first fallback for resilience.
  • Next solution cycles should expand into API contracts, event schemas, and deployment topologies.

Service Decomposition & Boundaries — ECS Core

This section defines the solution‑level decomposition and interaction boundaries for ECS MVP services: Config Registry, Policy Engine, Refresh Orchestrator, Adapter Hub, Config Studio UI, Auth, and API Gateway. It is implementation‑ready for Engineering/DevOps teams using ConnectSoft templates.


🧩 Components & Responsibilities

| Component | Primary Responsibilities | Owns | External Calls | Exposes |
|---|---|---|---|---|
| API Gateway | Routing, authN/authZ enforcement, rate limiting, tenant routing, request shaping, canary/blue‑green | N/A | Auth, Core services | REST/gRPC north‑south edge |
| Auth (OIDC/OAuth2) | Issuer, scopes, service‑to‑service mTLS, token introspection, role->scope mapping | Identity config | N/A | OIDC endpoints, JWKS |
| Config Registry | CRUD of config items/trees/bundles; immutable versions; diffs; rollback; audit append | Registry schema in CockroachDB; audit log | Redis (read‑through), Event Bus (emit), DB | REST/gRPC: configs, versions, diffs, rollback |
| Policy Engine | Schema validation (JSON Schema), edition/tenant/env overlays, approval policy evaluation | Policy rules, schemas | Registry (read), Event Bus (emit on policy change) | REST: validate, resolve preview; policy mgmt APIs |
| Refresh Orchestrator | Emits ConfigPublished events; fan‑out invalidations; long‑poll/WebSocket channel mgmt | Delivery ledger (idempotency), consumer offsets | Event Bus (consume/produce), Redis (invalidate) | Event streams; refresh channels |
| Adapter Hub | Pluggable connectors to Azure AppConfig/AWS AppConfig/Consul/Redis/SQL; mirror/sync | Adapter registrations; provider cursors | Providers; Registry (write via API), Bus (emit) | REST: adapter mgmt; background sync jobs |
| Config Studio UI | Admin UX, drafts, reviews, approvals, diffs, audit viewer, policy editor | N/A (UI only) | Gateway, Auth | Browser app; Admin APIs via Gateway |

Tech Baseline (all core services): .NET 9, ConnectSoft.MicroserviceTemplate, OTEL, HealthChecks, MassTransit, Redis, CockroachDB (NHibernate for Registry, Dapper allowed in Adapters).


🧭 Bounded Contexts & Ownership

| Bounded Context | Aggregate Roots | Key Invariants | Who Can Change |
|---|---|---|---|
| Registry | ConfigItem, ConfigVersion, Bundle, AuditEntry | Versions are append‑only; rollback creates a new version; paths unique per (tenant, env) | Studio (human), Adapter Hub (automated), CI (automation) |
| Policy | PolicyRule, Schema, EditionOverlay | Resolve = Base ⊕ Edition ⊕ Tenant ⊕ Service ⊕ Instance; schema must validate pre‑publish | Studio (policy admins) |
| Propagation | Delivery, Subscription | At‑least‑once; idempotency by (tenant, path, version, etag) | Refresh Orchestrator |
| Adapters | ProviderRegistration, SyncCursor, Mapping | No direct DB writes; must use Registry APIs; deterministic mapping | Adapter Hub |

🔐 Tenancy & Scope Enforcement

| Layer | How Enforced | Notes |
|---|---|---|
| Gateway | JWT claims: tenant_id, scopes[]; per‑tenant rate limits; route constraints | Reject cross‑tenant routes |
| Services | Multi‑tenant keyspace (tenant:{id}:...) for cache; DB row filters; repository guards | Tenancy middleware in template |
| Events | CloudEvents extensions: tenantId, editionId, environment | Dropped if missing/invalid |

🔄 Interaction Rules (Who May Call Whom)

  • Studio UI → Gateway → {Registry, Policy}
  • Adapters → Gateway → Registry (never direct DB access)
  • Policy ↔ Registry (read for validation & preview only; Registry calls Policy for validation)
  • Registry → Refresh Orchestrator (emit) → Event Bus → SDKs/Consumers
  • Orchestrator → Redis (targeted invalidations)
  • Gateway ↔ Auth (token validation/JWKS)

Forbidden: Cross‑context writes (e.g., Policy writing Registry tables), Adapters writing DB directly, services bypassing Gateway for north‑south.


📡 Contracts — Endpoints & Events (MVP surface)

REST (Gateway‑fronted)

  • POST /configs/{path}:save-draft
  • POST /configs/{path}:publish
  • GET /configs/{path}:resolve?version|latest&env&service&instance (ETag, If‑None‑Match)
  • GET /configs/{path}/diff?fromVersion&toVersion
  • POST /policies/validate (body: config draft + context)
  • PUT /policies/rules/{id} / PUT /policies/editions/{editionId}
  • POST /adapters/{providerId}:sync / GET /adapters

gRPC (low‑latency SDK)

  • ResolveService.Resolve(ConfigResolveRequest) → stream / unary
  • RefreshChannel.Subscribe(Subscription) → server stream

CloudEvents (Event Bus topics)

  • ecs.config.v1.ConfigDraftSaved
  • ecs.config.v1.ConfigPublished
  • ecs.policy.v1.PolicyUpdated
  • ecs.adapter.v1.SyncCompleted
  • ecs.refresh.v1.CacheInvalidated

Event fields (common): id, source, type, time, dataRef?; ext: tenantId, editionId, environment, path, version, etag.
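
The common envelope maps directly to a small payload builder. A minimal C# sketch follows, assuming System.Text.Json; the sample values and the dataRef URI form are illustrative, not a fixed contract.

using System;
using System.Text.Json;

// Illustrative envelope for an ecs.config.v1.ConfigPublished event.
// Field names mirror the list above; concrete values are placeholders.
var envelope = new
{
    specversion = "1.0",
    id = Guid.NewGuid().ToString(),
    source = "ecs/config-registry",
    type = "ecs.config.v1.ConfigPublished",
    time = DateTimeOffset.UtcNow,
    // ECS extension attributes carried on every event:
    tenantId = "tenant-123",
    editionId = "pro",
    environment = "prod",
    path = "apps/billing/db/connectionString",
    version = "v1.4.2",
    etag = "b64url-sha256",
    // Optional pointer to materialized content instead of inlining it (assumed URI form).
    dataRef = "ecs://tenant-123/apps/billing@v1.4.2"
};

Console.WriteLine(JsonSerializer.Serialize(envelope,
    new JsonSerializerOptions { WriteIndented = true }));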


🗺️ Component Diagram

flowchart LR
  subgraph Edge
    GW[API Gateway]
    AUTH["Auth (OIDC)"]
  end

  subgraph Core
    REG[Config Registry]
    POL[Policy Engine]
    REF[Refresh Orchestrator]
    ADP[Adapter Hub]
  end

  SDK[SDKs/.NET JS Mobile]
  UI[Config Studio UI]
  BUS[(Event Bus)]
  REDIS[(Redis)]
  CRDB[(CockroachDB)]
  EXT[Ext Providers: Azure/AWS/Consul/Redis/SQL]

  UI-->GW
  SDK-->GW
  GW-->REG
  GW-->POL
  REG-- read/write -->CRDB
  REG-- emit -->REF
  REG-- hot read -->REDIS
  REF-- publish -->BUS
  BUS-- notify -->SDK
  REF-- targeted del -->REDIS
  ADP-- API -->GW
  ADP-- sync -->EXT
  GW-- JWT/JWKS -->AUTH
Hold "Alt" / "Option" to enable pan & zoom

📚 Key Sequences

1) Resolve (read hot path)

sequenceDiagram
  participant SDK
  participant GW
  participant REG as Registry
  participant RED as Redis
  participant POL as Policy
  SDK->>GW: GET /configs/{path}:resolve (If-None-Match: etag)
  GW->>RED: GET tenant:path:etag
  alt Cache hit & ETag matches
    RED-->>GW: 304
    GW-->>SDK: 304 Not Modified
  else Miss or ETag mismatch
    GW->>REG: resolve(path, ctx)
    REG->>POL: validate+overlay(draft/version, ctx)
    POL-->>REG: resolved value + etag
    REG->>RED: SET tenant:path -> value, etag, ttl
    REG-->>GW: 200 + value + ETag
    GW-->>SDK: 200 + value + ETag
  end
Hold "Alt" / "Option" to enable pan & zoom

2) Publish (write warm path)

sequenceDiagram
  participant UI as Studio UI
  participant GW
  participant REG as Registry
  participant REF as Orchestrator
  participant BUS as Event Bus
  participant RED as Redis
  UI->>GW: POST /configs/{path}:publish (Idempotency-Key)
  GW->>REG: publish(path, draftId)
  REG-->>REG: create immutable version, append audit
  REG->>RED: DEL tenant:path:* (scoped)
  REG->>REF: emit ConfigPublished(tenant, path, version, etag)
  REF->>BUS: publish CloudEvent
  BUS-->>SDK: wake/notify
  REG-->>GW: 200 Published(version, etag)
  GW-->>UI: 200 Published(version, etag)
Hold "Alt" / "Option" to enable pan & zoom

3) Adapter Sync (ingest)

sequenceDiagram
  participant ADP as Adapter Hub
  participant EXT as Ext Provider
  participant GW
  participant REG as Registry
  ADP->>EXT: fetch changes(since cursor)
  EXT-->>ADP: items delta
  ADP-->>ADP: map/transform (deterministic)
  ADP->>GW: POST /configs/{path}:save-draft (service identity)
  ADP->>GW: POST /configs/{path}:publish
  GW->>REG: publish(...)
  REG-->>ADP: ack + new cursor
Hold "Alt" / "Option" to enable pan & zoom

📈 NFRs by Service (Solution Targets)

| Service | Latency | Throughput | Scaling | Storage |
|---|---|---|---|---|
| Gateway | < 5 ms overhead | 5k RPS | HPA by RPS & p99 | N/A |
| Registry | p99 < 20 ms (cached), < 50 ms (DB) | 2k RPS read / 200 RPS write | HPA by CPU + DB queue | CockroachDB |
| Policy | p99 < 15 ms | 1k RPS | Co‑locate with Registry; cache schemas | Internal store |
| Orchestrator | < 100 ms publish | 5k msg/s | Scale by bus lag (KEDA) | Small ledger |
| Adapter Hub | N/A | Interactive/bursty sync | Per‑adapter workers | Provider cursors |
| Redis | p99 < 2 ms | > 50k ops/s | Clustered shards | In‑memory |

🚦 Failure Modes & Backpressure

| Scenario | Behavior | Operator Signal |
|---|---|---|
| Registry DB degraded | Serve from Redis cache; reject writes with 503 RETRY | Alert: p99 DB > threshold |
| Event bus outage | Buffer in Orchestrator (bounded); degrade to polling | Alert: bus lag + DLQ growth |
| Adapter provider slow | Backoff + skip tenant slice; do not block core | Adapter sync error rate |
| Policy validation fails | Block publish; keep draft; return violations | Policy violation dashboard |
| Cache stampede | Single‑flight per (tenant, path); stale‑while‑revalidate | Cache hit ratio dip |

Idempotency for publishes: Idempotency-Key header → (tenant, path, hash(body)) stored for TTL to prevent duplicates.
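
A minimal C# sketch of that rule, assuming a simple record-based store; the type and helper names (IdempotencyRecord, BuildRecord) are illustrative, not ECS APIs.

using System;
using System.Security.Cryptography;
using System.Text;

// Sketch of the publish idempotency rule above: the Idempotency-Key maps to
// (tenant, path, hash(body)) and the record is kept for a TTL.
public sealed record IdempotencyRecord(string Key, string TenantId, string Path,
    string BodyHash, DateTimeOffset ExpiresAt);

public static class PublishIdempotency
{
    public static IdempotencyRecord BuildRecord(
        string idempotencyKey, string tenantId, string path, string body, TimeSpan ttl)
    {
        var bodyHash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(body)));

        return new IdempotencyRecord(
            Key: $"{tenantId}:{path}:{idempotencyKey}",
            TenantId: tenantId,
            Path: path,
            BodyHash: bodyHash,
            ExpiresAt: DateTimeOffset.UtcNow.Add(ttl));
    }

    // A replay with the same key and body hash returns the stored response;
    // the same key with a different body hash should be rejected.
    public static bool IsReplay(IdempotencyRecord stored, string bodyHash) =>
        stored.BodyHash == bodyHash && stored.ExpiresAt > DateTimeOffset.UtcNow;
}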


🚀 Deployability & Operability

  • One container per service, health probes: /health/live, /health/ready; startup probe for Registry migrations.
  • Config via ECS itself (bootstrapped defaults from environment), secrets via KeyVault/Key Management.
  • Dashboards: golden signals per service; propagation lag, cache hits, publish failure rate.
  • Policies as code: policy repo with CI validation; signed artifacts promoted with environment gates.

✅ Solution Architect Notes

  • Prefer NHibernate in Registry for aggregate invariants; use Dapper inside Adapter workers for raw mapping speed.
  • Default bus: Azure Service Bus; keep RabbitMQ option in template flags.
  • SDKs ship with long‑poll by default; enable WS later behind a feature flag.
  • Hard rule: Adapters must use public APIs (no private shortcuts) to keep invariants centralized in Registry.
  • Next, solidify OpenAPI/gRPC contracts and CloudEvents schemas to unblock parallel implementation.

API Gateway & AuthN/AuthZ — Edge Design for ECS

The ECS edge enforces identity, tenant isolation, and traffic governance before requests reach core services. This section defines the Envoy/YARP gateway design, OIDC flows, scope/role model, tenant routing, and rate limiting used by:

  • Config Studio (SPA)
  • Public ECS APIs/SDKs (REST/gRPC)
  • Service-to-service (internal)
  • Adapters & webhooks

Outcomes: Zero‑trust ingress, edition-aware routing, consistent scopes, and predictable throttling per tenant/plan.


Edge Components & Trust Boundaries

flowchart LR
  subgraph PublicZone[Public Zone]
    Client[Browsers/SDKs/CLI]
    WAF[WAF/DoS Shield]
    Envoy[Envoy Gateway]
  end

  subgraph ServiceZone[Service Mesh / Internal Zone]
    YARP[YARP Internal Gateway]
    AuthZ[Ext AuthZ/Policy PDP]
    API[Config API]
    Policy[Policy Engine]
    Audit[Audit Service]
    Events[Event Publisher]
  end

  IdP[(OIDC Provider)]
  JWKS[(JWKS Cache)]
  RLS[(Rate Limit Service)]

  Client --> WAF --> Envoy
  Envoy -->|JWT verify| JWKS
  Envoy -->|ext_authz| AuthZ
  Envoy -->|quota| RLS
  Envoy --> YARP
  YARP --> API
  YARP --> Policy
  YARP --> Audit
  Envoy <-->|OAuth/OIDC| IdP
Hold "Alt" / "Option" to enable pan & zoom

Patterns

  • Envoy is the internet-facing gateway (JWT, mTLS upstream, rate limit, ext_authz).
  • YARP is the east–west/internal router (BFF for Studio, granular routing, canary, sticky reads).
  • AuthZ PDP centralizes fine-grained decisions (scope→action→resource→tenant) using PDP/OPA or Policy Engine APIs.

OIDC Flows & Token Types

| Client/Use Case | Grant / Flow | Token Audience (aud) | Notes |
|---|---|---|---|
| Config Studio (SPA) | Auth Code + PKCE | ecs.api | Implicit flow denied; refresh via offline_access (rotating refresh). |
| Machine-to-Machine (services) | Client Credentials | ecs.api | Used by backend jobs, adapters, CI/CD. |
| SDK in customer service | Token Exchange (RFC 8693) or Client Credentials | ecs.api | Exchange SaaS identity → ECS limited token; preserves act chain. |
| Support/Break-glass (time-box) | Device Code / Auth Code | ecs.admin | Short TTL, extra policy gates + audit. |
| Webhooks / legacy integrators | HMAC API Key (fallback) | n/a | Signed body + timestamp; scoped to tenant + resources; rotatable. |

Sequence: SPA (Auth Code + PKCE)

sequenceDiagram
  participant B as Browser (SPA)
  participant G as Envoy
  participant I as OIDC Provider
  participant Y as YARP
  participant A as Config API

  B->>G: GET /studio
  G-->>B: 302 → /authorize (PKCE)
  B->>I: /authorize?code_challenge...
  I-->>B: 302 → /callback?code=...
  B->>G: /callback?code=...
  G->>I: /token (code+verifier)
  I-->>G: id_token + access_token(jwt) + refresh_token
  G->>Y: /api/configs (Authorization: Bearer ...)
  Y->>A: forward (mTLS)
  A-->>Y: 200
  Y-->>G: 200
  G-->>B: 200 + content
Hold "Alt" / "Option" to enable pan & zoom

Sequence: S2S (Client Credentials)

sequenceDiagram
  participant Job as Worker/Adapter
  participant I as OIDC Provider
  participant G as Envoy
  participant P as Policy Engine
  participant A as Config API

  Job->>I: POST /token (client_credentials, scopes)
  I-->>Job: access_token(jwt)
  Job->>G: API call + Bearer jwt
  G->>P: ext_authz {sub, scopes, tenant_id, action, resource}
  P-->>G: ALLOW (obligations: edition, rate_tier)
  G->>A: forward (mTLS)
  A-->>G: 200
  G-->>Job: 200
Hold "Alt" / "Option" to enable pan & zoom

Scopes, Roles & Permissions

Scope Catalogue (prefix ecs.)

| Scope | Purpose | Sample Actions |
|---|---|---|
| config.read | Read effective/declared config | GET /configs, /resolve |
| config.write | Modify config (CRUD, rollout) | POST/PUT /configs, /rollback |
| policy.read | Read policy artifacts | GET /policies |
| policy.write | Manage policies, schemas | PUT /policies, /schemas |
| audit.read | Read audit & diffs | GET /audit/* |
| snapshot.manage | Import/export snapshots | POST /snapshots/export |
| adapter.manage | Manage provider connectors | POST /adapters/* |
| tenant.admin | Tenant-level admin ops | Keys, members, editions |

Scopes are granted per tenant; optional qualifiers: tenant:{id}, env:{name}, app:{id}, region:{code}.

Role→Scope Mapping (default policy)

| Role | Scopes |
|---|---|
| tenant-reader | config.read, audit.read |
| tenant-editor | config.read, config.write, audit.read |
| tenant-admin | All above + policy.read, policy.write, snapshot.manage, adapter.manage, tenant.admin |
| platform-admin | Cross-tenant (requires x-tenant-admin=true claim + break-glass policy) |
| support-operator | Time-boxed, read-most, write via approval policy |

Claims required on JWT

  • sub, iss, aud, exp, iat
  • tenant_id (or tenants array for delegated tools)
  • edition_id, env, plan_tier
  • scopes (space-separated)
  • Optional: act (actor chain), delegated=true, region, app_id

Decision rule: ALLOW = jwt.valid ∧ aud∈{ecs.api,ecs.admin} ∧ scope→action ∧ resource.scope.includes(tenant_id/env/app) ∧ edition.allows(action)
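
The rule can be expressed as a pure predicate. A C# sketch under simplifying assumptions (single audience check, tenant-only resource containment); the token and resource types are illustrative placeholders.

using System;
using System.Linq;

// Sketch of the edge decision rule above as a pure function.
public sealed record EcsToken(bool Valid, string Audience, string[] Scopes, string TenantId);
public sealed record ResourceScope(string TenantId, string Environment, string AppId);

public static class EdgeAuthorization
{
    public static bool Allow(EcsToken jwt, string requiredScope, ResourceScope resource,
        Func<string, bool> editionAllows)
    {
        return jwt.Valid
            && (jwt.Audience == "ecs.api" || jwt.Audience == "ecs.admin")
            && jwt.Scopes.Contains(requiredScope)   // scope → action
            && resource.TenantId == jwt.TenantId    // simplified tenant/env/app containment
            && editionAllows(requiredScope);        // edition gates the action
    }
}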


Tenant & Edition Routing

Resolution order (first hit wins):

  1. Host-based: {tenant}.ecs.connectsoft.cloud → tenant_id={tenant}
  2. Header: X-Tenant-Id: {tenant_id} (required for M2M/SDK if no host binding)
  3. Path prefix: /t/{tenant_id}/... (CLI & bulk ops)
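
A C# sketch of the first-hit-wins resolution order above; the domain suffix handling and helper names are assumptions.

using System;

// Host binding, then X-Tenant-Id header, then /t/{tenantId}/ path prefix.
public static class TenantResolver
{
    public static string? Resolve(string host, string? tenantHeader, string path)
    {
        // 1) Host-based: {tenant}.ecs.connectsoft.cloud
        const string suffix = ".ecs.connectsoft.cloud";
        if (host.EndsWith(suffix, StringComparison.OrdinalIgnoreCase))
            return host[..^suffix.Length];

        // 2) Header: X-Tenant-Id (required for M2M/SDK without host binding)
        if (!string.IsNullOrWhiteSpace(tenantHeader))
            return tenantHeader.Trim();

        // 3) Path prefix: /t/{tenantId}/...
        if (path.StartsWith("/t/", StringComparison.Ordinal))
        {
            var rest = path[3..];
            var slash = rest.IndexOf('/');
            return slash > 0 ? rest[..slash] : (rest.Length > 0 ? rest : null);
        }

        return null; // no tenant binding found; the gateway rejects the request
    }
}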

Validation

  • Verify tenant exists & active in Tenant Registry.
  • Map edition and plan tier to policy & rate tiers.
  • Enforce data residency: route to region cluster (EU/US/APAC).
  • Attach routing headers: x-tenant-id, x-edition-id, x-plan-tier, x-region.

Envoy route (illustrative)

- match: { prefix: "/api/" }
  request_headers_to_add:
    - header: { key: x-tenant-id, value: "%REQ(X-Tenant-Id)%" }
  typed_per_filter_config:
    envoy.filters.http.jwt_authn:
      requirement_name: ecs_jwt
  route:
    cluster: ecs-region-%DYNAMIC_REGION_FROM_TENANT%

YARP internal routes (illustrative)

{
  "Routes": [
    { "RouteId": "cfg", "Match": { "Path": "/api/configs/{**catch}" }, "ClusterId": "config-api" },
    { "RouteId": "pol", "Match": { "Path": "/api/policies/{**catch}" }, "ClusterId": "policy-api" }
  ],
  "Clusters": {
    "config-api": { "Destinations": { "d1": { "Address": "https://config-api.svc.cluster.local" } } },
    "policy-api":  { "Destinations": { "d1": { "Address": "https://policy-api.svc.cluster.local" } } }
  }
}

Rate Limiting & Quotas

Algorithms: Token Bucket at Envoy (global), local leaky-bucket per worker; descriptors include tenant_id, scope, route, plan_tier.

Default Tiers (edition-aware)

| Edition | Read RPS (burst) | Write RPS (burst) | Events/min | Webhook deliveries/min | Notes |
|---|---|---|---|---|---|
| Starter | 150 (600) | 10 (30) | 600 | 120 | Best-effort propagation |
| Pro | 600 (2,400) | 40 (120) | 3,000 | 600 | Priority queueing |
| Enterprise | 2,000 (8,000) | 120 (360) | 12,000 | 2,400 | Dedicated partitions, regional failover priority |

429 Behavior

  • Retry-After header with bucket ETA.
  • Audit a RateLimitExceeded event (tenant-scoped).
  • SDKs apply jittered exponential backoff and honor Retry-After (see the client-side sketch below).
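
A client-side C# sketch of this behavior, assuming HttpClient; the attempt limit and base delay are illustrative.

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Honor Retry-After when present, otherwise fall back to full-jitter exponential backoff.
public static class ThrottleAwareClient
{
    public static async Task<HttpResponseMessage> SendWithBackoffAsync(
        HttpClient http, Func<HttpRequestMessage> requestFactory, int maxAttempts = 5)
    {
        var rng = Random.Shared;
        for (var attempt = 1; ; attempt++)
        {
            var response = await http.SendAsync(requestFactory());
            if (response.StatusCode != HttpStatusCode.TooManyRequests || attempt >= maxAttempts)
                return response;

            var delay = response.Headers.RetryAfter?.Delta
                ?? TimeSpan.FromMilliseconds(
                    Math.Min(30_000, 250 * Math.Pow(2, attempt)) * rng.NextDouble());

            response.Dispose();
            await Task.Delay(delay);
        }
    }
}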

Envoy RLS descriptors

domain: ecs
descriptors:
  - key: tenant_id
    descriptors:
      - key: scope
        rate_limit:
          unit: second
          requests_per_unit: 600   # Pro read default

Security Controls at Edge

  • JWT verification at Envoy (per-tenant JWKS cache, kid pinning, TTL ≤ 10m).
  • ext_authz callout to PDP for ABAC/RBAC decision + obligations (edition, plan).
  • mTLS between Envoy↔YARP↔services (SPIFFE/SPIRE or workload identity).
  • CORS locked to Studio & allowed origins (configurable per tenant).
  • HMAC API Keys for webhooks: header X-ECS-Signature: sha256=... over canonical payload + ts.
  • WAF/DoS: IP reputation, geo-fencing by residency, request size caps, JSON schema guard on public write endpoints.
  • Key rotation: JWKS rollover, client secret rotation, API key rotation (max 90d).
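
A C# sketch of the X-ECS-Signature scheme above; the exact canonicalization and timestamp framing are assumptions to be confirmed against the webhook specification.

using System;
using System.Security.Cryptography;
using System.Text;

// Sign the canonical payload plus a Unix timestamp and send it as
// X-ECS-Signature: sha256=<hex>; receivers verify in constant time and reject stale timestamps.
public static class WebhookSigner
{
    public static string Sign(string apiKeySecret, string canonicalPayload, DateTimeOffset ts)
    {
        var material = $"{ts.ToUnixTimeSeconds()}.{canonicalPayload}";
        using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(apiKeySecret));
        var hash = hmac.ComputeHash(Encoding.UTF8.GetBytes(material));
        return "sha256=" + Convert.ToHexString(hash).ToLowerInvariant();
    }

    public static bool Verify(string apiKeySecret, string canonicalPayload,
        DateTimeOffset ts, string signatureHeader, TimeSpan maxSkew)
    {
        if (DateTimeOffset.UtcNow - ts > maxSkew) return false; // replay guard
        var expected = Sign(apiKeySecret, canonicalPayload, ts);
        return CryptographicOperations.FixedTimeEquals(
            Encoding.UTF8.GetBytes(expected), Encoding.UTF8.GetBytes(signatureHeader));
    }
}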

Error Model & Observability

Errors

  • 401 invalid/expired tokens → WWW-Authenticate: Bearer error="invalid_token".
  • 403 valid token but insufficient scope/tenant.
  • 429 throttled with Retry-After.
  • 400 schema violations; 415 content-type errors.

Problem Details (RFC 7807): return type, title, detail, traceId, tenant_id.

Telemetry

  • OTEL spans at Envoy & YARP; propagate traceparent, x-tenant-id.
  • Metrics: requests_total{tenant,route,scope}, authz_denied_total, jwks_refresh_seconds, 429_total.
  • Logs: structured, redacted; audit who/what/when/why for authz decisions.

Configuration-as-Code (GitOps)

  • Envoy & YARP configs templated, per environment & region; changes gate via PR + canary.
  • Policy bundles versioned (OPA or Policy Engine exports).
  • Rate-limit tiers read from ECS system config (edition-aware) with hot reload.

Reference Implementations

Envoy (JWT + ext_authz + RLS excerpt)

http_filters:
  - name: envoy.filters.http.jwt_authn
    typed_config:
      providers:
        ecs:
          issuer: "https://id.connectsoft.cloud"
          audiences: ["ecs.api","ecs.admin"]
          remote_jwks: { http_uri: { uri: "https://id.connectsoft.cloud/.well-known/jwks.json", cluster: idp }, cache_duration: 600s }
      rules:
        - match: { prefix: "/api/" }
          requires: { provider_name: "ecs" }

  - name: envoy.filters.http.ext_authz
    typed_config:
      grpc_service: { envoy_grpc: { cluster_name: pdp } }
      transport_api_version: V3
      failure_mode_allow: false

  - name: envoy.filters.http.ratelimit
    typed_config:
      domain: "ecs"
      rate_limit_service: { grpc_service: { envoy_grpc: { cluster_name: rls } } }

YARP (authZ pass-through + canary)

{
  "ReverseProxy": {
    "Transforms": [
      { "RequestHeaderOriginalHost": "true" },
      { "X-Forwarded": "Append" }
    ],
    "SessionAffinity": { "Policy": "Cookie", "AffinityKeyName": ".ecs.aff" },
    "HealthChecks": { "Passive": { "Enabled": true } }
  }
}

SDK & Client Guidance (Edge Contracts)

  • Send tenant: Prefer host-based; otherwise send X-Tenant-Id.
  • Set audience: aud=ecs.api; request only necessary scopes.
  • Respect 429: backoff with jitter; honor Retry-After.
  • Propagate trace: include traceparent for cross-service correlation.
  • Cache: ETag/If-None-Match on read endpoints.

Operational Runbook (Edge)

  • Rollover JWKS: stage new keys, overlap 24h, monitor jwt_authn_failed.
  • Policy hotfix: push PDP bundle; verify there is no regression in authz_denied_total.
  • Canary: 5% traffic to new Envoy/YARP images; roll forward on SLO steady 30m.
  • Tenant cutover (region move): drain + update registry → immediate routing update.
  • Incident: spike in 429 → inspect descriptor stats; temporarily bump burst for affected tenant via override CRD.

Solution Architect Notes

  • Decision pending: PDP choice (OPA sidecar vs. centralized Policy Engine API). Prototype both; target <5 ms p95 decision time.
  • Token exchange: Add formal RFC 8693 support to IdP for SDK delegation chains (act claim preservation).
  • Multi‑IdP federation: Map external AAD/Okta groups → ECS roles via claims transformation at IdP.
  • Per‑tenant custom domains: Support config.{customer}.com via TLS cert automation (ACME) and SNI routing.
  • Quota analytics: Expose per‑tenant usage API (current window, limit, projected exhaustion) to Studio.

This design enables secure, tenant-aware, edition-governed access with predictable performance and operational clarity from the very first release.

Public REST API (OpenAPI) – endpoints, DTOs, error model, idempotency, pagination, filtering

Scope & positioning

This section specifies the public, multi-tenant REST API for ECS v1, including resource model, endpoint surface, DTOs, error envelope, and cross‑cutting behaviors (auth, idempotency, pagination, filtering, ETags). It is implementation‑ready for Engineering/SDK agents and aligns with Gateway/Auth design already defined.

Principles

  • Versioned URIs (/api/v1/...) and semantic versions in x-api-version response header.
  • Tenant‑scoped by token (preferred) with optional admin paths that accept {tenantId}.
  • OAuth2/OIDC (JWT Bearer) with scopes: ecs.read, ecs.write, ecs.admin.
  • Problem Details (RFC 7807) with ECS extensions for all non‑2xx results.
  • ETag + If‑Match for optimistic concurrency on mutable resources.
  • Idempotency-Key for POST that create/trigger side effects.
  • Cursor pagination, stable sorting, safe filtering DSL.
  • Observability-first: traceparent, x-correlation-id, x-tenant-id (echo), x-request-id.

Core resource model

| Resource | Purpose | Key fields |
|---|---|---|
| ConfigSet | Logical configuration package (name, app, env targeting, tags) | id, name, appId, labels[], status, createdAt, updatedAt, etag |
| ConfigItem | Key/value (JSON/YAML) entries within a ConfigSet | key, value, contentType, isSecret, labels[], etag |
| Snapshot | Immutable, signed version of a ConfigSet | id, configSetId, version, createdBy, hash, note |
| Deployment | Promotion of a Snapshot to an environment/segment | id, snapshotId, environment, status, policyEval, startedAt, completedAt |
| Policy | Targeting, validation, transform rules | id, kind, expression, enabled |
| Refresh | Push/notify clients to reload (per set/tenant/app) | id, scope, status |

Endpoint surface (v1)

All paths are relative to /api/v1. Admin endpoints are prefixed with /admin/tenants/{tenantId} when cross‑tenant ops are required.

Config sets

| Method | Path | Scope | Notes |
|---|---|---|---|
| POST | /config-sets | ecs.write | Create (idempotent via Idempotency-Key) |
| GET | /config-sets | ecs.read | List with cursor pagination, filtering, sorting |
| GET | /config-sets/{setId} | ecs.read | Retrieve |
| PATCH | /config-sets/{setId} | ecs.write | Partial update (If‑Match required) |
| DELETE | /config-sets/{setId} | ecs.write | Soft delete (If‑Match required) |

Items (within a set)

| Method | Path | Scope | Notes |
|---|---|---|---|
| PUT | /config-sets/{setId}/items/{key} | ecs.write | Upsert single (If‑Match optional on update) |
| GET | /config-sets/{setId}/items/{key} | ecs.read | Get single |
| DELETE | /config-sets/{setId}/items/{key} | ecs.write | Delete single (If‑Match required) |
| POST | /config-sets/{setId}/items:batch | ecs.write | Bulk upsert/delete; idempotent with key |
| GET | /config-sets/{setId}/items | ecs.read | List/filter keys, supports Accept: application/yaml |

Snapshots & diffs

| Method | Path | Scope | Notes |
|---|---|---|---|
| POST | /config-sets/{setId}/snapshots | ecs.write | Create snapshot (idempotent via content hash) |
| GET | /config-sets/{setId}/snapshots | ecs.read | List versions (cursor) |
| GET | /config-sets/{setId}/snapshots/{snapId} | ecs.read | Get snapshot metadata |
| GET | /config-sets/{setId}/snapshots/{snapId}/content | ecs.read | Materialized config (JSON/YAML) |
| POST | /config-sets/{setId}/diff | ecs.read | Compute diff of two snapshots or working set |

Deployments & refresh

| Method | Path | Scope | Notes |
|---|---|---|---|
| POST | /deployments | ecs.write | Deploy a snapshot to target; idempotent |
| GET | /deployments | ecs.read | List/filter by set, env, status |
| GET | /deployments/{deploymentId} | ecs.read | Status, policy results, logs cursor |
| POST | /refresh | ecs.write | Trigger refresh notifications (idempotent) |

Policies

| Method | Path | Scope | Notes |
|---|---|---|---|
| POST | /config-sets/{setId}/policies | ecs.write | Create policy |
| GET | /config-sets/{setId}/policies | ecs.read | List |
| GET | /config-sets/{setId}/policies/{policyId} | ecs.read | Get |
| PUT | /config-sets/{setId}/policies/{policyId} | ecs.write | Replace (If‑Match) |
| DELETE | /config-sets/{setId}/policies/{policyId} | ecs.write | Delete (If‑Match) |

Cross‑cutting headers & behaviors

| Header | Direction | Purpose |
|---|---|---|
| Authorization: Bearer <jwt> | In | OIDC; contains sub, tid (tenant), scp (scopes) |
| Idempotency-Key | In | UUIDv4 per create/trigger; 24h window, per tenant+route+body hash |
| If-Match / ETag | In/Out | Concurrency for updates/deletes; strong ETags on representations |
| traceparent | In | W3C tracing; echoed to spans |
| x-correlation-id | In/Out | Client-provided; echoed on response and logs |
| x-api-version | Out | Semantic API impl version |
| RateLimit-* | Out | From gateway (limits & reset) |

DTOs (canonical schemas)

openapi: 3.1.0
info:
  title: ConnectSoft ECS Public API
  version: 1.0.0
servers:
  - url: https://api.connectsoft.com/ecs/api/v1
components:
  securitySchemes:
    oauth2:
      type: oauth2
      flows:
        clientCredentials:
          tokenUrl: https://auth.connectsoft.com/oauth2/token
          scopes:
            ecs.read: Read ECS resources
            ecs.write: Create/update ECS resources
            ecs.admin: Cross-tenant administration
  schemas:
    ConfigSet:
      type: object
      required: [id, name, appId, status, createdAt, updatedAt, etag]
      properties:
        id: { type: string, format: uuid }
        name: { type: string, maxLength: 120 }
        appId: { type: string }
        labels: { type: array, items: { type: string, maxLength: 64 } }
        status: { type: string, enum: [Active, Archived] }
        description: { type: string, maxLength: 1024 }
        createdAt: { type: string, format: date-time }
        updatedAt: { type: string, format: date-time }
        etag: { type: string }
    ConfigItem:
      type: object
      required: [key, value, contentType]
      properties:
        key: { type: string, maxLength: 256, pattern: "^[A-Za-z0-9:\\._\\-/]+$" }
        value: { oneOf: [ { type: object }, { type: array }, { type: string }, { type: number }, { type: boolean }, { type: "null" } ] }
        contentType: { type: string, enum: [application/json, application/yaml, text/plain] }
        isSecret: { type: boolean, default: false }
        labels: { type: array, items: { type: string } }
        etag: { type: string }
    Snapshot:
      type: object
      required: [id, configSetId, version, hash, createdAt]
      properties:
        id: { type: string, format: uuid }
        configSetId: { type: string, format: uuid }
        version: { type: string, pattern: "^v\\d+\\.\\d+\\.\\d+$" }
        note: { type: string, maxLength: 512 }
        hash: { type: string, description: "SHA-256 of materialized content" }
        createdBy: { type: string }
        createdAt: { type: string, format: date-time }
    Deployment:
      type: object
      required: [id, snapshotId, environment, status, startedAt]
      properties:
        id: { type: string, format: uuid }
        snapshotId: { type: string, format: uuid }
        environment: { type: string, enum: [dev, test, staging, prod] }
        status: { type: string, enum: [Queued, InProgress, Succeeded, Failed, Cancelled] }
        policyEval: { type: object, additionalProperties: true }
        startedAt: { type: string, format: date-time }
        completedAt: { type: string, format: date-time, nullable: true }
    Problem:
      type: object
      description: RFC 7807 with ECS extensions
      required: [type, title, status, traceId, code]
      properties:
        type: { type: string, format: uri }
        title: { type: string }
        status: { type: integer, minimum: 100, maximum: 599 }
        detail: { type: string }
        instance: { type: string }
        code: { type: string, description: "Stable machine error code" }
        traceId: { type: string }
        tenantId: { type: string }
        violations:
          type: array
          items: { type: object, properties: { field: {type: string}, message: {type: string}, code: {type: string} } }
    ListResponse:
      type: object
      properties:
        items: { type: array, items: { } } # overridden per path via allOf
        nextCursor: { type: string, nullable: true }
        total: { type: integer, description: "Optional total when cheap" }

Representative paths (excerpt)

paths:
  /config-sets:
    get:
      security: [{ oauth2: [ecs.read] }]
      parameters:
        - in: query
          name: cursor
          schema: { type: string }
        - in: query
          name: limit
          schema: { type: integer, minimum: 1, maximum: 200, default: 50 }
        - in: query
          name: sort
          schema: { type: string, enum: [name, createdAt, updatedAt] }
        - in: query
          name: order
          schema: { type: string, enum: [asc, desc], default: asc }
        - in: query
          name: filter
          schema: { type: string, description: "See Filtering DSL" }
      responses:
        "200":
          description: OK
          content:
            application/json:
              schema:
                allOf:
                  - $ref: "#/components/schemas/ListResponse"
                  - type: object
                    properties:
                      items:
                        type: array
                        items: { $ref: "#/components/schemas/ConfigSet" }
        default:
          description: Error
          content: { application/problem+json: { schema: { $ref: "#/components/schemas/Problem" } } }
    post:
      security: [{ oauth2: [ecs.write] }]
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [name, appId]
              properties:
                name: { type: string }
                appId: { type: string }
                description: { type: string }
                labels: { type: array, items: { type: string } }
      parameters:
        - in: header
          name: Idempotency-Key
          required: true
          schema: { type: string, format: uuid }
      responses:
        "201": { description: Created, headers: { ETag: { schema: { type: string } } }, content: { application/json: { schema: { $ref: "#/components/schemas/ConfigSet" } } } }
        "409": { description: Conflict, content: { application/problem+json: { schema: { $ref: "#/components/schemas/Problem" } } } }

  /config-sets/{setId}:
    get:
      security: [{ oauth2: [ecs.read] }]
      parameters: [ { in: path, name: setId, required: true, schema: { type: string, format: uuid } } ]
      responses: { "200": { content: { application/json: { schema: { $ref: "#/components/schemas/ConfigSet" } } } }, "404": { content: { application/problem+json: { schema: { $ref: "#/components/schemas/Problem" } } } } }
    patch:
      security: [{ oauth2: [ecs.write] }]
      parameters:
        - { in: path, name: setId, required: true, schema: { type: string, format: uuid } }
        - { in: header, name: If-Match, required: true, schema: { type: string } }
      requestBody:
        content:
          application/merge-patch+json:
            schema: { type: object, properties: { description: {type: string}, labels: { type: array, items: {type: string} }, status: {type: string, enum: [Active, Archived]} } }
      responses:
        "200": { headers: { ETag: { schema: { type: string } } }, content: { application/json: { schema: { $ref: "#/components/schemas/ConfigSet" } } } }
        "412": { description: Precondition Failed, content: { application/problem+json: { schema: { $ref: "#/components/schemas/Problem" } } } }

  /config-sets/{setId}/items:batch:
    post:
      security: [{ oauth2: [ecs.write] }]
      parameters:
        - { in: path, name: setId, required: true, schema: { type: string, format: uuid } }
        - { in: header, name: Idempotency-Key, required: true, schema: { type: string, format: uuid } }
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                upserts: { type: array, items: { $ref: "#/components/schemas/ConfigItem" } }
                deletes: { type: array, items: { type: string, description: "keys to delete" } }
      responses:
        "200": { description: Applied, content: { application/json: { schema: { type: object, properties: { upserted: {type: integer}, deleted: {type: integer} } } } } }

Error model (Problem Details)

Media type: application/problem+json. Base fields: type, title, status, detail, instance. ECS extensions:

  • code – stable machine code (e.g., ECS.CONFLICT.ETAG_MISMATCH, ECS.VALIDATION.FAILED)
  • traceId – correlates with OTEL traces
  • tenantId
  • violations[] – field‑level errors (field, message, code)

Examples

  • 409 Conflict (duplicate name): code=ECS.CONFLICT.DUPLICATE_NAME
  • 412 Precondition Failed (ETag): code=ECS.CONFLICT.ETAG_MISMATCH
  • 429 Too Many Requests (rate limit): code=ECS.RATE_LIMITED with RateLimit-* headers

Idempotency

  • Required on POST requests that create (/config-sets, /deployments, /refresh) or batch‑modify (items:batch) resources.
  • Key scope: {tenantId}|{route}|SHA256(body); window: 24 hours (configurable).
  • Behavior:
    • First request persists an idempotency record with the response body, status, and headers (including ETag, Location).
    • Subsequent identical requests (same key) return the cached response with Idempotent-Replay: true.
    • Key collision with a different body hash → 409 with code=ECS.IDEMPOTENCY.BODY_MISMATCH.
  • Safe methods (GET/HEAD) must not accept Idempotency-Key.

Pagination, sorting & filtering

Cursor pagination (default)

  • Query: cursor, limit (1–200; default 50).
  • Response: nextCursor (opaque). When null, end of page.
  • Optional total when cheap to compute; otherwise omitted.

Sorting

  • sort in a whitelist per resource (e.g., name|createdAt|updatedAt).
  • order: asc|desc (default asc).

Filtering DSL (simple, safe)

  • Single filter parameter using a constrained expression language:
    • Grammar: expr := field op value; op := eq|ne|gt|lt|ge|le|in|like
    • Conjunctions with and, or; parentheses supported.
    • Values URL‑encoded; strings in single quotes.
  • Examples:
    • filter=appId eq 'billing-service' and status eq 'Active'
    • filter=labels in ('prod','blue')
    • filter=updatedAt gt '2025-08-01T00:00:00Z'
  • Field allow‑list per resource to prevent injection.
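
A simplified C# sketch of allow-list enforcement for single "field op value" conditions; conjunctions and parentheses are left to the full parser, and the field list shown is only an example.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Validate a single condition against the operator set and a per-resource field allow-list.
public static class FilterGuard
{
    private static readonly HashSet<string> AllowedOps =
        new(StringComparer.OrdinalIgnoreCase)
        { "eq", "ne", "gt", "lt", "ge", "le", "in", "like" };

    private static readonly Regex Condition = new(
        @"^\s*(?<field>[A-Za-z][A-Za-z0-9_.]*)\s+(?<op>[a-z]+)\s+(?<value>.+?)\s*$",
        RegexOptions.Compiled);

    public static bool IsAllowed(string condition, ISet<string> allowedFields)
    {
        var m = Condition.Match(condition);
        return m.Success
            && allowedFields.Contains(m.Groups["field"].Value)
            && AllowedOps.Contains(m.Groups["op"].Value);
    }
}

// Example: FilterGuard.IsAllowed("appId eq 'billing-service'",
//     new HashSet<string> { "appId", "status", "updatedAt" })  // true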

Typical client flow (sequence)

sequenceDiagram
  participant Cli as Client
  participant GW as API Gateway
  participant API as ECS REST API
  participant PE as Policy Engine
  participant RO as Refresh Orchestrator

  Cli->>GW: POST /config-sets (Idempotency-Key)
  GW->>API: JWT(scopes=ecs.write, tid=...) + headers
  API-->>Cli: 201 ConfigSet (ETag)

  Cli->>GW: POST /config-sets/{setId}/items:batch (Idempotency-Key)
  GW->>API: Upserts & deletes
  API-->>Cli: 200 summary

  Cli->>GW: POST /config-sets/{setId}/snapshots (Idempotency-Key)
  GW->>API: create snapshot
  API-->>Cli: 201 Snapshot

  Cli->>GW: POST /deployments (Idempotency-Key)
  GW->>API: deploy snapshot to prod
  API->>PE: Evaluate policies
  API-->>Cli: 202 Deployment accepted

  Cli->>GW: POST /refresh (Idempotency-Key)
  GW->>RO: emit refresh signal
  RO-->>Cli: 202 Accepted
Hold "Alt" / "Option" to enable pan & zoom

Security & scopes quick map

| Operation | Scope(s) |
|---|---|
| Read any resource | ecs.read |
| Create/update/delete set/items/snapshots/deployments/policies | ecs.write |
| Cross‑tenant admin (admin paths) | ecs.admin |

Tenant is resolved from JWT tid claim; server rejects any cross‑tenant attempts on non‑admin routes.


SDK alignment (high‑level)

  • .NET/TypeScript SDKs will be generated from this OpenAPI, exposing:
    • EcsClient with typed methods (createConfigSetAsync, listConfigSetsAsync…).
    • Optional resilience policies (retries on 409/412 with ETag refresh).
    • Built‑in Idempotency-Key generator and replay handling.
    • Paged iterators (for await (const set of client.configSets.listAll(filter))).

Solution Architect Notes

  • Filtering DSL: accept this simple grammar now, revisit OData/RSQL compatibility later if needed.
  • Idempotency window: proposed 24h; confirm compliance requirements (auditability vs storage pressure).
  • Max page size: 200 proposed; validate against expected cardinality of items per set.
  • Secrets: isSecret=true values are write‑only; reads must return masked; align with provider capabilities.
  • Multi‑format content: support YAML in Accept for content endpoints; default JSON elsewhere.
  • Admin paths: include /admin/tenants/{tenantId}/... only for ECS Ops/partners; hide from standard tenants.

Ready‑to‑build artifacts

  • OpenAPI 3.1 file (expand from excerpt).
  • Contract tests (SDK acceptance):
    • Idempotency replay
    • ETag/If‑Match conflict
    • Cursor traversal
    • Filter grammar parsing & allow‑list enforcement
  • Gateway route + scope policy generation from tags/paths.

gRPC Contracts — low‑latency reads & streaming refresh channels

Scope & goals

This section defines gRPC service contracts for low‑latency SDK/agent scenarios and push‑based refresh. It complements the REST API by optimizing the read hot path and near‑real‑time change delivery with backpressure, resumability, and idempotent semantics.

Outcomes

  • Unary Resolve with sub‑10 ms in‑cluster latency (cached).
  • Server/bidi Refresh streams with at‑least‑once delivery and client acks.
  • Error details via google.rpc.* for actionable retries/backoff.
  • Strong multi‑tenant isolation via metadata + request fields.

Transport & security assumptions

  • gRPC over HTTP/2, TLS required; mTLS for service‑to‑service.
  • OIDC JWT in authorization: Bearer <jwt> metadata; audience ecs.api.
  • Per‑call deadline required by clients; server enforces sane max (e.g., 5s for unary, 60m for streams).
  • Tenant context required via metadata x-tenant-id (validated against JWT) and echoed in responses.
  • OTEL propagation: grpc-trace-bin/traceparent metadata.

Packages & options

syntax = "proto3";

package connectsoft.ecs.v1;

option csharp_namespace = "ConnectSoft.Ecs.V1";
option go_package       = "github.com/connectsoft/ecs/api/v1;ecsv1";

import "google/protobuf/any.proto";
import "google/protobuf/struct.proto";
import "google/protobuf/timestamp.proto";
import "google/rpc/error_details.proto";

Services & RPCs (IDL)

// Low-latency, cached resolution of effective configuration.
service ResolveService {
  // Resolve a single key or a set path; supports conditional fetch with ETag.
  rpc Resolve(ResolveRequest) returns (ResolveResponse);

  // Batch resolve multiple keys/paths in one RPC (atomic per key).
  rpc ResolveBatch(ResolveBatchRequest) returns (ResolveBatchResponse);

  // Enumerate keys under a path with server-side paging.
  rpc ListKeys(ListKeysRequest) returns (ListKeysResponse);
}

// Push notifications for changes; supports server-stream and bidi with ACKs.
service RefreshChannel {
  // Simple server-stream subscription; ACKs are implicit by flow control.
  rpc Subscribe(Subscription) returns (stream RefreshEvent);

  // Resumable, at-least-once delivery with explicit ACKs on the same stream.
  rpc SubscribeWithAck(stream RefreshClientMessage)
      returns (stream RefreshServerMessage);
}

// (Optional/internal) admin operations for agents and adapters.
service InternalAdmin {
  rpc SaveDraft(SaveDraftRequest) returns (SaveDraftResponse);
  rpc Publish(PublishRequest) returns (PublishResponse);
  rpc Rollback(RollbackRequest) returns (RollbackResponse);
}

Messages (core)

// Common context for resolution.
message Context {
  string environment = 1;        // dev|test|staging|prod
  string app_id      = 2;        // logical application id
  string service     = 3;        // optional microservice id
  string instance    = 4;        // optional instance id
  string edition_id  = 5;        // optional, overrides tenant default
  map<string,string> labels = 10;// optional targeting labels
}

message ResolveRequest {
  string path             = 1;  // e.g., "apps/billing/db/connectionString"
  Context context         = 2;
  string if_none_match_etag = 3; // conditional fetch
  // If version unset, server uses "latest".
  string version          = 4;  // semantic or snapshot id
}

message ResolvedValue {
  google.protobuf.Value value = 1; // JSON value
  string content_type         = 2; // application/json|yaml|text/plain
}

message ResolveResponse {
  bool not_modified          = 1; // true when ETag matches
  string etag                = 2; // strong ETag of resolved content
  ResolvedValue resolved     = 3; // omitted when not_modified=true
  string provenance_version  = 4; // snapshot or semantic version
  google.protobuf.Timestamp expires_at = 5; // cache hint
  map<string,string> meta    = 9;  // server hints (e.g., "stale-while-revalidate":"2s")
}

message ResolveBatchRequest {
  repeated ResolveRequest requests = 1;
}

message ResolveBatchResponse {
  repeated ResolveResponse responses = 1; // 1:1 with requests (same order)
}

message ListKeysRequest {
  string path     = 1;  // list under this prefix
  Context context = 2;
  int32  page_size = 3; // 1..500, default 100
  string page_token = 4;// opaque
  // Optional filtering on labels and name
  string filter = 5;    // e.g., "name like 'conn%'" or "label in ('prod','blue')"
  string order_by = 6;  // "name asc|desc", "updated_at desc"
}

message KeyEntry {
  string key   = 1;
  string etag  = 2;
  google.protobuf.Timestamp updated_at = 3;
  repeated string labels = 4;
}

message ListKeysResponse {
  repeated KeyEntry items = 1;
  string next_page_token  = 2; // empty => end
}

Streaming refresh contracts

// What to watch.
message Subscription {
  // If omitted, server infers tenant from metadata and defaults to latest/any env.
  Context context = 1;
  // Paths to watch. Supports prefix semantics.
  repeated PathSelector selectors = 2;
  // Receive only specific event types.
  repeated EventType event_types = 3;
  // Resumability: provide last committed cursor to continue after reconnect.
  string resume_after_cursor = 4;
  // Heartbeat period requested by client (server may clamp).
  int32 heartbeat_seconds = 5;
}

message PathSelector {
  string value = 1;          // e.g., "apps/billing/**" or exact key
  SelectorType type = 2;     // EXACT|PREFIX|GLOB
}

enum SelectorType { SELECTOR_TYPE_UNSPECIFIED = 0; EXACT = 1; PREFIX = 2; GLOB = 3; }

enum EventType {
  EVENT_TYPE_UNSPECIFIED = 0;
  CONFIG_PUBLISHED = 1;    // new immutable version published
  CACHE_INVALIDATED = 2;   // targeted cache purge
  POLICY_UPDATED = 3;      // policy/edition overlay changed
}

message RefreshEvent {
  string event_id = 1;           // unique id for idempotency
  string cursor   = 2;           // monotonically increasing per-tenant offset
  EventType type  = 3;
  string path     = 4;           // affected key/prefix
  string version  = 5;           // snapshot/semantic version
  string etag     = 6;           // new etag for resolve
  google.protobuf.Timestamp time = 7;
  map<string,string> meta = 8;   // e.g., reason, actor, correlation id
}

message RefreshClientMessage {
  oneof message {
    Subscription subscribe = 1;
    Ack          ack       = 2;
    Heartbeat    heartbeat = 3;
  }
}

message RefreshServerMessage {
  oneof message {
    RefreshEvent event      = 1;
    Heartbeat    heartbeat  = 2;
    Nack         nack       = 3; // e.g., invalid subscription; includes retry info
  }
}

message Ack {
  string cursor = 1;  // last successfully processed cursor
}

message Nack {
  string code           = 1;  // stable machine code, e.g., "ECS.SUBSCRIPTION.INVALID_SELECTOR"
  string human_message  = 2;
  google.rpc.RetryInfo retry = 3; // backoff guidance
}

message Heartbeat {
  google.protobuf.Timestamp time = 1;
  string server_id = 2;
}

Delivery guarantees

  • At-least-once: Events may repeat; dedupe by event_id or cursor.
  • Ordering: Per tenant and selector forward ordering is preserved best-effort; cross‑selector ordering is not guaranteed.
  • Resumption: Provide resume_after_cursor or Ack.cursor to continue after disconnect.
  • Flow control: gRPC backpressure applies. Server honors slow consumers with bounded buffers, then nacks with RESOURCE_EXHAUSTED when limits are exceeded.
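
A consumer-side C# sketch of these guarantees: deduplicate by event_id, apply the change, then checkpoint the cursor that will be sent as resume_after_cursor after a reconnect. The processor shape is illustrative, not an SDK type.

using System;
using System.Collections.Generic;

public sealed class RefreshEventProcessor
{
    private readonly HashSet<string> _seenEventIds = new();
    private string? _lastCommittedCursor;

    public void Handle(string eventId, string cursor, Action apply)
    {
        if (!_seenEventIds.Add(eventId))
            return; // duplicate delivery (at-least-once); already applied

        apply();                        // e.g., invalidate the local cache entry
        _lastCommittedCursor = cursor;  // ack this cursor on the SubscribeWithAck stream
    }

    // Used as Subscription.resume_after_cursor when reconnecting.
    public string? ResumeCursor => _lastCommittedCursor;
}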

Error model & rich details

  • gRPC status codes + google.rpc.*:
    • INVALID_ARGUMENT + BadRequest (schema/filter/selector errors)
    • UNAUTHENTICATED, PERMISSION_DENIED
    • NOT_FOUND, FAILED_PRECONDITION (ETag mismatch on writes in internal admin)
    • RESOURCE_EXHAUSTED + QuotaFailure (rate/tenant quotas)
    • ABORTED (concurrency conflict)
    • UNAVAILABLE + RetryInfo (transient)
  • Common ErrorInfo fields:
    • reason: stable code like ECS.IDEMPOTENCY.BODY_MISMATCH
    • domain: "ecs.connectsoft.cloud"
    • metadata: tenantId, path, cursor, retryAfter

Metadata (headers/trailers)

Incoming (from client)

  • authorization: Bearer <jwt> (required)
  • x-tenant-id: <id> (required if not derivable from host; must match JWT)
  • x-idempotency-key: <uuid> (for InternalAdmin mutating RPCs)
  • traceparent / grpc-trace-bin (optional, recommended)

Outgoing (from server)

  • x-tenant-id, x-edition-id, x-plan-tier (echo)
  • x-cache: hit|miss|revalidated (Resolve*)
  • x-etag: <etag> (Resolve*)
  • Trailers may include google.rpc.status-details-bin for rich error details

Semantics & performance notes

Resolve / ResolveBatch

  • Reads are served from Redis (read‑through) with ETag; DB only on miss.
  • not_modified=true when if_none_match_etag matches current value.
  • expires_at provides a soft TTL hint for SDK cache; SDK should also listen to RefreshChannel for proactive reloads.
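
An SDK-side C# sketch of this conditional-read loop: cache by key, send the cached ETag as if_none_match_etag, and keep the cached value when not_modified is returned. The cache entry shape and resolver delegate are assumptions rather than SDK contracts.

using System;
using System.Collections.Concurrent;

public sealed class ResolveCache
{
    private sealed record Entry(string Etag, object Value, DateTimeOffset ExpiresAt);
    private readonly ConcurrentDictionary<string, Entry> _entries = new();

    public object GetOrResolve(string cacheKey,
        Func<string?, (bool NotModified, string Etag, object Value, DateTimeOffset ExpiresAt)> resolve)
    {
        _entries.TryGetValue(cacheKey, out var cached);

        // Within the soft TTL, serve locally; RefreshChannel events purge entries earlier.
        if (cached is not null && cached.ExpiresAt > DateTimeOffset.UtcNow)
            return cached.Value;

        var result = resolve(cached?.Etag);   // conditional Resolve RPC
        if (result.NotModified && cached is not null)
        {
            _entries[cacheKey] = cached with { ExpiresAt = result.ExpiresAt };
            return cached.Value;
        }

        _entries[cacheKey] = new Entry(result.Etag, result.Value, result.ExpiresAt);
        return result.Value;
    }

    public void Invalidate(string cacheKey) => _entries.TryRemove(cacheKey, out _);
}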

ListKeys

  • Server‑validated filter (subset of REST DSL) with allow‑listed fields only.
  • Stable ordering with cursor tokens (opaque).

RefreshChannel

  • Subscribe: simplest server stream; rely on TCP/HTTP2 flow control and client reconnect with resume on disconnects.
  • SubscribeWithAck: recommended for agents and high‑value consumers; server commits delivery when it receives Ack with last cursor.
  • Heartbeats are sent every 15–60 s (configurable). Clients should close and reopen the stream after three consecutive missed heartbeats.

Versioning & compatibility

  • Package versioned at connectsoft.ecs.v1.
  • Additive changes only (new fields with higher tags, default behavior preserved).
  • SDKs must ignore unknown event types by default; reserve numeric ranges per event family.

Optional .NET server (code‑first) sketch

[ServiceContract(Name = "ResolveService", Namespace = "connectsoft.ecs.v1")]
public interface IResolveService
{
    ValueTask<ResolveResponse> Resolve(ResolveRequest request, CallContext ctx = default);
    ValueTask<ResolveBatchResponse> ResolveBatch(ResolveBatchRequest request, CallContext ctx = default);
    ValueTask<ListKeysResponse> ListKeys(ListKeysRequest request, CallContext ctx = default);
}

[ServiceContract(Name = "RefreshChannel", Namespace = "connectsoft.ecs.v1")]
public interface IRefreshChannel
{
    IAsyncEnumerable<RefreshEvent> Subscribe(Subscription sub, CallContext ctx = default);
    IAsyncEnumerable<RefreshServerMessage> SubscribeWithAck(
        IAsyncEnumerable<RefreshClientMessage> messages, CallContext ctx = default);
}

SDK guidance (client side)

  • Always set a deadline (e.g., 250–500 ms for Resolve, 2–5 s for ResolveBatch).
  • Respect RetryInfo and gRPC status codes; use exponential backoff with jitter for UNAVAILABLE/RESOURCE_EXHAUSTED.
  • Maintain a local cache keyed by (tenant, context, path) with ETag; prefer refresh stream over polling.
  • Deduplicate refresh events by event_id/cursor; checkpoint the last committed cursor.
  • Use per‑tenant channels for isolation and to preserve ordering.

Solution Architect Notes

  • The ACKed bidi stream should be the default for SDKs; simple server-stream fits lightweight/mobile.
  • Implement cursor compaction (periodic checkpoints) to bound replay windows.
  • Consider compressed payloads (grpc-encoding: gzip) for large batch resolves; set max message size caps.
  • Add feature flag to emit CloudEvents envelope on refresh events when integrating with external buses.
  • Validate multi‑region behavior: cursor is region‑scoped; cross‑region failover resets to last replicated cursor.

This contract enables high‑performance, low‑latency config reads and robust, resumable change propagation—ready for codegen and template scaffolding.

Config Versioning & Lineage — semantic versioning, snapshotting, diffs, provenance, rollback, immutability guarantees

Objectives

Define how configuration is versioned, snapshotted, diffed, traced, and rolled back in ECS with cryptographic immutability and auditable lineage—so engineering can implement storage, APIs, and SDK behaviors consistently.


Versioning Model (SemVer + Content Addressing)

Concepts

  • Draft – mutable workspace of a ConfigSet (unpublished).
  • Snapshot – immutable point‑in‑time materialization of a ConfigSet after policy validation.
  • Version (SemVer) – human‑friendly tag (e.g., v1.4.2) aliased to a concrete Snapshot.
  • Alias – semantic pointers like latest, stable, lts, environment pins (prod-current).
  • Content Hash – SHA-256(canonical-json) of materialized set; primary ETag and immutability anchor.

Rules

  • SemVer (MAJOR.MINOR.PATCH):
    • MAJOR for breaking schema/keys removal, MINOR for additive keys, PATCH for value changes only.
    • Pre‑release allowed for staged rollouts: v1.2.0-rc.1.
  • ETag = base64url(SHA-256(canonical-json)); strong validator for reads and updates.
  • Canonicalization (for hashing/diff):
    • Deterministic key ordering (UTF‑8, lexicographic).
    • No insignificant whitespace; numbers normalized; booleans/literals preserved.
    • Secrets are represented by references (not raw values) when computing hash to avoid exposure and to allow secret rotation without content drift (see “Secrets & Immutability”).
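
For a flat key/value set, the ETag rule above reduces to a few lines of C#; full canonicalization (nested objects, number normalization, secret references) is richer than this sketch.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

// Order keys lexicographically, serialize without insignificant whitespace,
// hash with SHA-256, and base64url-encode the digest.
public static class SnapshotEtag
{
    public static string Compute(IReadOnlyDictionary<string, string> items)
    {
        var ordered = new SortedDictionary<string, string>(
            items.ToDictionary(kv => kv.Key, kv => kv.Value), StringComparer.Ordinal);

        var canonical = JsonSerializer.Serialize(ordered);
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));

        return Convert.ToBase64String(hash)                        // base64url: URL-safe
            .TrimEnd('=').Replace('+', '-').Replace('/', '_');     // alphabet, no padding
    }
}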

Snapshotting Lifecycle

sequenceDiagram
  participant UI as Studio UI
  participant API as Registry API
  participant POL as Policy Engine
  participant AUD as Audit Store
  participant ORC as Refresh Orchestrator

  UI->>API: Save Draft (mutations…)
  UI->>API: POST /config-sets/{id}/snapshots (Idempotency-Key)
  API->>POL: Validate schema + policy overlays
  POL-->>API: OK (violations=[])
  API-->>API: Materialize canonical JSON (ordered), compute SHA-256
  API->>AUD: Append SnapshotCreated (hash, size, actor, source)
  API-->>UI: 201 Snapshot {snapshotId, etag, materializedHash}
  UI->>API: POST /deployments (optional)
  API->>ORC: Emit ConfigPublished (tenant, set, version/etag)
Hold "Alt" / "Option" to enable pan & zoom

Snapshot Invariants

  • Write‑once: Snapshot rows are append‑only. No UPDATE/DELETE.
  • Hash‑stable: identical content → identical hash → idempotent creation (API returns 200 with existing Snapshot).
  • Schema‑valid: publish fails if JSON Schema or policy guardrails fail.
  • Provenance‑rich: who/when/why/source captured (see below).

Lineage & Provenance

Data captured per Snapshot

| Field | Description |
|---|---|
| snapshotId (GUID) | Unique identity, primary key. |
| configSetId | Parent set. |
| semver | Optional tag (can move between snapshots); stored in a separate mapping. |
| hash | SHA‑256 of canonical materialization. |
| parents[] | Array of parent snapshot IDs (for merges); by default a single parent for linear history. |
| createdAt/By | UTC timestamp + normalized actor id (sub@iss). |
| source | Studio, API, or Adapter:{provider}. |
| reason | Human message explaining the "why" (commit message). |
| policySummary | Validations/approvals references. |
| sbom (optional) | Content provenance (templates/rules versions) for compliance. |

Lineage graph (DAG)

  • Default: linear chain per ConfigSet.
  • Merge: Allowed via “import & reconcile” workflows → multi‑parent Snapshot; computed diff is 3‑way.
graph LR
  A((S-001))-->B((S-002))
  B-->C((S-003))
  C-->D((S-004))
  X((Feature-Branch S-00X))-->M((S-005 merge))
  D-->M
Hold "Alt" / "Option" to enable pan & zoom

Engineering may begin with linear history; merge support can be introduced as an additive feature flag.


Diff Strategy

Outputs

  • Structural diff (RFC 6902 JSON Patch) for machine application.
  • Human diff (grouped change list) for Studio UI, with highlights by key/area.
  • Semantic diff (optional) using Schema annotations to mark breaking/additive changes.

Calculation

  1. Materialize both snapshots into canonical JSON.
  2. Run structural diff producing operations: add, remove, replace, move (rare), copy (rare), test (not emitted).
  3. Annotate operations with:
    • Severity (Breaking/Additive/Neutral) via schema.
    • Scope (security‑sensitive vs. safe).
    • Blast radius (estimated by affected services).
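
As an illustration of step 3, a simple classifier over RFC 6902 operations might look like the sketch below; the severity rules and the shape of the schema/security path sets are assumptions, not the final policy:

using System;
using System.Collections.Generic;

public enum DiffSeverity { Breaking, Additive, Neutral }

public sealed record PatchOp(string Op, string Path, object? Value);            // one RFC 6902 operation
public sealed record AnnotatedOp(PatchOp Op, DiffSeverity Severity, bool SecuritySensitive);

public static class DiffAnnotator
{
    // schemaBreakingPaths / securityPaths would be derived from JSON Schema annotations in the real service.
    public static IEnumerable<AnnotatedOp> Annotate(
        IEnumerable<PatchOp> patch, ISet<string> schemaBreakingPaths, ISet<string> securityPaths)
    {
        foreach (var op in patch)
        {
            var severity = op.Op switch
            {
                "add" => DiffSeverity.Additive,                                   // new keys are additive
                "remove" => DiffSeverity.Breaking,                                // removed keys break consumers
                "replace" when schemaBreakingPaths.Contains(op.Path) => DiffSeverity.Breaking,
                _ => DiffSeverity.Neutral                                         // value-only changes
            };
            yield return new AnnotatedOp(op, severity, securityPaths.Contains(op.Path));
        }
    }
}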

API

  • POST /config-sets/{id}/diff — body: { from: snapshotId|semver, to: snapshotId|semver } Returns: { patch: JsonPatch[], summary: { breaking:int, additive:int, neutral:int } }.

Rollback Semantics

  • Rollback = create new Snapshot from historical Snapshot content; never mutates past entries.
  • Metadata:
    • reason = "rollback to S-00A (v1.3.0)"
    • rollbackOf = "S-00A"
  • Events:
    • ConfigRollbackInitiated (audit)
    • ConfigPublished (normal propagation)
  • Safety Checks:
    • Policy re‑validation on rollback content (schemas may have evolved).
    • Optional “force” flag to bypass non‑breaking warnings (requires tenant.admin).
  • Environment pins (aliases) update to point to the new rollback Snapshot.

Immutability Guarantees

| Area | Guarantee | Mechanism |
| --- | --- | --- |
| Snapshot content | Write‑once, tamper‑evident | Append‑only tables, content hash (ETag), optional signing (JWS) |
| Audit trail | Non‑repudiable | Append‑only log, sequence IDs, hash‑chain (optional) |
| SemVer tag | Mutable pointer | Separate mapping table; changes audited |
| Aliases/pins | Mutable pointer | Versioned alias map with audit |
| Secret values | Not embedded in materialized content | References (KMS/KeyVault path) substituted at resolve time |

Secrets & Immutability: to avoid hash drift and data exposure, Snapshots store secret references (e.g., kvref://vault/secret#version) rather than plaintext. The resolved value is injected at read time by SDK/Resolver based on caller’s credentials.


Storage & Indices (solution‑level)

Tables (logical):

  • ConfigSet(id, tenantId, name, appId, status, createdAt, updatedAt, etagLatest)
  • Snapshot(id, configSetId, hash, createdAt, createdBy, source, reason, policySummaryJson, parentsJson)
  • SnapshotContent(snapshotId, contentJson, sizeBytes) — optional columnar or compressed storage
  • VersionTag(configSetId, semver, snapshotId, taggedAt, taggedBy) — unique(configSetId, semver)
  • Alias(configSetId, alias, snapshotId, updatedAt, updatedBy) — alias in {latest,stable,lts,dev-current,qa-current,prod-current,…}
  • Audit(id, tenantId, setId?, snapshotId?, action, actor, time, attrsJson)

Indexes (suggested):

  • Snapshot(configSetId, createdAt desc) for latest queries
  • VersionTag(configSetId, semver) and Alias(configSetId, alias)
  • Audit(tenantId, time desc); Audit(snapshotId)

Retention:

  • Keep all Snapshots by default.
  • Optional policy: retain N latest per set (e.g., 200) excluding those pinned by VersionTag/Alias or referenced by Deployments.

Concurrency, ETags & Idempotency

  • Draft mutations: optimistic writes with If-Match: <ETagDraft>.
  • Snapshot creation: idempotent by (tenantId|configSetId|hash); server returns existing Snapshot if hash matches.
  • Version tags: set/update with If-Match on current mapping row; conflict → 412 Precondition Failed.
  • Deployments/Refresh: require Idempotency-Key (as specified in REST cycle).
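
A hedged client-side sketch of these rules over REST using HttpClient; the snapshot endpoint and Idempotency-Key header follow the lifecycle above, while the tag request body shape is illustrative:

using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

public static class RegistryHttpExamples
{
    // Snapshot creation is idempotent: repeating the call with the same Idempotency-Key
    // (and identical content hash) returns the existing Snapshot instead of creating a duplicate.
    public static async Task<HttpResponseMessage> CreateSnapshotAsync(
        HttpClient http, string configSetId, string idempotencyKey)
    {
        using var request = new HttpRequestMessage(
            HttpMethod.Post, $"/config-sets/{configSetId}/snapshots");
        request.Headers.Add("Idempotency-Key", idempotencyKey);
        return await http.SendAsync(request);
    }

    // Version-tag updates use optimistic concurrency: If-Match carries the current mapping ETag;
    // a 412 Precondition Failed means another writer won, so re-read and retry.
    public static async Task<bool> TryTagAsync(
        HttpClient http, string configSetId, string semver, string snapshotId, string currentEtag)
    {
        using var request = new HttpRequestMessage(
            HttpMethod.Post, $"/config-sets/{configSetId}/tags")
        {
            Content = new StringContent(
                $"{{\"semver\":\"{semver}\",\"snapshotId\":\"{snapshotId}\"}}", Encoding.UTF8, "application/json")
        };
        request.Headers.IfMatch.Add(new EntityTagHeaderValue($"\"{currentEtag}\""));
        var response = await http.SendAsync(request);
        return response.StatusCode != HttpStatusCode.PreconditionFailed;
    }
}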

SDK Alignment

  • Read Path: SDK caches by (tenant, setId, context, path), using Resolve (gRPC) with if_none_match_etag. On RefreshEvent(etag), SDK fetches only when ETag changed.
  • Pinning: Resolve accepts version|semver|alias. When alias used, response includes actual snapshotId/semver for telemetry.
  • Diff Tooling: SDK helper to produce human diff from JsonPatch with path collapsing and schema annotations.

Operational Runbooks

Tagging & Promotion

  • Tag new Snapshot: POST /config-sets/{id}/tags { semver: "v1.4.0" }
  • Update environment pin: PUT /config-sets/{id}/aliases/prod-current -> S-00D
  • Audit both steps and verify ConfigPublished propagation.

Rollback

  • Locate target Snapshot in Studio (diff view vs current).
  • Trigger POST /config-sets/{id}:rollback { to: "S-00B" }
  • Monitor policy re‑validation; confirm alias pins and deployments updated.

Forensics

  • Export lineage: GET /config-sets/{id}/lineage?format=graphml|json
  • Verify hash chain matches SBOM and policy versions.

Example: Materialization & Hash

// canonical-json (excerpt)
{
  "apps": {
    "billing": {
      "db": {
        "connectionString": "kvref://vault/billing-db#v3",
        "poolSize": 50
      }
    }
  },
  "_meta": {
    "schemaVersion": "2025-08-01",
    "labels": ["prod","blue"]
  }
}
// ETag = base64url(SHA-256(bytes(canonical-json)))
// Secrets will be resolved by SDK/Resolver at runtime using kvref.

Studio UI: Version Tree & Diff UX

  • Tree view of snapshots with badges (semver, aliases, deployments).
  • Diff panel: switch between structural/semantic views, filter by severity (breaking|additive|neutral).
  • Rollback CTA gated by policy result; requires approval if breaking changes detected.

Solution Architect Notes

  • Start with linear lineage (single parent); design tables to allow future DAG merges.
  • Implement canonicalizer as a shared library (used by Registry & SDKs) to avoid hashing inconsistencies.
  • Add optional snapshot signing (JWS) in a future cycle to strengthen tamper evidence and supply chain stories.
  • Confirm secret reference design with Security and Provider Adapter teams to ensure consistent resolution across clouds.
  • Define policy‐driven SemVer bumping: server can recommend version bump level based on semantic diff.

Storage Strategy — CockroachDB (MT‑Aware) & Redis Cache Topology

Objectives

  • Provide a multi‑tenant, multi‑region storage design that preserves immutability for configuration versions and snapshots.
  • Optimize read latency for SDKs/agents via Redis + regional replicas while keeping CockroachDB (CRDB) as the system of record.
  • Enforce edition‑aware retention, partitioning, and operational recoverability (PITR, incremental backups, table‑level restore).

Logical Data Model (solution view)

erDiagram
    TENANT ||--o{ APP : owns
    TENANT ||--o{ ENVIRONMENT : scopes
    APP ||--o{ NAMESPACE : groups
    NAMESPACE ||--o{ CONFIG_SET : defines
    CONFIG_SET ||--o{ CONFIG_VERSION : immutably_versions
    CONFIG_VERSION ||--o{ SNAPSHOT : captures
    CONFIG_VERSION ||--o{ DIFF : compares
    CONFIG_SET ||--o{ POLICY_BINDING : governed_by
    CONFIG_SET ||--o{ ROLLOUT : deployed_via
    ROLLOUT ||--o{ ROLLOUT_STEP : stages
    EVENT_AUDIT }o--|| TENANT : scoped

    TENANT {
      uuid tenant_id PK
      text slug UNIQUE
      text edition  // "Free","Pro","Enterprise"
      string crdb_region_home
    }
    APP {
      uuid app_id PK
      uuid tenant_id FK
      text key  // unique per tenant
      text display_name
    }
    ENVIRONMENT {
      uuid env_id PK
      uuid tenant_id FK
      text name  // "dev","test","prod"
    }
    NAMESPACE {
      uuid ns_id PK
      uuid tenant_id FK
      uuid app_id FK
      text path   // e.g. "payments/api"
    }
    CONFIG_SET {
      uuid set_id PK
      uuid tenant_id FK
      uuid ns_id FK
      uuid env_id FK
      text set_key
      bool is_composite
      text content_hash  // current head
    }
    CONFIG_VERSION {
      uuid version_id PK
      uuid tenant_id FK
      uuid set_id FK
      string semver
      timestamptz created_at
      text author
      text change_summary
      jsonb content  // canonical compiled payload
      text content_hash UNIQUE
      bool is_head   // convenience flag
      text provenance  // URI/commit ref
      bool signed
      bytea signature
    }
    SNAPSHOT {
      uuid snapshot_id PK
      uuid tenant_id FK
      uuid version_id FK
      timestamptz captured_at
      jsonb content
      text source  // "manual","pre-rollout","post-rollout"
    }
    DIFF {
      uuid diff_id PK
      uuid tenant_id FK
      uuid from_version_id FK
      uuid to_version_id FK
      jsonb diff_json  // RFC6902/semantic diff
      text strategy    // "jsonpatch","semantic"
    }
    POLICY_BINDING {
      uuid policy_id PK
      uuid tenant_id FK
      uuid set_id FK
      jsonb rules
    }
    ROLLOUT {
      uuid rollout_id PK
      uuid tenant_id FK
      uuid set_id FK
      uuid target_env_id FK
      text status  // "planned","in-progress","complete","failed","rolled-back"
      text strategy  // "all-at-once","canary","wave"
    }
    ROLLOUT_STEP {
      uuid step_id PK
      uuid tenant_id FK
      uuid rollout_id FK
      int step_no
      jsonb selector  // services/pods/regions
      jsonb outcome
    }
    EVENT_AUDIT {
      uuid event_id PK
      uuid tenant_id FK
      text type
      jsonb payload
      timestamptz at
      text actor
      text trace_id
    }

Notes

  • tenant_id scopes every row. DTOs and queries must supply tenant_id (and often env_id) for isolation and index selectivity.
  • CONFIG_VERSION.content is the immutable compiled document delivered to SDKs (policy already applied). Raw fragments (if used) live in internal tables or object storage and are referenced via provenance.

Physical Design in CockroachDB

Multi‑Region & Locality

  • Database locality: ALTER DATABASE ecs SET PRIMARY REGION <home>; ADD REGION <others>; SURVIVE REGION FAILURE;
  • Table locality:
    • Control/lookup tables: REGIONAL BY ROW with column crdb_region (derived from tenant’s home or set/rollout target).
    • Global reference tables (rare): GLOBAL (e.g., editions, feature flags) to avoid cross‑region fan‑out.
  • Write routing: SDK/Studio via Gateway injects crdb_region/tenant_id; CRDB routes to nearest leaseholder.

Keys, Sharding & Indexes

  • Primary keys: hash‑sharded to avoid hot‑ranges.
    • Example: PRIMARY KEY (tenant_id, set_id, semver) USING HASH WITH BUCKET_COUNT = 16
  • Surrogate IDs: prefer UUIDv7 for time‑ordered locality; store also semver for human lookup.
  • Secondary indexes (all with STORING where helpful):
    • CONFIG_VERSION(tenant_id, set_id, is_head DESC, created_at DESC)
    • CONFIG_VERSION(tenant_id, content_hash) for fast idempotency checks.
    • ROLLOUT(tenant_id, target_env_id, status) for orchestration queries.
  • JSONB columns (content, diff_json) gain targeted GIN indexes for frequently filtered paths (e.g., /featureToggles/*).

Concurrency, Immutability & Idempotency

  • New version creation:
    • Compute content_hash; reject duplicates (idempotent PUT).
    • Mark prior is_head = false atomically.
    • Use SERIALIZABLE transactions with small write sets (CRDB default).
  • No in‑place edits of CONFIG_VERSION.content. Rollback = new version with change_summary="revert to X" and pointer flip.
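
A minimal sketch of that new-version transaction using the Npgsql driver against the config_version table defined below; CockroachDB may require a client-side retry on serialization restarts (SQLSTATE 40001), which is omitted here:

using System;
using System.Threading.Tasks;
using Npgsql;

public static class ConfigVersionWriter
{
    // Runs as one SERIALIZABLE transaction (the CockroachDB default).
    public static async Task PublishVersionAsync(
        NpgsqlConnection conn, // assumed open
        Guid tenantId, Guid setId, string semver, string contentJson, string contentHash)
    {
        await using var tx = await conn.BeginTransactionAsync();

        // 1) Idempotency: identical content already stored for this tenant -> nothing to do.
        using (var check = new NpgsqlCommand(
            "SELECT 1 FROM ecs.config_version WHERE tenant_id = @t AND content_hash = @h", conn, tx))
        {
            check.Parameters.AddWithValue("t", tenantId);
            check.Parameters.AddWithValue("h", contentHash);
            if (await check.ExecuteScalarAsync() is not null) { await tx.RollbackAsync(); return; }
        }

        // 2) Demote the current head for this set.
        using (var demote = new NpgsqlCommand(
            "UPDATE ecs.config_version SET is_head = false WHERE tenant_id = @t AND set_id = @s AND is_head", conn, tx))
        {
            demote.Parameters.AddWithValue("t", tenantId);
            demote.Parameters.AddWithValue("s", setId);
            await demote.ExecuteNonQueryAsync();
        }

        // 3) Append the new immutable version as head; content is never updated in place.
        using (var insert = new NpgsqlCommand(
            "INSERT INTO ecs.config_version (tenant_id, set_id, semver, content, content_hash, is_head) " +
            "VALUES (@t, @s, @v, @c::JSONB, @h, true)", conn, tx))
        {
            insert.Parameters.AddWithValue("t", tenantId);
            insert.Parameters.AddWithValue("s", setId);
            insert.Parameters.AddWithValue("v", semver);
            insert.Parameters.AddWithValue("c", contentJson);
            insert.Parameters.AddWithValue("h", contentHash);
            await insert.ExecuteNonQueryAsync();
        }

        await tx.CommitAsync();
    }
}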

Sample DDL (illustrative)

-- Multi-region setup (performed once)
ALTER DATABASE ecs SET PRIMARY REGION eu-central;
ALTER DATABASE ecs ADD REGION eu-west;
ALTER DATABASE ecs ADD REGION us-east;
ALTER DATABASE ecs SURVIVE REGION FAILURE;

-- Config versions table
CREATE TABLE ecs.config_version (
  tenant_id UUID NOT NULL,
  set_id UUID NOT NULL,
  version_id UUID NOT NULL DEFAULT gen_random_uuid(),
  semver STRING NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  author STRING NULL,
  change_summary STRING NULL,
  content JSONB NOT NULL,
  content_hash STRING NOT NULL,
  provenance STRING NULL,
  is_head BOOL NOT NULL DEFAULT false,
  signed BOOL NOT NULL DEFAULT false,
  signature BYTES NULL,
  crdb_region crdb_internal_region NOT NULL DEFAULT default_to_database_primary_region(gateway_region()),
  CONSTRAINT pk_config_version PRIMARY KEY (tenant_id, set_id, semver) USING HASH WITH BUCKET_COUNT = 16,
  UNIQUE (tenant_id, content_hash),
  UNIQUE (tenant_id, set_id, version_id)
) LOCALITY REGIONAL BY ROW;

CREATE INDEX ix_config_version_head ON ecs.config_version (tenant_id, set_id, is_head DESC, created_at DESC) STORING (content_hash, version_id);

-- TTL example for audit events (see Retention section)
CREATE TABLE ecs.event_audit (
  tenant_id UUID NOT NULL,
  event_id UUID NOT NULL DEFAULT gen_random_uuid(),
  type STRING NOT NULL,
  payload JSONB NOT NULL,
  at TIMESTAMPTZ NOT NULL DEFAULT now() ON UPDATE now(),
  actor STRING NULL,
  trace_id STRING NULL,
  crdb_region crdb_internal_region NOT NULL DEFAULT default_to_database_primary_region(gateway_region()),
  ttl_expires_at TIMESTAMPTZ NULL,
  CONSTRAINT pk_event_audit PRIMARY KEY (tenant_id, at, event_id)
) LOCALITY REGIONAL BY ROW
  WITH (ttl = 'on', ttl_expiration_expression = 'ttl_expires_at', ttl_job_cron = '@hourly');

Redis Cache Topology (read path acceleration)

Topology

  • Redis Cluster (3+ shards per region, replicas ×1) deployed per cloud region close to ECS Gateway.
  • Tenant‑pinned key hashing using hash‑tags to keep hot tenants together and enable selective scaling: ecs:{tenantId}:{env}:{ns}:{setKey}:v{semver} → value = compact binary (MessagePack) or gzip JSON.
  • SDK Near‑Cache (optional) + soft TTL to bound staleness without thundering herds.

Cache Patterns

| Concern | Pattern |
| --- | --- |
| Freshness | Pub/Sub channel ecs:invalidate:{tenantId} from Refresh Orchestrator; SDKs subscribe (WebSocket) or poll with backoff. |
| Warm‑up | On new head: background warmers push into Redis (read‑through fallback to CRDB on miss). |
| Stampede control | Singleflight with Redis SET key lock NX PX=<ms>; losers await pub/sub invalidate. |
| Multi‑region | Writes publish region‑scoped invalidations; cross‑region mirror via replication stream (or event bus) for global tenants. |
| Security | Values include HMAC(content_hash); SDKs verify before use. |
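
A sketch of the value-MAC check from the table above, assuming the cached envelope carries the content hash plus an HMAC-SHA256 computed with a per-tenant secret (names illustrative):

using System;
using System.Security.Cryptography;
using System.Text;

public static class CacheValueMac
{
    // Producer side (Refresh Orchestrator / cache warmers): MAC the content hash with the tenant secret.
    public static string Sign(string contentHash, byte[] tenantSecret)
    {
        using var hmac = new HMACSHA256(tenantSecret);
        return Convert.ToHexString(hmac.ComputeHash(Encoding.UTF8.GetBytes(contentHash)));
    }

    // Consumer side (SDK): verify before trusting a cached value; constant-time comparison.
    public static bool Verify(string contentHash, string presentedMac, byte[] tenantSecret)
    {
        var expected = Convert.FromHexString(Sign(contentHash, tenantSecret));
        var actual = Convert.FromHexString(presentedMac);
        return CryptographicOperations.FixedTimeEquals(expected, actual);
    }
}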

TTL & Sizing

  • Default TTL = 5–30s (edition‑dependent), max TTL for offline resilience (e.g., 10m).
  • Memory policy: allkeys-lru (Free/Pro), volatile-lru (Enterprise with pinning for critical keys).
  • Per‑tenant quotas enforced by keyspace cardinality + prefix scanning in ops tools.

Partitioning Strategy

By Tenant & Region

  • Every table keyed by tenant_id; hash‑sharded PK prevents hot ranges.
  • REGIONAL BY ROW + crdb_region places data near compute. Rollouts targeting us-east can place derived rows there.

By Time (high‑volume tables)

  • EVENT_AUDIT and rollout telemetry: composite PK (tenant_id, at, event_id) enables time‑bounded range scans and row‑level TTL.
  • Optional monthly export/compact to object storage for long‑term archives.

Large Payloads

  • Keep operational content (<= 128 KB) in CRDB JSONB; store oversized artifacts (e.g., generated diffs > 1 MB) in object storage and reference by URI in provenance.

Retention & Data Lifecycle

| Data class | Default | Edition overrides | Mechanism |
| --- | --- | --- | --- |
| Head versions (is_head = true) | Keep indefinitely | none | |
| Historical config versions | 365 days | Enterprise: infinite / policy based | Soft policy + archival export |
| Snapshots (pre/post rollout) | 180 days | Enterprise: 730 days | CRDB row‑level TTL on ttl_expires_at |
| Audit events, access logs | 90 days | Enterprise: 365 days | TTL + periodic export (Parquet) |
| Diff artifacts | 90 days | Pro/Ent: 180 days | TTL |

Implementation

  • Set ttl_expires_at = now() + INTERVAL 'N days' by class & edition.
  • Nightly export of expiring partitions to object storage (Parquet) before purge.

Backup & Restore (RPO/RTO)

Objectives

  • RPO ≤ 5 minutes (Enterprise), ≤ 15 minutes (Pro), ≤ 24 hours (Free).
  • RTO ≤ 30 minutes for tenant‑scoped table restore; ≤ 2 hours full cluster DR (Enterprise).

Strategy

  • Cluster‑wide scheduled backups:
    • Weekly full + hourly incremental to cloud bucket (regional replica buckets for locality).
    • Encryption with cloud KMS; rotate keys quarterly.
  • Changefeeds (optional) for external archival/analytics sinks.
  • PITR: CRDB protected timestamps on the backup schedule to enable point‑in‑time restore.
  • Tenant‑scoped restore: use export/restore with tenant_id predicate via AS OF SYSTEM TIME + SELECT INTO staging + validated merge (playbook below).

Operational Runbook (excerpt)

  1. Incident triage: identify tenant/environment, time window, affected tables.
  2. Quarantine writes: toggle tenant read‑only via Policy Engine; invalidate Redis keys.
  3. Staging restore: create temporary DB from the latest full+incremental or PITR to the timestamp.
  4. Diff & verify: compare CONFIG_VERSION by content_hash; verify signatures/HMAC; rehearse on staging.
  5. Merge: upsert corrected rows by (tenant_id, set_id, semver) into prod; rebuild is_head where needed (transaction).
  6. Warm cache: repopulate Redis; emit RefreshRequested events.
  7. Close incident: re‑enable writes; attach audit and post‑mortem.

Observability & Capacity Planning

Key Metrics

  • CRDB: sql_bytes, ranges, qps, txn_restarts, kv.raft.process.logcommit.latency, per‑region leaseholder distribution.
  • Redis: hits/misses, evictions, blocked_clients, latency, keyspace_per_tenant.
  • ECS: cache hit ratio (SDK), p95 config fetch latency, rollout convergence time.

Back‑of‑Envelope Sizing (initial)

  • Avg config payload: 8–32 KB (compressed on wire).
  • Tenant cardinality: N_tenants; each with ~A apps × E env × S sets × V versions.
  • Storage ≈ N * A * E * S * V * 24 KB + indexes (~1.6×). Start with 3‑region, 9‑node CRDB (3 per region, n2-standard-8 class) and 3× Redis shards per region.
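
As a purely illustrative plug-in of numbers: 200 tenants × 3 apps × 3 environments × 5 sets × 100 retained versions ≈ 900,000 versions; at 24 KB each that is ≈ 21.6 GB of payload, or roughly 35 GB once the ~1.6× index overhead is included.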

Integration with Refresh Orchestrator

  • On CONFIG_VERSION commit (is_head=true flip), emit ConfigHeadChanged event.
  • Orchestrator:
    1. Upsert compiled payload to Redis {tenant} shard in write region.
    2. Publish ecs:invalidate:{tenantId} with (setKey, semver, content_hash).
    3. For canary/waves, publish scoped invalidations using {selector}.

Security Considerations

  • Row‑level scoping by tenant_id enforced in all data access paths; Gateway injects claims → service verifies.
  • At‑rest encryption (CRDB/TDE if available) + backup encryption with KMS.
  • Redis: TLS, AUTH/ACLs, key prefix isolation, and value MAC (HMAC of content_hash + tenant secret) validated by SDKs.

Solution Architect Notes

  • If config content > 128 KB per set becomes common, move content to object storage and store pointers + hashes in CRDB; keep small materialized views for hot paths in Redis.
  • Validate whether per‑tenant PITR is required for Free/Pro; if not, restrict to Enterprise to control storage cost.
  • Decide diff strategy default (jsonpatch vs semantic) per SDK’s needs; semantic tends to be friendlier for rollout previews.

Acceptance Criteria (engineering hand‑off)

  • DDL migrations emitted with multi‑region locality, hash‑sharded PKs, and required indexes.
  • Row‑level TTL configured for audit / diff tables with edition‑aware durations.
  • Redis cluster charts (Helm) with per‑region topology, metrics, and alerting; key naming spec published to SDK teams.
  • Backup schedules, KMS bindings, and restore playbook present in runbooks; table‑level rehearsal completed in staging.
  • Telemetry dashboards showing p95 CRUD, cache hit %, restore drill duration, and rollout convergence.

Eventing & Change Propagation — CloudEvents, topics, delivery semantics, DLQs, replay; MassTransit conventions

Objectives

Define the event model and propagation pipeline used by ECS to notify SDKs/services of configuration changes with predictable semantics:

  • CloudEvents 1.0 envelopes (JSON structured-mode).
  • Clear topic map & naming.
  • At‑least‑once delivery with idempotency, ordering keys, DLQs, and replay.
  • MassTransit conventions for Azure Service Bus (default) and RabbitMQ (alt).

Event Envelope (CloudEvents)

Transport: JSON (structured mode) over AMQP (ASB) or AMQP 0‑9‑1 (Rabbit); HTTP/gRPC use binary mode when needed.

Required attributes

  • specversion: "1.0"
  • id: string — globally unique per event
  • source: "ecs://{region}/{service}/{resource}" — e.g., ecs://eu-central/config-registry/config-sets/4b...
  • type: "ecs.{domain}.v1.{EventName}" — e.g., ecs.config.v1.ConfigPublished
  • subject: "{tenantId}/{pathOrSetId}" — routing hint for consumers
  • time: RFC3339

Extensions (multi‑tenant & lineage)

  • tenantid: string
  • editionid: string
  • environment: string (dev|test|staging|prod)
  • etag: string
  • version: string (snapshot or semver)
  • correlationid: string (propagated from API)
  • actor: string (sub@iss)
  • region: string (emit region)

Example (structured)

{
  "specversion": "1.0",
  "id": "e-9f1b2f7f-b49e-4f5a-90d0-93a7a6a4b0ef",
  "source": "ecs://eu-central/config-registry/config-sets/4b1c",
  "type": "ecs.config.v1.ConfigPublished",
  "subject": "tnt-7d2a/apps/billing/sets/runtime",
  "time": "2025-08-24T09:12:55Z",
  "tenantid": "tnt-7d2a",
  "editionid": "enterprise",
  "environment": "prod",
  "version": "v1.8.0",
  "etag": "9oM1hQ...",
  "correlationid": "c-77f2...",
  "region": "eu-central",
  "data": {
    "setId": "4b1c...",
    "path": "apps/billing/**",
    "changes": { "breaking": 0, "additive": 3, "neutral": 1 }
  }
}

Topic Map & Naming

Exchange/Topic names (canonical, kebab-case):

  • ecs.config.events — lifecycle of config sets & versions
    • ecs.config.v1.ConfigDraftSaved
    • ecs.config.v1.ConfigPublished
    • ecs.config.v1.ConfigRolledBack
  • ecs.policy.events — policy/edition/schema changes
    • ecs.policy.v1.PolicyUpdated
  • ecs.refresh.events — cache targets & client refresh signals
    • ecs.refresh.v1.CacheInvalidated
    • ecs.refresh.v1.RefreshRequested
  • ecs.adapter.events — external provider sync results
    • ecs.adapter.v1.SyncCompleted
    • ecs.adapter.v1.SyncFailed

Routing/partition keys

  • Primary key: tenantid
  • Secondary (optional): pathPrefix or setId
  • Ensures per‑tenant ordering and hot-tenant isolation.

Queue/Subscription naming (MassTransit)

  • Consumer queues: ecs-{service}-{consumer}-{env} (e.g., ecs-refresh-orchestrator-configpublished-prod)
  • Error/DLQ: {queue}_error, {queue}_skipped (parking lot)
  • Scheduler (delayed): ecs-scheduler (uses ASB/Rabbit delayed delivery plugin where available)

Delivery Semantics

| Property | Choice | Rationale |
| --- | --- | --- |
| Delivery | At‑least‑once | Simpler guarantees; consumers must be idempotent |
| Ordering | Best‑effort global, ordered per (tenantid, key) | Partitioning by tenant keeps most flows ordered |
| Idempotency | Required at consumers | Dedupe by id (event store) or (tenantid, path, etag) |
| Visibility | Structured CloudEvents | Uniform across bus/HTTP |
| Fan‑out | Topic → consumer queues | Decoupled consumers; backpressure per queue |

Consumer idempotency keys

  • Config state changes: (tenantid, setId, version|etag)
  • Cache invalidation: (tenantid, path, etag)
  • Adapter sync: (tenantid, provider, cursor)

Retries, DLQs, and Parking Lots

Retry policy (MassTransit middleware)

  • Immediate retries: 3
  • Exponential retries: 5 attempts, 2s → 1m (jitter)
  • Circuit‑break: open after 50 failures/60s; half-open after 30s

DLQ

  • Poison messages route to {queue}_error with headers:
    • mt-fault-message, mt-reason, mt-host, mt-exception-type, stacktrace
    • CloudEvents attributes echoed for triage (tenantid, type, id)
  • Parking lot {queue}_skipped for known non‑actionable events (e.g., stale versions), enabling manual replay later.

Operational actions

  • Retry single message: move from DLQ → main queue
  • Bulk reprocess: export DLQ to blob, filter, re‑enqueue via Replay Worker

Replay Strategy

Sources

  1. Outbox/Audit log (authoritative): every emitted event recorded with status=Emitted|Pending.
  2. Broker retention (short): not guaranteed for long windows → rely on outbox.

Replay worker

  • Input: time/tenant filter or cursor range
  • Reads outbox, re‑emits CloudEvents with new id and data.replayOf="<original-id>" (to avoid consumer dedupe drop)
  • Throttled per tenant; writes operator annotations in meta (reason: "replay", ticket: ...)

Consumer expectations

  • Treat replayOf as informational; still dedupe on current id
  • Business logic must tolerate duplicate state transitions

Outbox & Inbox Patterns

Outbox (publisher side, e.g., Config Registry)

  • Within the same transaction as state change:
    • Append Outbox(tenantid, type, subject, data, eventId, status='Pending')
  • Background dispatcher (MassTransit) reads Pending, publishes, marks Emitted
  • Guarantees atomicity between DB write and event publication

Inbox (consumer side, optional)

  • Table Inbox(eventId, consumer, processedAt)
  • Before handling, check presence; after success, insert → ensures exactly‑once processing per consumer
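
A sketch of a consumer applying the inbox check; IInboxStore is a hypothetical abstraction over the Inbox table above, and the bus MessageId stands in for the CloudEvents id:

using System;
using System.Threading.Tasks;
using MassTransit;

public record ConfigPublished(string SetId, string Path, string SnapshotId, string Version, string Etag);

// Hypothetical abstraction over Inbox(eventId, consumer, processedAt); returns false if already processed.
public interface IInboxStore
{
    Task<bool> TryMarkProcessedAsync(Guid eventId, string consumer);
}

public class ConfigPublishedConsumer : IConsumer<ConfigPublished>
{
    private readonly IInboxStore _inbox;
    public ConfigPublishedConsumer(IInboxStore inbox) => _inbox = inbox;

    public async Task Consume(ConsumeContext<ConfigPublished> context)
    {
        var eventId = context.MessageId ?? Guid.NewGuid();
        if (!await _inbox.TryMarkProcessedAsync(eventId, nameof(ConfigPublishedConsumer)))
            return; // duplicate delivery: at-least-once transport, exactly-once effect per consumer

        // Business logic: fan out cache invalidation / refresh keyed by (tenantid, setId, etag).
        await Task.CompletedTask;
    }
}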

MassTransit Conventions (ASB default, Rabbit alt)

Common

  • EntityNameFormatter = kebab-case
  • Endpoint per message type disabled; use explicit endpoints per service
  • Prefetch tuned per consumer: start = 32–128, adjust by p99
  • ConcurrentMessageLimit per handler: start = CPU cores * 2
  • Observability: OpenTelemetry enabled; activity name = ecs.{messageType}; tags: tenantid, editionid, environment, etype

Azure Service Bus

cfg.UsingAzureServiceBus((context, sb) =>
{
  sb.Host(connStr);
  sb.Message<ConfigPublished>(m => m.SetEntityName("ecs.config.events"));
  sb.SubscriptionEndpoint<ConfigPublished>("ecs-refresh-orchestrator-configpublished-prod", e =>
  {
    e.ConfigureConsumer<ConfigPublishedConsumer>(context);
    e.PrefetchCount = 128;
    e.MaxAutoRenewDuration = TimeSpan.FromMinutes(5);
    e.EnableDeadLetteringOnMessageExpiration = true;
    e.UseMessageRetry(r => r.Exponential(5, TimeSpan.FromSeconds(2), TimeSpan.FromMinutes(1), TimeSpan.FromSeconds(3)));
    e.UseCircuitBreaker(cb => cb.ResetInterval = TimeSpan.FromSeconds(30));
    e.UseInMemoryOutbox(); // plus persistent Outbox at publisher
  });
});

RabbitMQ

cfg.UsingRabbitMq((context, rmq) =>
{
  rmq.Host(host, "/", h => { h.Username(user); h.Password(pass); });
  rmq.Message<ConfigPublished>(m => m.SetEntityName("ecs.config.events"));
  rmq.ReceiveEndpoint("ecs-refresh-orchestrator-configpublished-prod", e =>
  {
    e.Bind("ecs.config.events", x => { x.RoutingKey = "ecs.config.v1.ConfigPublished"; x.ExchangeType = ExchangeType.Topic; });
    e.PrefetchCount = 64;
    e.UseMessageRetry(r => r.Exponential(5, TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(60), TimeSpan.FromSeconds(3)));
    e.UseDelayedRedelivery(r => r.Intervals(TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(30)));
    e.UseInMemoryOutbox();
  });
});

Standard Event Contracts (data payloads)

ecs.config.v1.ConfigPublished

{
  "setId": "4b1c...",
  "path": "apps/billing/**",
  "snapshotId": "S-00F",
  "version": "v1.8.0",
  "etag": "9oM1hQ...",
  "policy": { "breaking": false, "violations": [] }
}

ecs.refresh.v1.CacheInvalidated

{
  "scope": { "path": "apps/billing/**", "environment": "prod" },
  "targets": ["redis", "sdk"],
  "reason": "publish",
  "etag": "9oM1hQ..."
}

ecs.adapter.v1.SyncCompleted

{
  "provider": "azure-appconfig",
  "cursor": "w/1724490000",
  "items": 128,
  "changes": { "upserts": 120, "deletes": 8 }
}

End‑to‑End Flow (publish → refresh)

sequenceDiagram
  participant UI as Studio
  participant API as Registry
  participant OB as Outbox
  participant BUS as Event Bus
  participant ORC as Refresh Orchestrator
  participant RED as Redis
  participant SDK as Client SDK

  UI->>API: POST /config-sets/{id}/snapshots (publish)
  API-->>OB: tx write + outbox pending(ConfigPublished)
  API-->>UI: 201 Snapshot/Version
  OB-->>BUS: publish CloudEvent (ConfigPublished)
  BUS-->>ORC: deliver
  ORC->>RED: scoped invalidation
  ORC-->>BUS: publish CacheInvalidated
  BUS-->>SDK: deliver (via WS bridge) / notify
  SDK->>API: Resolve(etag) → 200/304

Backpressure & Throttling

  • Per‑tenant quotas (see Gateway cycle); event consumers must:
    • Pause on RESOURCE_EXHAUSTED (MassTransit retry with backoff)
    • Limit in‑flight resolves on refresh storms (coalesce by (tenantid,path) within 500 ms window)
  • Orchestrator maintains bounded buffers; on overflow, drops to parking lot and raises RefreshDegradation alert.

Security

  • Bus credentials scoped to producer/consumer roles (least privilege).
  • CloudEvents actor and tenantid are validated from JWT at publisher; consumers must not trust unvalidated extensions from third parties.
  • Events avoid embedding secret values; only hashes/refs.

Observability & Alerting

Spans

  • ecs.publish, ecs.invalidate, ecs.replay
  • Attributes: tenantid, event.type, queue, delivery.attempt

Metrics

  • Publish success rate, end‑to‑end propagation lag (publish→SDK resolve), DLQ size, replay throughput
  • Consumer handler p95/p99, retries, circuit state

Alerts

  • PropagationLagP95>5s (5m)
  • DLQSize>100 (10m) per queue
  • ReplayFailureRate>1%

Acceptance Criteria (engineering hand‑off)

  • MassTransit bus configuration (ASB + Rabbit) with kebab-case entities, retries, DLQs, OTEL.
  • CloudEvents envelope library (shared) with validation and extensions.
  • Outbox dispatcher and Replay worker with CLI & Studio hooks.
  • Contract tests: idempotency, partition ordering, DLQ/replay, propagation lag budget.
  • Runbooks: DLQ triage, targeted replay, tenant backpressure override.

Solution Architect Notes

  • Start with per‑tenant partitioning; introduce path‑sharded partitions if a few tenants dominate traffic.
  • Consider compaction (e.g., Kafka) when adding an analytics/event‑sourcing lane; current MVP favors ASB/Rabbit simplicity.
  • Validate mobile clients via server‑side WS bridge that consumes ecs.refresh.events and fans out over WebSocket/SSE with CloudEvents.

Refresh & Invalidation Flows — SDK pull/long‑poll/websocket, server push, cache stampede protection, ETags

Objectives

Design the end‑to‑end cache refresh and invalidation path that keeps SDKs and services current within seconds, while protecting the platform from thundering herds and ensuring multi‑tenant isolation.

Outcomes

  • Three client models: periodic pull, long‑poll, WebSocket/SSE push.
  • Server‑side targeted invalidation and coalesced refresh.
  • ETag/conditional fetch as the primary coherency mechanism (304/not_modified).
  • Stampede control at SDK and edge with singleflight + distributed locks and stale‑while‑revalidate.

Control Plane vs Data Plane

| Plane | Responsibility | Tech |
| --- | --- | --- |
| Control | Propagate change intent | CloudEvents (ConfigPublished, CacheInvalidated) |
| Data | Deliver effect (new config) | REST GET /resolve (ETag) and gRPC Resolve/ResolveBatch |

Principle: Control plane never carries config values; clients always refetch using ETag‑aware reads.


Client Models

1) Periodic Pull (baseline)

  • SDK timer fetches with If-None-Match: <etag> every T seconds (edition‑aware default: Starter=30s, Pro=10s, Enterprise=5s).
  • Pros: simplest, firewall‑friendly.
  • Cons: higher background traffic; longer staleness.

2) Long‑Poll

  • SDK calls /resolve?waitForChange=true&timeout=30s (or gRPC Resolve with deadline).
  • Server blocks until a new ETag or timeout → 200 with value or 304 Not Modified.
  • Pros: low idle traffic, near‑real‑time without persistent sockets.

3) WebSocket / SSE Push (premium)

  • SDK subscribes to Refresh Channel (WebSocket or gRPC streaming).
  • Server pushes RefreshEvent(etag, path, cursor); SDK revalidates via Resolve.
  • Pros: sub‑second fanout, very low latency.
  • Cons: long‑lived connections; need heartbeats and resumption.

Flow Diagrams

A) Publish → Fanout → Client Revalidate

sequenceDiagram
  participant Studio as Studio UI
  participant Registry as Config Registry
  participant Orchestrator as Refresh Orchestrator
  participant Redis as Redis Cache
  participant SDK as App SDK

  Studio->>Registry: Publish snapshot (idempotent)
  Registry-->>Redis: DEL tenant:path:* (scoped invalidation)
  Registry->>Orchestrator: emit ConfigPublished(tenant, path, etag)
  Orchestrator-->>SDK: push RefreshEvent(tenant, path, etag) [WS/L.Poll wake]
  SDK->>Registry: Resolve(path, If-None-Match: oldEtag)
  alt New ETag
    Registry-->>SDK: 200 value + ETag(new)
    SDK-->>SDK: Update L1 cache
  else No change
    Registry-->>SDK: 304 Not Modified
  end

B) Long‑Poll Resolve (HTTP)

sequenceDiagram
  participant SDK
  participant GW as API Gateway
  participant API as Resolve API
  SDK->>GW: GET /resolve?path=...&waitForChange=true&timeout=30 (If-None-Match: etag)
  GW->>API: forward (deadline=30s)
  API-->>API: await new etag OR timeout (register waiter keyed by tenant+path)
  alt etag changed
    API-->>SDK: 200 value + ETag
  else timeout
    API-->>SDK: 304 Not Modified
  end

C) WebSocket Refresh (server push)

sequenceDiagram
  participant SDK
  participant Push as Refresh WS Bridge
  SDK->>Push: WS CONNECT /ws/refresh (x-tenant-id, JWT)
  Push-->>SDK: HEARTBEAT 30s
  Push-->>SDK: RefreshEvent(path, etag, cursor)
  SDK->>Push: ACK cursor (bidi) OR implicit via next pull
  SDK->>API: Resolve(If-None-Match: etag)

Server‑Side Invalidation & Coalescing

Targets

  • Primary: Redis keys ecs:{tenant}:{env}:{set}:{path}
  • Secondary: In‑process L2 cache (optional) invalidated by local event bus.

Coalescing

  • The Resolve API maintains a waiter map per (tenant, path):
    • On publish, it wakes all waiters and debounces new waiters for 200–500 ms to batch misses.
  • The Orchestrator emits single refresh signals per (tenant, path, etag) and coalesces bursts within 250 ms windows.
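
An illustrative waiter registry for the long‑poll path: a handler awaits a new ETag for (tenant, path) or times out (answering 304), and a publish wakes every registered waiter; debouncing and list cleanup are simplified:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class WaiterRegistry
{
    private readonly ConcurrentDictionary<(string Tenant, string Path), List<TaskCompletionSource<string?>>> _waiters = new();

    // Long-poll handler: returns the new ETag, or null on timeout (caller responds 304).
    public async Task<string?> WaitForChangeAsync(string tenant, string path, TimeSpan timeout, CancellationToken ct)
    {
        var tcs = new TaskCompletionSource<string?>(TaskCreationOptions.RunContinuationsAsynchronously);
        var list = _waiters.GetOrAdd((tenant, path), _ => new List<TaskCompletionSource<string?>>());
        lock (list) list.Add(tcs);

        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        cts.CancelAfter(timeout);
        using var registration = cts.Token.Register(() => tcs.TrySetResult(null));

        var etag = await tcs.Task;
        lock (list) list.Remove(tcs);
        return etag;
    }

    // Publish path: wake every waiter registered for (tenant, path) with the new ETag.
    public void Publish(string tenant, string path, string newEtag)
    {
        if (!_waiters.TryGetValue((tenant, path), out var list)) return;
        TaskCompletionSource<string?>[] snapshot;
        lock (list) snapshot = list.ToArray();
        foreach (var tcs in snapshot) tcs.TrySetResult(newEtag);
    }
}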

SDK Caching & ETag Strategy

Cache layers

  • L1 (in‑process) with soft TTL and ETag index.
  • L2 (Redis) on server‑side when SDK sits behind a service (optional; SDK still resolves via API).

ETag rules

  • Treat the ETag as a strong validator of the materialized config (secrets remain references).
  • Always send If-None-Match on Resolve.
  • Respect not_modified/304 to preserve L1 value and extend soft TTL.

Suggested SDK algorithm (pseudo)

function getConfig(path, ctx):
  key = (tenant, ctx, path)
  entry = L1.get(key)
  if entry && entry.fresh(): return entry.value

  // singleflight per key to prevent stampede
  v = singleflight(key, () => {
      etag = entry?.etag
      resp = Resolve(path, ctx, if_none_match=etag, deadline=250ms)
      if resp.not_modified: 
         entry.touch()
         return entry.value
      else:
         L1.put(key, resp.resolved, resp.etag, soft_ttl(ctx))
         return resp.resolved
  })

  return v

Stampede Protection

| Layer | Technique | Notes |
| --- | --- | --- |
| SDK | singleflight per (tenant, ctx, path) | Collapse concurrent calls inside a process. |
| API | request collapsing + memoization for active resolves | Reuse upstream result for waiters. |
| Redis | distributed lock on cache fill (SET NX PX) | Timeout ≤ 250 ms; losers get stale‑while‑revalidate. |
| TTL | stale‑while‑revalidate (SWR) | Serve stale for ≤ 2s while a single refresher fetches. |
| Jitter | TTL jitter ±10–20% | Spread expirations to avoid sync spikes. |
| Backoff | jittered exponential after 429/503 | Max backoff 2s for reads. |

Negative caching: For consistent 404 resolves, cache NEGATIVE(etag) for short TTL (≤ 5s) to avoid hammering.
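
A per-key singleflight helper of the kind referenced in the stampede table above: concurrent callers for the same key share one in-flight fetch and its result (a sketch, not the shipped SDK utility):

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class SingleFlight<TKey, TValue> where TKey : notnull
{
    private readonly ConcurrentDictionary<TKey, Lazy<Task<TValue>>> _inFlight = new();

    // All concurrent callers for the same key await the same underlying task.
    public async Task<TValue> RunAsync(TKey key, Func<Task<TValue>> fetch)
    {
        var lazy = _inFlight.GetOrAdd(key,
            _ => new Lazy<Task<TValue>>(fetch, LazyThreadSafetyMode.ExecutionAndPublication));
        try
        {
            return await lazy.Value;
        }
        finally
        {
            // Remove after completion so the next stale or expired read triggers a fresh fetch.
            _inFlight.TryRemove(key, out _);
        }
    }
}

In an SDK, the conditional Resolve call (If-None-Match with the cached ETag) would be wrapped in RunAsync keyed by (tenant, ctx, path).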


Long‑Poll API Contract (REST)

GET /api/v1/resolve?path={p}&env={e}&version=latest&waitForChange=true&timeout=30
Headers:
  If-None-Match: "<etag>"
Responses:
  200 OK            body: value, headers: ETag: "<new>"
  304 Not Modified  (on timeout or same etag)

Server behavior

  • Max timeout = 30s (configurable).
  • Registers waiter; returns 304 on timeout to allow client to re‑issue without breaking caches.

WebSocket/SSE Contract (HTTP)

Handshake

  • GET /ws/refresh (WS) or GET /sse/refresh (SSE) with Authorization and x-tenant-id.

Messages (JSON)

// server -> client
{ "type":"refresh", "cursor":"c-12345", "path":"apps/billing/**", "etag":"9oM1hQ..." }
{ "type":"heartbeat", "ts":"2025-08-24T11:00:00Z" }
{ "type":"nack", "code":"ECS.SUBSCRIPTION.INVALID", "retryAfterMs":2000 }

// client -> server (WS only)
{ "type":"ack", "cursor":"c-12345" }

Resumption

  • Client reconnects with ?resumeAfter=c-<cursor> to avoid gaps; server backfills from replay window.

gRPC Alignment (from Contracts cycle)

  • ResolveRequest.if_none_match_etag → ResolveResponse.not_modified/etag.
  • RefreshChannel.Subscribe/SubscribeWithAck provide the push lane; SDKs always revalidate with Resolve.

Health, Heartbeats & Timeouts

  • WS/SSE heartbeat every 15–30s; close on 3 missed heartbeats.
  • Long‑poll deadline recommended = timeout + 2s.
  • SDK should rotate tokens before expiry; reconnect on UNAUTHENTICATED/401.

Rate Governance & Flood Safety

  • Gateway applies per‑tenant read QPS limits (edition‑aware).
  • Orchestrator coalesces and drops duplicates within a 250 ms window; emits a single refresh per ETag.
  • Clients must obey Retry-After on 429 and reduce concurrent resolves (cap to 2 in‑flight per process).

Failure Modes & Degradation Paths

| Failure | Client Behavior | Server Behavior |
| --- | --- | --- |
| WS bridge down | Fallback to long‑poll; increase T by ×2 | Auto‑heal; drain buffers; send replayOf on resume |
| Event bus lag | Continue periodic pull; widen interval by +50% | Alert; protect Redis with SWR |
| Redis unavailable | Resolve from DB (higher latency) | Bypass Redis; apply per‑tenant circuit breaker |
| ETag mismatch loops | Full resolve without If-None-Match once; then resume ETag path | Log and re‑materialize canonical cache line |
| Rate‑limit 429 | Jittered backoff (100–500 ms) | Emit RateLimitExceeded audit; expose limits in headers |

Observability

SDK emits

  • ecs.sdk.resolve.latency (p50/p95/p99), cache.hit/miss, lp.wakeups, ws.reconnects, etag.rotate.count.

Server dashboards

  • Propagation lag (publish → first 200 Resolve)
  • Waiter pool size & coalescing ratio
  • Redis lock wait time & lock contention
  • Long‑poll saturation (% requests returning 304 on timeout)

Acceptance Criteria (engineering hand‑off)

  • Implement waiter registry with safe cancellation; unit tests for coalescing and timeout behavior.
  • Add singleflight utility in SDKs (.NET/JS/Mobile) with per‑key granularity.
  • Redis invalidation with scoped keys and distributed lock on fill.
  • WS/SSE bridge with heartbeats, resume tokens, per‑tenant connection caps.
  • End‑to‑end tests: publish → SDK updates within <5s (Pro/Ent), <15s (Starter) under load, no stampede.

Solution Architect Notes

  • Prefer long‑poll as default mode across SDKs; enable WS behind a feature flag per tenant/edition.
  • Keep SWR≤2s to bound staleness while avoiding spikes.
  • Consider a per‑tenant, per‑path moving window to limit redundant resolves (e.g., ignore duplicate refresh events for 200 ms).
  • For mobile, use SSE when WS is blocked; keep battery impact minimal with adaptive backoff.

Provider Adapter Hub – Plug‑in Model, Contracts & Lifecycles

This section specifies the Provider Adapter Hub that bridges ECS with external configuration backends (Azure App Configuration, AWS AppConfig, Consul, Redis, SQL/CockroachDB). It defines the plug‑in model, capability contracts, adapter lifecycles, multi‑tenant isolation, and operational guarantees so Engineering Agents can implement adapters consistently and safely.


Architectural Context

flowchart LR
  subgraph ECS Core
    Hub[Adapter Hub]
    Registry[Config Registry]
    Policy[Policy Engine]
    Refresh[Refresh Orchestrator]
    EventBus[[Event Bus (CloudEvents)]]
  end

  subgraph Adapters (Out-of-Process Plugins)
    AAZ[Azure AppConfig Adapter]
    AAW[AWS AppConfig Adapter]
    ACS[Consul KV Adapter]
    ARD[Redis Adapter]
    ASQL[SQL/CockroachDB Adapter]
  end

  Hub <-- gRPC SPI --> AAZ
  Hub <-- gRPC SPI --> AAW
  Hub <-- gRPC SPI --> ACS
  Hub <-- gRPC SPI --> ARD
  Hub <-- gRPC SPI --> ASQL

  Registry <--> Hub
  Policy <--> Hub
  Refresh <--> Hub
  EventBus <--> Hub

Design stance

  • Out‑of‑process adapters run as isolated containers (per provider) and expose a gRPC Service Provider Interface (SPI) to the Hub.
  • Capability-driven: Each adapter declares supported feature flags (e.g., hierarchical keys, ETag, watch/stream, transactions).
  • Multi‑tenant aware: One adapter process may serve multiple tenants via namespaced bindings with per-tenant credentials.
  • Event-first: All changes propagate as CloudEvents, with at‑least‑once delivery and idempotency keys.

Plug‑in Packaging & Discovery

| Aspect | Requirement |
| --- | --- |
| Image Layout | OCI container, label io.connectsoft.ecs.adapter=true with adapter.id, adapter.version, capabilities annotations |
| SPI Transport | gRPC over mTLS (cluster DNS). Optional Unix domain socket for sidecar deployments. |
| Registration | Adapter posts a Manifest to Hub on startup (Register()); Hub persists it in the Registry. |
| Upgrades | Rolling upgrade supported; Hub revalidates capabilities and drains in‑flight calls. |
| Tenancy | Dynamic Bindings created per tenant/environment via Hub API; Hub passes scoped credentials to adapter using short‑lived secrets. |

Adapter Manifest (example)

apiVersion: ecs.connectsoft.io/v1alpha1
kind: AdapterManifest
metadata:
  adapterId: azure-appconfig
  version: 1.4.2
spec:
  provider: AzureAppConfig
  capabilities:
    hierarchicalKeys: true
    etagSupport: true
    watchStreaming: true
    transactions: false
    bulkList: true
    tags: ["labels","contentType"]
  inputs:
    - name: connection
      type: secretRef        # provided by Hub at Bind time
    - name: storeName
      type: string
  limits:
    maxKeySize: 1024
    maxValueSize: 1048576
    maxBatch: 500
  events:
    emits: ["ExternalChangeObserved","SyncJobCompleted"]
    consumes: ["ConfigPublished","BindingRotated"]
security:
  requiredScopes: ["kv.read","kv.write"]
  secretTypes: ["azure:clientCredentials"]

gRPC SPI (Service Provider Interface)

Proto (excerpt)

syntax = "proto3";
package ecs.adapters.v1;

import "google/protobuf/empty.proto";

message BindingRef {
  string tenant_id = 1;
  string environment = 2;     // dev|staging|prod
  string namespace = 3;       // app/service scope
  string binding_id = 4;      // unique per tenant+namespace
}

message CredentialEnvelope {
  string provider = 1;        // e.g., AzureAppConfig
  bytes  payload = 2;         // KMS-encrypted secret blob
  string kms_key_ref = 3;
  string version = 4;
  int64  not_after_unix = 5;  // lease expiry
}

message BindRequest {
  BindingRef binding = 1;
  CredentialEnvelope credential = 2;
  map<string,string> options = 3; // e.g., storeName, region, prefix
}

message Key {
  string path = 1;     // normalized ECS path: /apps/{app}/env/{env}/[...]
  string label = 2;    // provider-specific label/namespace
}

message Item {
  Key key = 1;
  bytes value = 2;     // arbitrary bytes (JSON, text, binary)
  string content_type = 3;
  string etag = 4;     // provider etag/version token
  map<string,string> meta = 5;
}

message GetRequest { BindingRef binding = 1; Key key = 2; string if_none_match = 3; }
message GetResponse { Item item = 1; bool not_modified = 2; }

message PutRequest {
  BindingRef binding = 1;
  Item item = 2;
  string if_match = 3;      // CAS on etag
  bool upsert = 4;
  string idempotency_key = 5;
}
message PutResponse { Item item = 1; }

message ListRequest { BindingRef binding = 1; string prefix = 2; int32 page_size = 3; string page_token = 4; }
message ListResponse { repeated Item items = 1; string next_page_token = 2; }

message DeleteRequest { BindingRef binding = 1; Key key = 2; string if_match = 3; string idempotency_key = 4; }
message DeleteResponse {}

message WatchRequest { BindingRef binding = 1; string prefix = 2; string resume_token = 3; }
message ChangeEvent {
  string change_id = 1;     // for idempotency
  string type = 2;          // upsert|delete
  Item item = 3;
  string resume_token = 4;  // bookmark for stream resumption
}

service ProviderAdapter {
  rpc RegisterManifest(google.protobuf.Empty) returns (AdapterInfo);
  rpc Bind(BindRequest) returns (BindingAck);
  rpc Validate(BindRequest) returns (ValidationResult);

  rpc Get(GetRequest) returns (GetResponse);
  rpc Put(PutRequest) returns (PutResponse);
  rpc Delete(DeleteRequest) returns (DeleteResponse);
  rpc List(ListRequest) returns (ListResponse);

  rpc Watch(WatchRequest) returns (stream ChangeEvent); // server-streamed
  rpc Health(google.protobuf.Empty) returns (HealthStatus);
  rpc Unbind(BindingRef) returns (google.protobuf.Empty);
}

Contract notes

  • Idempotency: idempotency_key required for Put/Delete. Adapters must persist idempotency window (e.g., 24h) to dedupe retries.
  • ETag/CAS: Use if_match/if_none_match for concurrency control where provider supports it; otherwise emulate with version vectors.
  • Pagination: mandatory stable ordering by key path; opaque page_token.

Binding & Namespacing

Each tenant/environment maps to a provider namespace:

| Provider | Namespace Strategy |
| --- | --- |
| Azure AppConfig | key = {prefix}:{app}:{env}:{path}, label = {namespace} (labels for environment or slot) |
| AWS AppConfig | Application/Profile/Environment map to ECS namespace/env; versions map to deployments |
| Consul KV | /{tenant}/{env}/{namespace}/{path} with ACL tokens per binding |
| Redis | key = ecs:{tenant}:{env}:{namespace}:{path}, optional hash for grouping; TTL not used for config |
| CockroachDB (SQL) | Table per tenant partitioned by {env, namespace} with composite PK (key, version) |

Key normalization

  • ECS canonical path: /apps/{app}/env/{env}/services/{svc}/paths/...
  • Adapters implement deterministic mapping and store reverse mapping in meta to support round‑trip and diffs.
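
A sketch of the deterministic mapping for the Azure App Configuration row in the table above; the canonical segment layout and the treatment of the service segment are assumptions for illustration:

using System;

public static class AzureAppConfigKeyMapper
{
    // ECS canonical path: /apps/{app}/env/{env}/services/{svc}/paths/{rest...}
    // Azure App Configuration target: key = {prefix}:{app}:{env}:{path}, label = {namespace}
    public static (string Key, string Label) ToProvider(string canonicalPath, string prefix, string ns)
    {
        var parts = canonicalPath.Trim('/').Split('/');
        if (parts.Length < 8 || parts[0] != "apps" || parts[2] != "env" ||
            parts[4] != "services" || parts[6] != "paths")
            throw new ArgumentException($"Unexpected canonical path: {canonicalPath}");

        var app = parts[1];
        var env = parts[3];
        var rest = string.Join("/", parts[7..]);   // everything after "paths"; the svc segment could be folded in as well
        return ($"{prefix}:{app}:{env}:{rest}", ns);
    }
}

The adapter would persist the reverse mapping in Item.meta so round-trips and diffs stay lossless, as noted above.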

Adapter Lifecycle (State Machine)

stateDiagram-v2
  [*] --> Discovered
  Discovered --> Registered: RegisterManifest()
  Registered --> Bound: Bind()+Validate()
  Bound --> Active: Health(ok)
  Active --> RotatingCreds: BindingRotated
  RotatingCreds --> Active: re-Validate()
  Active --> Draining: Unbind() requested / Upgrade
  Draining --> Unbound: no inflight
  Unbound --> [*]
  Active --> Degraded: Health(warn/fail)
  Degraded --> Active: Auto-recover/Retry
  Degraded --> Draining: Manual intervention

Lifecycle hooks

  • Validate() must perform least‑privilege checks and a dry‑run (list one key, check write if permitted).
  • Unbind() drains Watch streams and completes in-flight mutations; Hub retries if deadlines expire.

Change Propagation & Sync

Flows

  1. ECS → Provider (Publish): Registry emits ConfigPublished → Hub resolves bindings → Put() batch to adapter → on success, Hub emits SyncJobCompleted.

  2. Provider → ECS (Observe): Adapter Watch(prefix) streams ChangeEvent for external drift → Hub translates it to ExternalChangeObserved → Policy Engine validates → Registry applies or flags drift.

sequenceDiagram
  participant Registry
  participant Hub
  participant Adapter as AzureAdapter
  participant Provider as AzureAppConfig

  Registry->>Hub: ConfigPublished(tenant/app/env)
  Hub->>Adapter: Put(batch, idempotency_key)
  Adapter->>Provider: PUT keys (CAS/etag)
  Provider-->>Adapter: 200 + etags
  Adapter-->>Hub: PutResponse(items)
  Hub-->>Registry: SyncJobCompleted

Delivery semantics

  • At‑least‑once end‑to‑end with idempotency on adapter side.
  • Backpressure: Hub enforces max in‑flight per binding and exponential backoff per provider rate limits.
  • DLQ: Failed mutations or change translations publish to ecs.adapters.dlq with replay tokens.

Diffing, Snapshot, Rollback

  • List(prefix) yields provider snapshot; Hub computes 3‑way diff (desired, provider, last‑applied).
  • Rollback uses ECS Versioning to restore prior desired state; Hub re‑publishes with CAS on previous etags where possible.
  • Partial capability: If provider lacks ETag, Hub marks binding non‑concurrent, serializes writes.

Security & Secrets

| Concern | Pattern |
| --- | --- |
| Credentials | Short‑lived CredentialEnvelope issued by Hub via cloud KMS/KeyVault/Secrets Manager; rotate via BindingRotated event. |
| Network | mTLS between Hub↔Adapter; egress to provider via provider’s SDK with TLS 1.2+. |
| Least Privilege | Scoped roles per binding (read-only vs read/write). |
| Auditing | All SPI calls include traceId, tenantId, actor; adapters must log structured spans. |
| Data Privacy | No plaintext secrets in logs; PII scrubbing middleware in adapters. |

Observability & Health

Metrics (per binding)

  • adapter_requests_total{op}
  • adapter_request_duration_ms{op}
  • adapter_throttle_events_total
  • adapter_watch_gaps_total (stream gaps/restarts)
  • adapter_idempotent_dedup_hits_total
  • adapter_errors_total{type=auth|rate|transient|fatal}

Health

  • HealthStatus returns status, since, and checks[] including auth, reachability, quota.

Provider‑Specific Nuances

Azure App Configuration

  • Use label for env/namespace, contentType for metadata.
  • Leverage If‑None‑Match for conditional GET; ETag on PUT/CAS.
  • Watch via push notifications (Event Grid) optional; otherwise adapter polling with ETag bookmarks.

AWS AppConfig

  • Publishing often uses Hosted Config with Deployments; adapter wraps version promotion rather than per‑key PUT.
  • Changes are profile+environment scoped; adapter maps ECS keys to a JSON document (schema‑validated).

Consul KV

  • Native blocking queries support; adapter implements Watch using index token (long‑poll).
  • ACL tokens per binding; transactions (if enabled) can apply atomic batches.

Redis

  • Use keyspace notifications where available; otherwise use SCAN + version fields.
  • Prefer HASH records for grouped configuration with a version field to emulate ETag.

SQL/CockroachDB

  • Schema (simplified):
create table config_items (
  tenant_id    uuid not null,
  environment  text not null,
  namespace    text not null,
  path         text not null,
  version      int8 not null default 1,
  content_type text,
  value        bytea not null,
  etag         text not null,
  updated_at   timestamptz not null default now(),
  primary key (tenant_id, environment, namespace, path)
);
create index on config_items (tenant_id, environment, namespace);
  • ETag computed as sha256(value||version); changes streamed via CDC if enabled.

Error Model

| Code | Meaning | Hub Behavior |
| --- | --- | --- |
| ALREADY_EXISTS | CAS conflict (etag mismatch) | Retry with latest Get() and policy merge |
| FAILED_PRECONDITION | Invalid binding/perm | Mark binding degraded, alert Security |
| RESOURCE_EXHAUSTED | Throttled/quota | Backoff (jitter), adjust batch size |
| UNAVAILABLE | Provider outage | Circuit break per binding; queue for replay |
| DEADLINE_EXCEEDED | Timeout | Retry with exponential backoff; consider reducing page/batch |

All responses must include retry_after_ms hint where applicable.


Rate Limits & Concurrency

  • Hub defaults: 8 concurrent ops per binding, 256 per adapter process.
  • Adapter declares provider limits in Manifest; Hub tunes concurrency dynamically (token bucket).
  • Large publications use chunking with maxBatch from Manifest.

Compliance & Governance

  • Adapters must pass conformance tests:
    • CRUD contract, idempotency, ETag/CAS behavior
    • Watch stream continuity (resume token)
    • Backpressure under synthetic throttling
    • Multi‑tenant isolation (no cross‑leak with wrong credentials)
  • Supply SBOM and sign container images (Sigstore).

Example: Binding Creation (REST)

POST /v1/adapters/azure-appconfig/bindings
Content-Type: application/json
{
  "tenantId": "t-123",
  "environment": "prod",
  "namespace": "billing",
  "options": { "storeName": "appcfg-prod", "prefix": "ecs" },
  "credentialRef": "kv://secrets/azure/appcfg/billing-prod",
  "permissions": ["read","write"]
}

Hub resolves credentialRef → creates CredentialEnvelope → calls Bind() then Validate() on adapter.


Testing Matrix (Adapter Conformance)

| Area | Scenario | Expected |
| --- | --- | --- |
| Idempotency | Retry Put() with same idempotency_key | Exactly one write at provider |
| CAS | Put(if_match=stale_etag) | ALREADY_EXISTS, no mutation |
| Watch | Kill adapter, restart with resume_token | Stream resumes without loss |
| Pagination | List() with small page_size | Complete coverage, stable ordering |
| Drift | Provider-side key modified | ExternalChangeObserved fired |
| Throttle | Provider returns 429/limit | Backoff, no hot‑loop |

Solution Architect Notes

  • Sidecar vs central adapter: For high‑churn namespaces, deploy adapter sidecar to reduce latency; otherwise run shared adapters per provider with strong isolation at binding layer.
  • Schema guarantees: Where providers are document‑style (AWS AppConfig), define a document aggregator in Hub that flattens ECS keyspace into a single JSON with stable ordering and checksum for CAS.
  • Rollout safety: Pair publication with Refresh Orchestrator to gradually push SDK refresh (canary → region → all).
  • Cost & limits: Respect provider rate/quota; tune batch sizes from Manifest during runtime using feedback metrics.

This specification enables Engineering Agents to implement provider adapters with consistent contracts, safe lifecycles, and operational reliability across heterogeneous backends, while preserving ECS’s multi‑tenant isolation and event‑driven change propagation.


SDK Design (.NET / JS / Mobile) – APIs, Caching, Offline, Diagnostics, Feature Flags, Circuit Breakers

This section defines the client SDKs for ECS across .NET, JavaScript/TypeScript, and Mobile (MAUI / React Native). It translates platform-neutral behaviors (auth, tenant routing, caching, refresh, and resiliency) into implementable APIs with consistent cross-language semantics. SDKs are multi-tenant aware, edition aware, and aligned with Clean Architecture and observability-first principles.


Goals & Non-Functional Targets

| Area | Target |
| --- | --- |
| Latency | p95 GetConfig() ≤ 50ms (warm cache), ≤ 250ms (cache miss, intra-region) |
| Availability | 99.95% SDK internal availability (local cache + backoff) |
| Cache Correctness | Strong read-your-writes for same client session; eventual consistency ≤ 2s with server push/long-poll |
| Footprint | < 400KB minified JS; < 1.5MB .NET DLLs; < 800KB mobile binding (excl. deps) |
| Telemetry | OTEL spans for API calls; metrics for hit/miss, refresh latency, circuit state |
| Security | OIDC access token; least-privilege scopes; mTLS optional for enterprise |
| Backwards-Compat | SemVer with non-breaking minor releases; feature flags guarded by capability negotiation |

High-Level Architecture

classDiagram
    class EcsClient {
        +GetConfig(key, options) : ConfigValue
        +GetSection(path, options) : ConfigSection
        +Subscribe(keys|paths, handler) : SubscriptionId
        +SetContext(Tenant, Environment, App, Edition)
        +FlushCache(scope?)
        +Diagnostics() : EcsDiagnostics
    }

    class Transport {
        <<interface>>
        +RestCall()
        +GrpcCall()
        +WebsocketStream()
        +LongPoll()
    }

    class CacheLayer {
        +TryGet(key) : CacheResult
        +Put(key, value, etag, ttl)
        +Invalidate(key|prefix)
        +PromoteToPersistent()
    }

    class OfflineStore {
        +Read(key) : OfflineResult
        +Write(snapshot)
        +Enumerate()
        +GC(policy)
    }

    class Resiliency {
        +WithRetry()
        +WithCircuitBreaker()
        +WithTimeouts()
        +WithJitterBackoff()
    }

    class FeatureFlags {
        +IsEnabled(flag, context) : bool
        +Variant(flag, context) : string|number
    }

    EcsClient --> Transport
    EcsClient --> CacheLayer
    EcsClient --> OfflineStore
    EcsClient --> Resiliency
    EcsClient --> FeatureFlags

Public API Surface (Cross-Language Semantics)

Core Concepts

  • Context: { tenantId, environment, applicationId, edition }
  • Selector: key (exact) or path (hierarchical section).
  • Consistency Options: { preferCache, mustBeFreshWithin, staleOk, etag }
  • Subscription: Push updates for keys/paths via WS/long-poll.

.NET (C#)

public sealed class EcsClient : IEcsClient, IAsyncDisposable
{
    Task<ConfigValue<T>> GetConfigAsync<T>(string key, GetOptions? options = null, CancellationToken ct = default);
    Task<ConfigSection> GetSectionAsync(string path, GetOptions? options = null, CancellationToken ct = default);
    IDisposable Subscribe(ConfigSubscription request, Action<ConfigChanged> handler);
    void SetContext(EcsContext ctx);
    Task FlushCacheAsync(CacheScope scope = CacheScope.App, CancellationToken ct = default);
    EcsDiagnostics Diagnostics { get; }
}

public record GetOptions(TimeSpan? MustBeFreshWithin = null, bool PreferCache = true, bool StaleOk = false, string? ETag = null);
public record EcsContext(string TenantId, string Environment, string ApplicationId, string Edition);
public record ConfigValue<T>(T Value, string ETag, DateTimeOffset AsOf, string Source); // source: memory/persistent/network

SDK Registration

services.AddEcsClient(o =>
{
    o.BaseUrl = new Uri("https://api.ecs.connectsoft.io");
    o.Auth = AuthOptions.FromOidc(clientId: "...", authority: "...", scopes: new[] {"ecs.read"});
    o.Transport = TransportMode.GrpcWithWebsocketFallback;
    o.Cache = CacheOptions.Default with { MemoryTtl = TimeSpan.FromMinutes(5), Persistent = true };
    o.Resiliency = ResiliencyProfile.Standard;
});
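
A usage sketch against the API surface above; the C# ConfigSubscription and ConfigChanged shapes mirror the TypeScript contract later in this section and are illustrative:

// Resolve a typed value with a freshness bound, then react to pushed changes.
var ecs = serviceProvider.GetRequiredService<IEcsClient>();
ecs.SetContext(new EcsContext(TenantId: "t-123", Environment: "prod", ApplicationId: "billing", Edition: "enterprise"));

var threshold = await ecs.GetConfigAsync<int>(
    "features/paywall/threshold",
    new GetOptions(MustBeFreshWithin: TimeSpan.FromSeconds(1)));

Console.WriteLine($"{threshold.Value} (etag {threshold.ETag}, source {threshold.Source})");

// Subscription handler fires on push/poll/manual refresh; dispose to unsubscribe.
using var subscription = ecs.Subscribe(
    new ConfigSubscription { Prefix = "features/" },
    change => Console.WriteLine($"changed: {change.Key} -> {change.NewETag}"));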

JavaScript / TypeScript

import { EcsClient, createEcsClient, GetOptions } from "@connectsoft/ecs";

const ecs: EcsClient = createEcsClient({
  baseUrl: "https://api.ecs.connectsoft.io",
  auth: { oidc: { clientId: "...", authority: "...", scopes: ["ecs.read"] } },
  transport: { primary: "grpc-web", fallback: ["rest", "long-poll"] },
  cache: { memoryTtlMs: 300000, persistent: true },
  resiliency: "standard"
});

const value = await ecs.getConfig<string>("features/paywall/threshold", <GetOptions>{ mustBeFreshWithinMs: 1000 });

const unsub = ecs.subscribe({ keys: ["features/*"] }, change => {
  console.log("config changed", change);
});

Mobile (MAUI / React Native)

  • MAUI: NuGet package reuses .NET SDK; persistent store via SecureStorage + local DB.
  • React Native: NPM package uses AsyncStorage/SecureStore for persistent cache; push via WS.

Caching Strategy

Two-tier cache with ETag-aware coherency and stampede protection.

| Layer | Backing | Default TTL | Notes |
|---|---|---|---|
| Memory | Concurrent in-proc map with version pins | 5m | Fast path; per-key locks prevent thundering herd |
| Persistent | IndexedDB (Web) / AsyncStorage (RN) / LiteDB/SQLite (MAUI) | 24h (stale reads allowed) | Used when offline; writes are snapshots (configSet, etag, asOf) |

Get Flow (with ETag/Stampede Guard)

sequenceDiagram
    participant App
    participant SDK as EcsClient
    participant Cache as Memory/Persistent
    participant Svc as ECS API

    App->>SDK: GetConfig(key, opts)
    SDK->>Cache: TryGet(key)
    alt Hit & Fresh
        Cache-->>SDK: value, etag
        SDK-->>App: value (source=memory)
    else Miss or Stale
        SDK->>SDK: Acquire per-key lock
        SDK->>Svc: GET /configs/{key} If-None-Match: etag
        alt 304 Not Modified
            Svc-->>SDK: 304
            SDK->>Cache: Touch(key)
            SDK-->>App: value (source=cache)
        else 200 OK
            Svc-->>SDK: value, etag
            SDK->>Cache: Put(key, value, etag, ttl)
            SDK-->>App: value (source=network)
        end
        SDK->>SDK: Release lock
    end
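
The miss path above hinges on per-key request coalescing and ETag revalidation. A minimal TypeScript sketch of that single-flight guard, assuming a plain fetch transport; the cache shapes are illustrative, not the SDK's internal types:

// Single-flight + ETag revalidation sketch (illustrative, not the SDK source).
type CacheEntry = { value: unknown; etag: string; fetchedAt: number };

const cache = new Map<string, CacheEntry>();
const inflight = new Map<string, Promise<CacheEntry>>(); // per-key lock / request coalescing

async function getConfig(key: string, maxAgeMs = 300_000): Promise<unknown> {
  const hit = cache.get(key);
  if (hit && Date.now() - hit.fetchedAt < maxAgeMs) return hit.value; // fresh memory hit

  // Coalesce concurrent misses for the same key into a single network call.
  let pending = inflight.get(key);
  if (!pending) {
    pending = revalidate(key, hit).finally(() => inflight.delete(key));
    inflight.set(key, pending);
  }
  return (await pending).value;
}

async function revalidate(key: string, stale?: CacheEntry): Promise<CacheEntry> {
  const res = await fetch(`https://api.ecs.connectsoft.io/configs/${encodeURIComponent(key)}`, {
    headers: stale ? { "If-None-Match": stale.etag } : {},
  });
  if (res.status === 304 && stale) {
    const touched = { ...stale, fetchedAt: Date.now() }; // content unchanged; refresh age only
    cache.set(key, touched);
    return touched;
  }
  const entry: CacheEntry = {
    value: await res.json(),
    etag: res.headers.get("ETag") ?? "",
    fetchedAt: Date.now(),
  };
  cache.set(key, entry);
  return entry;
}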

Options

  • PreferCache: return cached immediately; refresh in background.
  • MustBeFreshWithin: ensure data age ≤ threshold; else block for refresh.
  • StaleOk: allow stale if network unavailable (offline mode).

Offline Mode

  • Detection: transport health + browser navigator.onLine/platform reachability signal.
  • Read: when offline, return the most recent snapshot from the persistent store if StaleOk is set or the MustBeFreshWithin requirement cannot be met.
  • Write: configuration is authored server-side via Studio; runtime SDKs are read-optimized, and write APIs are guarded and disabled by default.
  • Reconciliation: on reconnect, diff by ETag and update persistent store; emit ConfigChanged events for subscribers.
flowchart LR
    Offline[[Offline]] --> ReadFromPersistent
    Reconnect[[Reconnect]] --> SyncETags --> UpdateMemory --> NotifySubscribers

Refresh & Invalidation

  • Push: WS stream ConfigChanged (per tenant/app/keys). SDK updates memory and persistent stores; raises handlers on the UI thread (mobile) / event loop (JS).
  • Long-Poll: Backoff schedule with jitter; ETag aggregation checkpoint to minimize payload.
  • Manual: FlushCache(scope) clears memory and persistent entries by key, prefix, or app.

Subscriber Contract

type ConfigSubscription = { keys?: string[], paths?: string[], prefix?: string };
type ConfigChanged = { key: string, newETag: string, reason: "push"|"poll"|"manual" };

Diagnostics & Telemetry

OpenTelemetry built-in, disabled only via explicit opt-out.

| Signal | Name | Dimensions |
|---|---|---|
| Trace span | ecs.sdk.get_config | tenant, app, key, cache_hit(bool), etag_present(bool), attempt |
| Metric (counter) | ecs_cache_hits_total | layer(memory/persistent), tenant, app |
| Metric (histogram) | ecs_refresh_latency_ms | transport(rest/grpc/ws), outcome |
| Metric (gauge) | ecs_circuit_state | service, state |
| Log | ecs.sdk.error | exception, httpStatus, circuitState, retryAttempt |

Redaction: keys logged with hashing; values never logged unless Diagnostics().EnableValueSampling() is set in dev only.

User-Agent: ConnectSoft-ECS-SDK/{lang}/{version} ({os};{runtime})


Feature Flags API

Integrated lightweight evaluation with remote rules (from ECS) + local fallbacks.

public interface IEcsFlags
{
    bool IsEnabled(string flag, FlagContext? ctx = null);
    T Variant<T>(string flag, FlagContext? ctx = null, T defaultValue = default!);
}

public record FlagContext(string? UserId = null, string? Region = null, IDictionary<string,string>? Traits = null);

  • Data Source: features/* namespace within ECS.
  • Evaluation: client-side deterministic hashing (sticky bucketing), server-side override precedence.
  • Safety: if no rules available, fail-closed for IsEnabled (configurable).
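
A minimal TypeScript sketch of deterministic sticky bucketing for a percentage rollout; the rolloutPercent rule shape is illustrative, not the actual ECS rule format:

import { createHash } from "node:crypto";

// Hash (flag, userId) into a stable bucket in [0, 100); identical inputs always land in the same bucket.
function bucketFor(flag: string, userId: string): number {
  const digest = createHash("sha256").update(`${flag}:${userId}`).digest();
  return digest.readUInt32BE(0) % 100;
}

// Fail-closed when no rule is available, mirroring the default described above (configurable).
function isEnabled(flag: string, userId: string, rolloutPercent?: number): boolean {
  if (rolloutPercent === undefined) return false;
  return bucketFor(flag, userId) < rolloutPercent;
}

// isEnabled("ui:newOnboarding", "u-9", 25) → enabled for ~25% of users, stable per user.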

Resiliency Patterns

Defaults provided by ResiliencyProfile.Standard (overridable per method).

| Concern | Default |
|---|---|
| Timeouts | connect: 2s, read: 1.5s, overall: 4s |
| Retries | 3 attempts, exponential backoff (100–800 ms) + jitter |
| Circuit Breaker | consecutive failures ≥ 5 or failure rate ≥ 50% over 20 calls → Open 30s; Half-Open probe: 10% |
| Bulkhead | Max 64 concurrent in-flight network calls per process |
| Fallback | Serve stale if available; else throw EcsUnavailableException |
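
A minimal sketch of the retry default above (3 attempts, exponential backoff within 100–800 ms plus full jitter); the shipped SDK policy may differ in detail:

async function withRetry<T>(op: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Backoff doubles per attempt (100 ms, 200 ms, 400 ms), capped at 800 ms, with full jitter.
      const cap = Math.min(100 * 2 ** attempt, 800);
      await new Promise((resolve) => setTimeout(resolve, Math.random() * cap));
    }
  }
  throw lastError; // caller falls back to stale cache or surfaces EcsUnavailableException
}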

Circuit State Events: handlers may subscribe to OnCircuitChange(service, state, reason) for operational visibility.


Tenant & Edition Routing

  • Every request carries Context headers: x-ecs-tenant, x-ecs-environment, x-ecs-application, x-ecs-edition.
  • Edition shaping: SDK hides calls to non-entitled endpoints (pre-check via capability discovery at init).
  • Partition-Aware: Affinity to nearest region discovered via bootstrap (DNS SRV / discovery endpoint); persisted for session.

Configuration & Bootstrapping

| Setting | .NET Key | JS Key | Default |
|---|---|---|---|
| Base URL | Ecs:BaseUrl | baseUrl | required |
| Transport | Ecs:Transport | transport.primary | grpc (.NET), grpc-web (web) |
| Auth | Ecs:Auth:* | auth | OIDC |
| Memory TTL | Ecs:Cache:MemoryTtlSeconds | cache.memoryTtlMs | 300 |
| Persistent | Ecs:Cache:Persistent | cache.persistent | true |
| Push | Ecs:Refresh:PushEnabled | refresh.push | true |
| Long-Poll | Ecs:Refresh:PollIntervalMs | refresh.pollIntervalMs | 30000 |
| Resiliency | Ecs:Resiliency:Profile | resiliency | standard |

Error Model (Surface to Callers)

  • EcsAuthException (401/403)
  • EcsNotFoundException (404 key/path)
  • EcsConcurrencyException (ETag precondition failed)
  • EcsUnavailableException (circuit open / transport down; Inner includes last cause)
  • EcsTimeoutException
  • EcsValidationException (bad selector/context)

Each includes: traceId, requestId, correlationId, retryAfter (if applicable).


Packaging, Versioning, Governance

| Lang | Package | Min Runtime | SemVer |
|---|---|---|---|
| .NET | ConnectSoft.Ecs | .NET 8 | MAJOR.MINOR.PATCH |
| JS/TS | @connectsoft/ecs | ES2019+ | Same |
| RN | @connectsoft/ecs-react-native | RN 0.74+ | Same |
| MAUI | Uses .NET package | .NET 8 MAUI | Same |
  • Capability Negotiation at init: server advertises supported transports and features → SDK enables/guards accordingly.
  • Deprecations: compile-time [Obsolete] (.NET) / JSDoc tags (JS) + runtime warning with link to migration docs.

Security Considerations

  • Tokens stored in OS-backed secure stores (Keychain/DPAPI/Keystore).
  • PII-safe telemetry; opt-in only for value sampling.
  • mTLS support via client cert injection (.NET/MAUI only, enterprise edition).
  • Scoped Claims enforced in gateway; SDK only forwards tokens.

Testability & Observability Hooks

  • Deterministic clocks and pluggable time providers for cache TTL tests.
  • Transport shim interface for mocking HTTP/gRPC.
  • In-memory store to simulate offline for E2E.
  • DiagnosticListener (.NET) / ecs.on(event, handler) (JS) to observe lifecycle events.

Example Usage Patterns

Config with Freshness Guard & Fallback

var config = await ecs.GetConfigAsync<int>("limits/maxSessions",
    new GetOptions(MustBeFreshWithin: TimeSpan.FromSeconds(1), PreferCache: true, StaleOk: true), ct);

Feature Flag Toggle in UI

if (ecs.flags.isEnabled("ui:newOnboarding", { userId })) {
  renderNewFlow();
} else {
  renderClassic();
}

Subscription to Section Changes

using var sub = ecs.Subscribe(
    new ConfigSubscription { Prefix = "features/" },
    change => logger.LogInformation("Changed: {Key} {ETag}", change.Key, change.NewETag));

Solution Architect Notes

  • Docs to add: per-language quickstarts, migration guide for transport changes, capability negotiation matrix by edition.
  • Optional extension: pluggable policy evaluators in SDK to locally enforce tenant policy hints (read-only apps, canary % caps).
  • Performance harness: include standard bench suite for cache hit/miss paths using recorded timelines from staging.

Config Studio UI — IA, Screens, Workflows, Policy Editor, Guardrails, Approvals, Audit Trails

Objectives

Deliver an admin-first Studio to author, validate, approve and publish tenant configuration safely. The Studio is a single-page app (SPA) backed by the ECS Gateway/YARP (BFF), enforcing tenant/edition/env scopes and segregation of duties with full auditability.


Information Architecture (IA)

flowchart LR
  root[Studio Home]
  Ten[Tenant Switcher]
  Env[Environment Switcher]
  Dash[Dashboard]
  CS[Config Sets]
  Items[Items]
  Snaps[Snapshots & Tags]
  Dep[Deployments]
  Pol[Policy Editor]
  App[Approvals]
  Aud[Audit & Activity]
  Ops[Operations: Backups/Imports]
  Acc[Access & Roles]

  root-->Dash
  root-->CS-->Items
  CS-->Snaps
  Snaps-->Dep
  root-->Pol
  root-->App
  root-->Aud
  root-->Ops
  root-->Acc
  root-->Ten
  root-->Env

Global chrome

  • Tenant switcher (only tenants user is entitled to).
  • Environment switcher (dev/test/staging/prod).
  • Edition badge (Free/Pro/Enterprise) with effective policy overlays.
  • Context pills: tenant / application / namespace / env.

Role-Based UX & Guarded Actions

| Role | Primary Views | Write Capabilities | Guardrails |
|---|---|---|---|
| Viewer | Dashboard, Config Sets, Snapshots, Audit | None | Cannot reveal secret values; masked view only |
| Developer | + Items, Diff, Local Preview | Draft edits, save, request approval | Schema validation must pass; breaking changes require approval |
| Approver | Approvals, Policy summaries | Approve/Reject, schedule | SoD: cannot approve own changes; change window enforcement |
| Tenant Admin | + Policies, Access, Ops | Tagging, alias pins, rollback, import/export | Two-person rule for prod; high-risk actions gated |
| Platform Admin | Cross-tenant operations | Break-glass overrides | Time-boxed, audited with ticket reference |

Key Workflows

Draft → Validate → Approve → Publish → Deploy → Refresh

sequenceDiagram
  participant Dev as Developer
  participant Studio as Studio UI
  participant API as ECS API (BFF)
  participant Policy as Policy Engine
  participant Approver as Approver
  participant Registry as Config Registry
  participant Orchestrator as Refresh Orchestrator

  Dev->>Studio: Edit draft (items)
  Studio->>API: PUT /config-sets/{id}/items (draft)
  Dev->>Studio: Validate
  Studio->>Policy: POST /policies/validate (draft+context)
  Policy-->>Studio: Result (violations, riskScore)
  Dev->>Studio: Request approval
  Studio->>API: POST /approvals (changeSet, riskScore)
  Approver->>Studio: Review diff, policy summary
  Approver->>API: POST /approvals/{req}:approve (window, notes)
  API->>Registry: POST /config-sets/{id}/snapshots (Idempotency-Key)
  Registry-->>Studio: Snapshot created (etag, version)
  API->>Registry: POST /deployments (env, canary?)
  Registry->>Orchestrator: ConfigPublished
  Orchestrator-->>Studio: Deployment status & refresh signal

Decision points

  • Risk scoring (see Guardrails) drives required approvals count.
  • Change windows block immediate publish to prod if outside permitted window.

Screens & Routes

| Screen | Purpose | Route | Key Backend Calls |
|---|---|---|---|
| Dashboard | Tenant health, recent changes, pending approvals | /dashboard | GET /config-sets?limit=5, GET /approvals?status=pending |
| Config Sets | List/search sets; create | /config-sets | GET /config-sets, POST /config-sets |
| Items Editor | Edit key/values with schema hints | /config-sets/:id/items | GET /config-sets/{id}/items, PUT /items/{key}, POST /items:batch |
| Diff & Validation | Compare working vs snapshot; policy violations | /config-sets/:id/diff | POST /diff, POST /policies/validate |
| Snapshots & Tags | View versions, tag SemVer, manage aliases | /config-sets/:id/snapshots | GET /snapshots, POST /snapshots, PUT /tags, PUT /aliases/* |
| Deployments | Create/monitor deployments; canary waves | /deployments | POST /deployments, GET /deployments |
| Approvals Inbox | Review requests; approve/reject with notes | /approvals | GET /approvals, POST /approvals/{id}:approve\|reject |
| Policy Editor | Edit rules, schemas, edition overlays | /policies | GET/PUT /policies, GET/PUT /schemas |
| Audit & Activity | Tamper-evident timeline, filters, export | /audit | GET /audit?filter=..., export |
| Operations | Import/export, backups, snapshots | /ops | POST /snapshots/export, POST /import |
| Access & Roles | Assign roles; view effective permissions | /access | GET/PUT /roles, GET /whoami |

Items Editor — UX Details

  • Tree + Table hybrid: hierarchical left nav (paths), key table with inline type detection.
  • Content types: JSON (editor with schema-aware IntelliSense), YAML, Text.
  • Validation overlay: inline markers (red=blocking, amber=warning).
  • Secrets: write-only fields using secret references (kvref://…), never show plaintext.
  • ETag awareness: editor surfaces server ETag; PATCH requires If-Match.

Policy Editor

Modes

  1. Visual Builder: conditions, targets, effects (allow/deny/transform).
  2. Code View: DSL/JSON with schema, auto-format & lint.
  3. Test Bench: input context (tenant, env, labels) → evaluate → see result and trace.

Artifacts

  • JSON Schema per namespace/path; used for editor IntelliSense & validation.
  • Edition Overlays: Pro/Enterprise feature gates, default limits, quotas.
  • Change Windows: cron-like rules per environment.

Sample Policy (DSL excerpt)

rule: deny-breaking-changes-in-prod
when:
  env: "prod"
  change.severity: "breaking"
then:
  deny: true
  message: "Breaking changes require CAB approval"

Guardrails & Risk Scoring

| Guardrail | Description | Severity | Enforcement |
|---|---|---|---|
| Schema violations | JSON Schema errors | Blocking | Must fix before approval |
| Breaking change | Remove/rename keys | High | Requires 2 approvals + change window |
| Secrets handling | Plaintext secret detected | High | Block; enforce kvref usage |
| Blast radius | Affects >N services | Medium/High | Extra approver or canary required |
| Change budget | Too many changes per window | Medium | Auto-suggest canary |
| Out-of-window publish | Not in allowed window | Medium | Schedule or request override |

Risk Score = Σ(weight × finding)

Approval policy:

  • Score < 20 → 1 approver (owner/lead)
  • 20–49 → 2 approvers (different role/team)
  • ≥ 50 → CAB (admin + approver; SoD enforced)

Approvals Model

  • SoD: approver cannot be the author; cannot share same group if risk ≥ 20.
  • Delegations: time-boxed with reason & ticket link.
  • Scheduling: approval can schedule publish/deploy in a future change window.
  • Evidence: approval captures diff, policy results, tests, and risk score snapshot.

Approval View

  • Left: Diff summary grouped by severity.
  • Right: Policy findings, checks passed/failed, required approvers checklist.
  • Footer: Approve/Reject with notes; Require Canary toggle.

Audit Trails

Tamper-evident timeline

  • Events hash-chained; show chain integrity indicator.
  • Filters: actor, resource, action, risk, status, env, time window.
  • Drill-down to raw CloudEvent and outbox record.
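
A minimal sketch of the integrity check behind the chain indicator, assuming each event stores the previous event's hash (field names are illustrative):

import { createHash } from "node:crypto";

type AuditEvent = { id: string; payload: string; prevHash: string; hash: string };

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Recompute every link; editing or removing any event breaks all hashes after it.
function verifyChain(events: AuditEvent[]): boolean {
  let prev = "GENESIS";
  for (const e of events) {
    if (e.prevHash !== prev) return false;
    if (e.hash !== sha256(`${e.prevHash}|${e.id}|${e.payload}`)) return false;
    prev = e.hash;
  }
  return true;
}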

Exports

  • CSV/Parquet; signed export manifest with hash of contents.
  • Redaction presets (no tenant PII, no secret refs).

Typical events

  • DraftEdited, ValidationRun, ApprovalRequested/Granted/Rejected, SnapshotCreated, DeploymentStarted/Completed, RollbackInitiated, PolicyUpdated, AccessChanged.

Accessibility, Usability & Internationalization

  • WCAG 2.1 AA: keyboard-first navigation, visible focus, ARIA roles.
  • i18n: ICU message format; RTL support; date/number locale aware.
  • Dark mode with contrast ≥ 7:1 for critical toasts/errors.

UI Architecture & Runtime

flowchart LR
  SPA[Config Studio SPA]
  BFF[YARP / Studio BFF]
  API[Registry/Policy APIs]
  BUS((Event Bus))
  WS[WS Bridge]

  SPA--REST/gRPC-->BFF
  BFF--mTLS-->API
  SPA--WS/SSE-->WS
  API--CloudEvents-->BUS
  WS--subscribes-->BUS
  • State management: RTK Query (web) / Recoil; optimistic updates for drafts.
  • Transport: gRPC-web for reads; REST for writes with Idempotency-Key.
  • Security: OIDC PKCE; refresh tokens stored in secure storage; CSP locked; SRI for static assets; iframe embedding denied.

Error Handling & Toaster Patterns

  • Blocking errors bubble to Problem Details panel with traceId.
  • Non-blocking validation issues remain inline with links to policy docs.
  • Retry affordances for transient (429/503) with backoff hint.

Telemetry & Diagnostics

  • User actions → ecs.studio.action (create-draft, request-approval, approve, publish).
  • Latency → editor validate, diff compute, publish end-to-end.
  • UX health → WS reconnects, long-poll timeouts, save conflicts (412s).
  • Session correlation → x-correlation-id propagated to API and CloudEvents.

Performance Targets

  • Diff compute ≤ 300 ms for 5k-key sets (WebAssembly json-patch optional).
  • Validation roundtrip ≤ 500 ms p95.
  • Page initial load < 2.5 s on 3G fast, TTI < 4 s.

Acceptance Criteria (engineering hand-off)

  • Routes and components scaffolded with guards per role.
  • Items editor with schema-aware linting, secret reference input, ETag display.
  • Policy editor (visual + code) + test bench with recorded contexts.
  • Approvals workflow with SoD, risk scoring, change windows, and scheduling.
  • Audit timeline with hash-chain indicator and export.
  • End-to-end tests for: validate→approve→publish→deploy; rollback with approvals; out-of-window scheduling.

Solution Architect Notes

  • Prefer server-driven UI metadata (schemas, overlays, limits) to keep Studio thin and edition-aware without redeploys.
  • Introduce “Safe Preview” mode: materialize resolved config for a sandbox service to validate integration before publish.
  • Consider conflict-free replicated drafts (CRDT) if real-time multi-editor collaboration becomes a requirement.

Policy & Governance — schema validation, rules engine (edition/tenant/env), approvals, SoD, change windows

Objectives

Define the authoritative policy system that governs how configuration is authored, validated, approved, and published across tenants, editions, and environments. Provide implementable contracts for schema validation, rules evaluation, approvals/SOD, and change windows, with deterministic enforcement points and full auditability.


Architecture Overview

flowchart LR
  Studio[Config Studio UI]
  GW[API Gateway / Envoy]
  Policy["Policy Engine (PDP)"]
  Registry[Config Registry]
  Approvals[Approvals Service]
  Bus[(Event Bus)]

  Studio -- validate/diff --> Policy
  GW -- ext_authz --> Policy
  Registry -- pre-publish validate --> Policy
  Policy -- obligations --> GW
  Policy -- CloudEvents --> Bus
  Studio --> Approvals
  Approvals --> Registry

Principles

  • Policy-as-code with signed, versioned bundles.
  • Deterministic evaluation order and idempotent outcomes.
  • Zero-trust: every sensitive operation checked at the edge or server via PDP.
  • Observability-first: decisions are explainable (inputs, rules, obligations).

Policy Artifacts

| Artifact | Purpose | Format | Storage |
|---|---|---|---|
| JSON Schema | Structural validation of config items/sets | JSON Schema 2020-12 | Registry (per namespace), cached |
| Rule bundles | Declarative allow/deny/obligate decisions | YAML/JSON (ECS DSL) or Rego (OPA mode) | Policy repo → Policy Engine |
| Edition Overlays | Feature gates, limits per plan | YAML/JSON | Registry/Policy |
| Approval Policies | Required approvers, SoD, thresholds | YAML | Policy Engine |
| Change Windows | Allowed publish/deploy windows | cron + TZ (IANA) | Policy Engine |
| Risk Scoring | Weighted findings → required gates | YAML | Policy Engine |

The engine supports native ECS DSL (recommended) and OPA/Rego as a compatibility mode. Only one mode enabled per deployment.


Evaluation Model

Inputs (decision context)

  • tenantId, editionId, environment, appId, namespace
  • actor { sub, roles[], groups[], mfa, isBreakGlass }
  • operation (e.g., config.publish, config.modify, policy.update)
  • change (diff summary: breaking/additive/neutral; counts)
  • metadata (labels, tags, size, affected services)
  • time (UTC instant + tenant-local TZ)

Order

  1. Schema Validation (blocking errors)
  2. Policy Rules (allow/deny + obligations)
  3. Edition Overlays (transform/limits)
  4. Approvals & SoD (may issue obligation: requires_approvals=n)
  5. Change Windows (may convert allow → scheduled)
  6. Rate & Quota Checks (edition quotas; ties to Gateway)

Outcomes

  • ALLOW | DENY | ALLOW_WITH_OBLIGATIONS
  • Obligations: { approvals: {required: n, roles: [...]}, scheduleAfter: <ISO8601>, requireCanary: true, maxBlastRadius: N }
  • Explanation: list of matched rules & reasons.

Contracts — Policy Engine (PDP)

REST (gRPC analogs provided)

  • POST /v1/validate → { violations[] } (JSON Schema + custom keywords)
  • POST /v1/decide → { effect, obligations, explanation[] }
  • POST /v1/risk → { score, findings[] }
  • GET /v1/bundles/:id → policy artifact materialization (signed)

Decision request (excerpt)

{
  "tenantId": "t-123",
  "editionId": "pro",
  "environment": "prod",
  "operation": "config.publish",
  "actor": { "sub": "u-9", "roles": ["developer"], "groups": ["team-billing"], "mfa": true },
  "change": { "breaking": 1, "additive": 12, "neutral": 5, "affectedServices": 4 },
  "time": "2025-08-25T10:00:00Z",
  "tz": "Europe/Berlin",
  "metadata": { "labels": ["payments","blue"] }
}

Decision response (excerpt)

{
  "effect": "ALLOW_WITH_OBLIGATIONS",
  "obligations": {
    "approvals": { "required": 2, "roles": ["approver","tenant-admin"], "sod": true },
    "requireCanary": true,
    "scheduleAfter": null
  },
  "explanation": [
    "rule:deny-breaking-without-2-approvals matched",
    "rule:prod-requires-change-window matched (window active)"
  ]
}

Schema Validation

  • JSON Schema 2020-12 per namespace/path with $ref composition.
  • Custom keywords:
    • x-ecs-secretRef: true → value must be kvref://...
    • x-ecs-maxBlastRadius: <int> → used in risk scoring
    • x-ecs-deprecated: "message" → warning surface only
  • Where enforced: Studio save, CI import, Registry publish, Adapter sync.
  • Performance target: ≤ 200 ms p95 for 5k-key set validation.

Example (snippet)

{
  "$id": "https://schemas.connectsoft.io/ecs/db.json",
  "type": "object",
  "properties": {
    "connectionString": { "type": "string", "x-ecs-secretRef": true },
    "poolSize": { "type": "integer", "minimum": 1, "maximum": 200 }
  },
  "required": ["connectionString"]
}
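
A sketch of enforcing the x-ecs-secretRef keyword with a standard validator (Ajv, 2020-12 dialect); the Registry's actual validator may differ:

import Ajv2020 from "ajv/dist/2020";

const ajv = new Ajv2020({ strict: false });

// Fields flagged x-ecs-secretRef must contain kvref:// references, never plaintext secrets.
ajv.addKeyword({
  keyword: "x-ecs-secretRef",
  type: "string",
  validate: (schemaValue: boolean, data: string) => !schemaValue || data.startsWith("kvref://"),
});

const validate = ajv.compile({
  type: "object",
  properties: {
    connectionString: { type: "string", "x-ecs-secretRef": true },
    poolSize: { type: "integer", minimum: 1, maximum: 200 },
  },
  required: ["connectionString"],
});

console.log(validate({ connectionString: "kvref://vault/billing/db-conn#v3" })); // true
console.log(validate({ connectionString: "Server=db;Password=..." }));           // false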

Rules Engine (ECS DSL)

Structure

apiVersion: ecs.policy/v1
kind: RuleSet
metadata: { name: tenant-prod-guardrails, tenantId: t-123 }
spec:
  rules:
    - name: deny-plaintext-secrets
      when: op == "config.publish" && env == "prod" && any(change.paths, . endsWith "connectionString" && !isSecretRef(.value))
      effect: DENY
      message: "Secrets must be kvref:// references"

    - name: breaking-needs-2-approvals
      when: op == "config.publish" && env == "prod" && change.breaking > 0
      effect: ALLOW_WITH_OBLIGATIONS
      obligations:
        approvals.required: 2
        approvals.roles: ["approver","tenant-admin"]
        approvals.sod: true
        requireCanary: true

    - name: pro-plan-limits
      when: edition == "pro"
      effect: ALLOW_WITH_OBLIGATIONS
      obligations:
        quotas.maxConfigKeys: 10000
        quotas.maxRpsRead: 600

    - name: prod-change-window
      when: op in ["config.publish","deploy.start"] && env == "prod" && !withinWindow("sat-22:00..sun-04:00", tz)
      effect: ALLOW_WITH_OBLIGATIONS
      obligations:
        scheduleAfter: nextWindow("sat-22:00..sun-04:00", tz)

Operators

  • Logical: && || !
  • Comparators: == != > >= < <= in
  • Helpers: any()/all(), withinWindow(), nextWindow(), isSecretRef()
  • Context vars: op, env, edition, tenantId, labels[], change.*

Edition/Tenant/Environment Overlays

Precedence global < edition < tenant < environment < namespace

  • Edition gates: feature toggles, quotas (RPS/keys/TTL), security posture (mTLS required).
  • Tenant customizations: additional controls (SoD strictness, approver roles).
  • Environment overrides: windows, rollout strategies.
  • Namespace: schema variants and local guardrails.

Bundles compiled into a single effective ruleset per (tenant, environment) at evaluation time; cached by PDP.


Approvals & Segregation of Duties (SoD)

Model

  • Approval objects created when obligations require it.
  • Approver pools resolved from roles/groups; SoD enforces author ≠ approver; if riskScore ≥ threshold, approvers must come from distinct groups.
  • Escalation: time-boxed delegation possible (delegatedTo, expiresAt, ticket).
  • Evidence: risk report, diff patch, validation output embedded.

Approval Matrix (default)

| Condition | Required Approvals | SoD | Notes |
|---|---|---|---|
| Non-prod, non-breaking | 0 | — | Immediate publish |
| Prod, additive only | 1 | — | Approver role |
| Prod, breaking | 2 | ✓✓ | Approver + Tenant Admin; canary required |
| High blast radius (> N services) | 2 | ✓✓ | Or schedule inside window |
| Emergency (break-glass) | 1 (Platform Admin) | — | Auto-create post-incident review task |

Change Windows

  • Defined per tenant/environment using CRON+TZ or range expressions.
  • PDP helper withinWindow(window, tz) respects IANA time zones and DST.
  • Enforcement: outside window, decisions add scheduleAfter obligation; Platform Admin can override with isBreakGlass=true (audited).

Examples

windows:
  prod:
    - name: weekend-window
      cron: "0 22 * * SAT"    # start
      duration: "6h"
      tz: "Europe/Berlin"
    - name: freeze
      range: "2025-12-20T00:00..2026-01-05T23:59"
      allow: false
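
A simplified TypeScript sketch of the withinWindow helper for range expressions such as "sat-22:00..sun-04:00", using Intl for IANA time zones; the production evaluator also covers cron forms and DST edge cases:

const DAYS = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"];

// Convert "sat-22:00" into minutes since Sunday 00:00 of the week.
function weekMinutes(expr: string): number {
  const [day, time] = expr.split("-");
  const [h, m] = time.split(":").map(Number);
  return DAYS.indexOf(day.toLowerCase()) * 1440 + h * 60 + m;
}

// Current week-minute in the given IANA time zone.
function nowWeekMinutes(tz: string, at: Date): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone: tz, weekday: "short", hour: "2-digit", minute: "2-digit", hourCycle: "h23",
  }).formatToParts(at);
  const get = (type: string) => parts.find((p) => p.type === type)?.value ?? "";
  const day = DAYS.indexOf(get("weekday").toLowerCase().slice(0, 3));
  return day * 1440 + Number(get("hour")) * 60 + Number(get("minute"));
}

// True if "at" falls inside a window like "sat-22:00..sun-04:00" (wrap across the week supported).
function withinWindow(window: string, tz: string, at = new Date()): boolean {
  const [startExpr, endExpr] = window.split("..");
  const start = weekMinutes(startExpr);
  const end = weekMinutes(endExpr);
  const now = nowWeekMinutes(tz, at);
  return start <= end ? now >= start && now < end : now >= start || now < end;
}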

Risk Scoring

Findings → score (weights configurable)

  • Schema violations: blocking (no score)
  • Breaking changes: +30
  • Secrets change count: +5 each (cap 30)
  • Blast radius (services affected): +1 each (cap 20)
  • Out-of-window: +10
  • Author role (developer vs admin): +0/−5 (experience credit)
  • Test coverage signal (from CI): −10 if integration tests pass

Thresholds

  • <20 → no approval
  • 20–49 → 1 approval
  • ≥50 → 2 approvals + canary
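
A minimal sketch of the scoring and gating logic using the default weights above (weights and caps are configurable; breaking changes are treated here as a flat +30):

type Findings = {
  breakingChanges: number;
  secretChanges: number;
  affectedServices: number;
  outOfWindow: boolean;
  authorIsAdmin: boolean;
  integrationTestsPassed: boolean;
};

function riskScore(f: Findings): number {
  let score = 0;
  score += f.breakingChanges > 0 ? 30 : 0;        // breaking changes
  score += Math.min(f.secretChanges * 5, 30);     // secrets changed, capped at 30
  score += Math.min(f.affectedServices, 20);      // blast radius, capped at 20
  score += f.outOfWindow ? 10 : 0;                // out-of-window publish
  score += f.authorIsAdmin ? -5 : 0;              // experience credit
  score += f.integrationTestsPassed ? -10 : 0;    // CI signal
  return Math.max(score, 0);
}

function requiredGates(score: number): { approvals: number; canary: boolean } {
  if (score >= 50) return { approvals: 2, canary: true };
  if (score >= 20) return { approvals: 1, canary: false };
  return { approvals: 0, canary: false };
}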

Enforcement Points

| Point | What is enforced | How |
|---|---|---|
| Gateway (ext_authz) | AuthZ decision for admin APIs; edition quotas | POST /v1/decide with operation mapping (config.modify, config.delete) |
| Registry (pre-publish) | Schema + rules for publish/rollback | validate + decide with operation=config.publish |
| Studio (UX) | Early validation, approver hints | validate + risk for live feedback |
| Approvals Service | Satisfy obligations | Enforce SoD, windows, delegations |
| Adapter Hub | Provider sync guardrails | decide(operation=adapter.sync) for tenants with extra controls |

Storage & Versioning

  • PolicyBundle(id, version, hash, signedBy, createdAt, scope={tenant|global})
  • EffectivePolicy(tenantId, env, bundleId, compiledHash, etag) — hot cache in PDP
  • Audit every decision, including inputs, effect, obligations, explanation, duration (ms)
  • Signing with Sigstore or KMS-backed JWS; PDP rejects unsigned or stale bundles.

Observability & Audit

  • Metrics: pdp.decisions_total{effect}, pdp.duration_ms (hist), pdp.cache_hit_ratio, approvals.pending, windows.scheduled_ops
  • Logs: structured with decisionId, tenantId, operation, rulesMatched[]
  • CloudEvents:
    • ecs.policy.v1.PolicyUpdated
    • ecs.policy.v1.DecisionRendered (optional sampling)
    • ecs.policy.v1.ApprovalRequired/Granted/Rejected
    • ecs.policy.v1.WindowScheduleCreated

Performance & Availability Targets

  • PDP decision p95 < 5 ms (cached), p99 < 20 ms (cold compile)
  • Validate+decide in publish flow < 150 ms p95 for 5k-key diff
  • PDP HA: 3 replicas/region, sticky by (tenantId, env); warm compiled cache on deploy
  • Cache TTL: bundle cache 60s; revalidate on PolicyUpdated event

Test Matrix (conformance)

| Area | Scenario | Expectation |
|---|---|---|
| Schema | invalid secret ref | validate fails, code SCHEMA.SECRET_REF |
| Rules | breaking change in prod | ALLOW_WITH_OBLIGATIONS, approvals=2, canary=true |
| Windows | publish outside window | obligation scheduleAfter != null |
| SoD | author approves own change | rejection SOD.VIOLATION |
| Edition | pro tier quotas exceeded | DENY, code QUOTA.EXCEEDED |
| Risk | high blast radius | score ≥ 50 → 2 approvals |

Acceptance Criteria (engineering hand-off)

  • PDP service with /validate, /decide, /risk implemented (REST + gRPC), OTEL enabled.
  • DSL parser & evaluator with helpers (withinWindow, isSecretRef, nextWindow).
  • Bundle signer/loader + hot-reload on PolicyUpdated.
  • Gateway ext_authz integration mapping route → operation.
  • Approvals service honoring obligations with SoD and scheduling; Studio UI surfaces requirements.
  • End-to-end tests for publish with approvals, out-of-window scheduling, and rollback governance.

Solution Architect Notes

  • Choose ECS DSL as primary for readability; keep OPA compatibility flag for enterprises with existing Rego policies.
  • Define a policy pack per edition and allow tenant overrides only by additive (stricter) rules, never weakening base controls.
  • Add simulation mode in Studio: run decide against a proposed change to preview approvals/windows before requesting them.

Security Architecture — data protection, secrets, KMS, key rotation, multi-tenant isolation, threat model

Objectives

Establish a defense-in-depth security design for ECS aligned to ConnectSoft’s Security-First, Compliance-by-Design approach. This section operationalizes data protection, secrets/KMS, key rotation, multi-tenant isolation, and a threat model that maps to concrete controls, runbooks, and acceptance criteria.


Trust Boundaries & Security Plan

flowchart LR
  subgraph Internet
    User[Studio Users]
    SDKs[SDKs/Services]
  end

  WAF[WAF/DoS Shield]
  Envoy["API Gateway (JWT/JWKS, RLS, ext_authz)"]
  YARP["YARP (BFF/Internal Gateway)"]

  subgraph Core[ECS Services]
    REG[Config Registry]
    POL[Policy Engine (PDP)]
    ORC[Refresh Orchestrator]
    HUB[Provider Adapter Hub]
  end

  CRDB[(CockroachDB)]
  REDIS[(Redis Cluster)]
  BUS[(Event Bus)]
  KMS[(Cloud KMS/HSM)]
  VAULT[(Secret Stores: KeyVault/SecretsManager)]
  LOGS[(Audit/Telemetry Store)]

  User-->WAF-->Envoy-->YARP-->REG & POL & ORC
  SDKs-->WAF-->Envoy
  REG--mTLS-->CRDB
  REG--mTLS-->REDIS
  ORC--mTLS-->BUS
  HUB--mTLS-->Providers[(Azure/AWS/Consul/Redis/SQL)]
  REG--sign/enc-->LOGS
  Services--KEK/DEK ops-->KMS
  Services--secret refs-->VAULT

Controls by boundary

  • Edge: WAF + IP reputation, geo & residency gate, TLS 1.2+, JWT validation (aud/iss/exp/nbf), rate limits.
  • Service mesh: mTLS (SPIFFE/SPIRE or workload identity), least-privilege network policies.
  • Data: encryption at rest, envelope encryption for sensitive blobs, append-only audits, tamper-evident chains.

Data Classification & Protection

| Class | Examples | Storage | At Rest | In Transit | Additional |
|---|---|---|---|---|---|
| Public | API docs, OpenAPI | Object store/CDN | Provider default | TLS | CSP/SRI |
| Internal | Metrics, non-PII logs | LOGS | Disk enc. + log signing (optional) | TLS | PII redaction |
| Confidential | Config values (non-secret), diffs | CRDB JSONB | TDE/cluster enc | mTLS | ETag/ETag-hash only |
| Restricted | Secret refs, tokens, credentials | Vault/KMS | KMS-backed | TLS/mTLS | No plaintext in DB |
| Audit-critical | Approvals, policy decisions | CRDB (append-only) | Enc + hash-chain | TLS/mTLS | WORM export optional |

Policy: No plaintext secrets are persisted in Snapshots; only kvref URIs (see below).


Secrets Architecture

Secret References (kvref://)

  • Format: kvref://{provider}/{path}#version?opts
    • Example: kvref://vault/billing/db-conn#v3
  • Resolved at read time by Resolver/SDK with caller’s auth context; never materialized into snapshots.
  • Secret providers: Azure Key Vault, AWS Secrets Manager, optional HashiCorp Vault.

Resolution flow

  1. Registry materializes canonical config with secret refs.
  2. SDK/Resolver sees kvref://… keys and fetches from provider using short-lived credentials (workload identity or token exchange).
  3. Values cached in SDK memory only (no persistent secret caching). TTL from provider.
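
A minimal sketch of the first half of step 2: parsing a kvref URI into provider, path, and version before dispatching to the matching secret-provider client (the provider calls themselves are out of scope here):

type KvRef = { provider: string; path: string; version?: string; opts: URLSearchParams };

// kvref://{provider}/{path}#version?opts — e.g. kvref://vault/billing/db-conn#v3
function parseKvRef(uri: string): KvRef {
  const url = new URL(uri);
  if (url.protocol !== "kvref:") throw new Error(`not a kvref URI: ${uri}`);
  // Options may trail the fragment (…#v3?ttl=60) or appear as a regular query string.
  const [version, fragOpts] = url.hash ? url.hash.slice(1).split("?") : [undefined, undefined];
  return {
    provider: url.host,
    path: url.pathname.replace(/^\//, ""),
    version: version || undefined,
    opts: new URLSearchParams(fragOpts ?? url.search),
  };
}

// parseKvRef("kvref://vault/billing/db-conn#v3")
// → { provider: "vault", path: "billing/db-conn", version: "v3", opts: (empty) }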

Guardrails

  • Schema keyword x-ecs-secretRef: true required for fields like connectionString.
  • Studio blocks plaintext entry; provides secret picker UI.

Key Management (KMS/HSM) & Crypto Posture

Key Hierarchy (Envelope Encryption)

graph TD
  CMK["Root Customer Managed Key (per region)"] -->|wrap/unwrap| TMK["Tenant Master Key (per tenant)"]
  TMK -->|wrap| DEK["Data Encryption Keys (resource-scoped)"]
  DEK -->|encrypt| SENS["Sensitive blobs (e.g., idempotency records with PII hints, optional)"]
  • CMK: Regional KMS/HSM (e.g., Azure Key Vault Managed HSM / AWS KMS CMK).
  • TMK: Derived per tenant; rotatable without re-encrypting data (envelope rewrap).
  • DEK: Per table/feature or per artifact class; rotated automatically on cadence.
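
A minimal sketch of the DEK layer using AES-256-GCM; wrapping the DEK with the tenant master key is a KMS/HSM operation and is only indicated in comments here:

import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypt a sensitive blob under a fresh data encryption key (DEK).
function encryptWithDek(plaintext: Buffer) {
  const dek = randomBytes(32);  // per-artifact DEK
  const iv = randomBytes(12);   // GCM nonce
  const cipher = createCipheriv("aes-256-gcm", dek, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Production: wrap `dek` with the TMK via KMS and persist only the wrapped form;
  // rotation then rewraps DEKs without re-encrypting the data itself.
  return { ciphertext, iv, tag, dek };
}

function decryptWithDek(ciphertext: Buffer, dek: Buffer, iv: Buffer, tag: Buffer): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", dek, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}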

Cryptographic Controls

  • TLS 1.2/1.3, strong ciphers; OCSP stapling at edge.
  • Hashing: SHA-256 for ETags and lineage hashes; HMAC-SHA256 for value integrity in Redis.
  • Optional: JWS signing of Snapshots/Audit exports (future toggle).

Rotation & Credential Hygiene

| Asset | Rotation | Method | Blast Radius Control |
|---|---|---|---|
| JWKS (IdP keys) | 90 days | Add new kid, overlap 24h, retire old | Edge monitors jwt_authn_failed |
| API Keys (webhooks) | 90 days | Dual-key period; HMAC header sha256= | Per-tenant scope |
| KMS CMK | Yearly | Automatic rotation | Rewrap TMKs opportunistically |
| TMK (per tenant) | 180–365 days | Rewrap DEKs in background | Tenant-scoped jobs |
| Service creds (bus/redis/sql) | 90 days | Workload identity preferred; otherwise secret rotation + pod bounce | Per namespace |
| Redis ACLs | 90 days | Dual ACL users; rotate, then drop | Per region |
| Adapter bindings | 90 days | Lease-based CredentialEnvelope.not_after | Auto-rebind |

Runbook excerpts

  • Key rollover: publish new JWKS; canary Envoy; monitor; retire old.
  • TMK rewrap: schedule low-traffic window; checkpoint progress; pause tenant writes if needed (rare).

Multi-Tenant Isolation

Data Plane

  • CRDB: REGIONAL BY ROW with tenant_id in PK; repository guards enforce tenant scoping.
  • Redis: key prefixes ecs:{tenant}:{env}:…; ACLs restrict prefixes per tenant for dedicated caches.
  • Events: CloudEvents carry tenantid extension; per-tenant topics/partitions to avoid cross-talk.

Control Plane

  • JWT claims: tenant_id, scopes; ext_authz checks at Envoy; PDP obligations enforce edition limits.
  • Studio: SoD & role checks; no cross-tenant UI elements.

Compute & Network

  • K8s namespaces per environment; NetworkPolicies deny east-west by default; adapters run out-of-process with mTLS.
  • SPIFFE identities: spiffe://ecs/{service}/{env}; RBAC by SPIFFE ID for inter-service calls.

Isolation Tests (every release)

  • Negative tests: attempt cross-tenant read/write via forged headers → expect 403.
  • Timing analysis: ensure response timing doesn’t leak tenant existence.
  • Event leakage: subscribe to foreign tenant topic → no messages.

Threat Model (STRIDE→Controls)

| Threat | Example | Control (Design) | Detective |
|---|---|---|---|
| **S**poofing | Forged JWT | Envoy JWT verify (aud/iss/kid); mTLS internal; token exchange with act chain | AuthN failure metrics, anomaly IP alerts |
| **T**ampering | Modify snapshot content | Append-only tables; ETag content hash; optional JWS signatures; change requires new version | Hash verification on read; audit hash-chain check |
| **R**epudiation | “I didn’t approve” | Signed decisions; SoD; immutable audit with actor sub@iss; IP/device fingerprint | Audit dashboards, integrity checks |
| **I**nformation disclosure | Secret leakage in logs | Secret refs only; structured logging with redaction; no secrets persisted | PII/secret scanners in CI and runtime |
| **D**oS | Resolve flood or publish storm | RLS per tenant; SDK backoff; long-poll waiter caps; cache SWR; circuit breakers | 429/latency SLO alerts, waiter pool saturation alarms |
| **E**levation of privilege | Dev acts as admin | Role->scope mapping; PDP policies; break-glass with time-box + extra audit | Privileged action audit, approval anomalies |
| Supply chain | Image tamper | SBOM + cosign signing; admission policy; CVE scanning; base images pinned | Image signature verification, scanner alerts |
| SSRF via Adapters | Malicious provider URL | Allowlist provider endpoints; egress policies; input validation & timeouts | Egress deny logs |
| Replay | Replayed webhook/API calls | HMAC with timestamp; Idempotency-Key for POST; short clock skew | Replay detector metrics |
| Data residency | EU data in US | Regional routing; crdb_region pin; residency policy at PDP | Residency audit reports |

Application-Level Security Controls

  • Input validation: JSON Schema enforcement server-side; size caps; content-type allowlist.
  • Output encoding/CORS: locked origins for Studio; headers X-Content-Type-Options, X-Frame-Options, Referrer-Policy.
  • Least-privilege IAM: service principals scoped to exact resources; adapter credentials per-binding.
  • Secure defaults: TLS required; HTTP downgraded traffic rejected; same-site cookies (Lax or Strict) for Studio.

Observability & Evidence

  • OTEL everywhere; security spans tag: tenantId, actor, decisionId, effect.
  • Security metrics: authz_denied_total, jwt_authn_failed_total, rate_limited_total, policy_decision_ms, secret_ref_resolution_ms.
  • Audit exports: signed/hashed; scheduled to WORM storage if enabled.

Incident Response (IR) & Playbooks

Key compromise (tenant TMK)

  1. Quarantine tenant (PDP deny writes; read-only).
  2. Rotate TMK; rewrap DEKs; re-issue creds.
  3. Force SDK token renewal; invalidate Redis keys; emit tenant-wide RefreshRequested.
  4. Forensics: export audit; notify tenant.

Web token abuse

  1. Revoke client app; rotate JWKS if issuer compromised.
  2. Invalidate refresh tokens; raise auth challenge requirements (MFA).
  3. Search logs for sub anomalies; notify tenant.

Data exfil suspicion

  • Activate access freeze policy; export and verify audit hash chain.
  • Run residency check; compare CRDB region distribution.

Compliance Hooks

  • Access reviews: quarterly role attestations; report who has tenant.admin/platform-admin.
  • Data subject requests (DSAR): audit index supports by-tenant export.
  • Pen-testing cadence: semi-annual + post-major release; findings tracked in risk register.
  • Third-party adapters: vendor security questionnaire + sandbox tenancy.

Acceptance Criteria (engineering hand-off)

  • mTLS enabled service-to-service (SPIFFE) with policy-enforced identities.
  • Secret refs (kvref://) enforced by schema; resolver libraries for .NET/JS/Mobile implemented.
  • KMS envelope encryption library with CMK/TMK/DEK operations + rewrap CLI.
  • Rotation jobs & runbooks codified (JWKS, TMK, service creds); canary + rollback procedures.
  • Tenant isolation tests automated in CI; event leakage tests in staging.
  • WAF, RLS, ext_authz configured; security dashboards and alerts live.
  • SBOM generation and image signing required in CI; admission policy verifies signatures.

Solution Architect Notes

  • Evaluate confidential computing (TEE) for server-side resolution of secrets for high-assurance tenants.
  • Consider per-tenant Redis in Enterprise for stronger isolation at cost increase.
  • Add JWS snapshot signing behind a feature flag to strengthen non-repudiation.
  • Plan red team exercise focused on adapter SSRF and multi-tenant boundary bypass.

Observability — OTEL spans/metrics/logs, trace taxonomy, dashboards, SLOs/SLIs/alerts

Objectives

Deliver an observability-first design that provides end-to-end traces, actionable metrics, and structured logs across ECS services (Gateway, YARP, Registry, Policy Engine, Refresh Orchestrator, Adapter Hub, SDKs). Define trace taxonomy, metrics catalogs, dashboards, and SLOs/SLIs/alerts with multi-tenant visibility and low-cardinality guardrails.


Architecture & Dataflow

flowchart LR
  subgraph Client
    SDKs[SDKs / Studio]
  end

  WAF[WAF/DoS]
  Envoy["Envoy (Edge)"]
  YARP["YARP (BFF)"]
  REG[Config Registry]
  PDP["Policy Engine (PDP)"]
  ORC[Refresh Orchestrator]
  HUB[Provider Adapter Hub]
  CRDB[(CockroachDB)]
  REDIS[(Redis)]
  BUS[(Event Bus)]

  COL[OTEL Collector]
  TS[(Time-series DB)]
  LS[(Log Store/Index)]
  TMS[(Tracing DB)]
  APM[Dashboards/Alerting]

  SDKs-->WAF-->Envoy-->YARP-->REG
  REG-->PDP
  REG-->REDIS
  REG-->CRDB
  REG-->BUS
  ORC-->BUS
  HUB-->Providers[(External Providers)]
  HUB-->BUS

  Envoy--OTLP-->COL
  YARP--OTLP-->COL
  REG--OTLP-->COL
  PDP--OTLP-->COL
  ORC--OTLP-->COL
  HUB--OTLP-->COL
  COL-->TS
  COL-->LS
  COL-->TMS
  APM---TS
  APM---TMS
  APM---LS

Collector topology

  • DaemonSet + Gateway collectors; tail-based sampling at gateway.
  • Pipelines: traces (OTLP → sampler → TMS), metrics (OTLP → TS), logs (OTLP/OTLP-logs → LS).
  • Exemplars: latency histograms carry trace IDs to deep-link from metrics to traces.

Trace Taxonomy & Propagation

Naming (verb-noun, dot-scoped)

| Layer | Span name | Kind |
|---|---|---|
| Edge | ecs.http.server (Envoy) | SERVER |
| BFF | ecs.http.proxy (YARP routeId) | SERVER/CLIENT |
| Registry | ecs.config.resolve, ecs.config.publish, ecs.config.diff | SERVER |
| PDP | ecs.policy.decide, ecs.policy.validate | SERVER |
| Orchestrator | ecs.refresh.fanout, ecs.refresh.invalidate | INTERNAL |
| Adapter Hub | ecs.adapter.put, ecs.adapter.watch | CLIENT/SERVER |
| DB | db.sql.query (CRDB) | CLIENT |
| Redis | cache.get, cache.set, cache.lock | CLIENT |
| Bus | messaging.publish, messaging.process | PRODUCER/CONSUMER |
| SDK | ecs.sdk.resolve, ecs.sdk.subscribe | CLIENT |

Required attributes (consistent keys)

  • tenant.id, env.name, app.id, edition.id, plan.tier
  • route, http.method, http.status_code, grpc.code
  • user.sub (hashed), actor.type (user|machine)
  • config.path or config.set_id, etag, version
  • policy.effect (allow|deny|obligate), policy.rules_matched
  • cache.hit (true|false|swr), cache.layer (memory|redis)
  • db.statement_sanitized (true)
  • rls.result (ok|throttled)
  • error.type, error.code

Context propagation

  • W3C Trace Context: traceparent + tracestate
  • Baggage (limited): tenant.id, env.name, app.id (≤3 keys)
  • CloudEvents: include traceparent extension; correlationid mirrors trace ID for non-OTLP consumers.
  • Headers surfaced to services: x-correlation-id (stable), traceparent (for logs).

Sampling

  • Head sampling default 10% for healthy traffic (dynamic).
  • Tail-based rules (max 100%):
    • http.status >= 500 or grpc.code != OK
    • Latency above p95 per route
    • policy.effect != allow
    • route in [/deployments, /snapshots, /refresh]
    • Tenant allowlist (troubleshooting sessions)

Metrics Catalog (SLI-ready, low-cardinality)

Dimensions used widely: tenant.id (hashed), env.name, region, route|rpc, result. Avoid key/path labels.

Edge & Gateway

  • http_requests_total{route,code,tenant} (counter)
  • http_request_duration_ms{route} (histogram with exemplars)
  • ratelimit_decisions_total{tenant,result} (counter)

Registry / Resolve path

  • resolve_requests_total{result} (counter: hit|miss|304|200)
  • resolve_duration_ms{result} (histogram)
  • cache_hits_total{layer} (counter)
  • waiter_pool_size (gauge), coalesce_ratio (gauge)
  • etag_mismatch_total (counter)

Publish / Versioning

  • publish_total{result} (counter)
  • publish_duration_ms (histogram)
  • diff_duration_ms (histogram), snapshot_size_bytes (histogram)
  • policy_block_total{reason} (counter)

PDP

  • pdp_decisions_total{effect} (counter)
  • pdp_duration_ms (histogram)
  • pdp_cache_hit_ratio (gauge)

Refresh Orchestrator & Eventing

  • refresh_events_total{type} (counter)
  • propagation_lag_ms (histogram: publish→first 200 resolve)
  • ws_connections{tenant} (gauge), ws_reconnects_total (counter)
  • dlq_messages{queue} (gauge), replay_total{result} (counter)

Adapters

  • adapter_ops_total{provider,op,result} (counter)
  • adapter_op_duration_ms{provider,op} (histogram)
  • adapter_throttle_total{provider} (counter)
  • adapter_watch_gaps_total{provider} (counter)

Data stores

  • CRDB: sql_qps, txn_restarts_total, replica_leaseholders_ratio, kv_raft_commit_latency_ms
  • Redis: hits, misses, latency_ms, evictions, blocked_clients

Logs (structured, privacy-safe)

Envelope (JSON)

{
  "ts":"2025-08-25T10:21:33.410Z",
  "sev":"INFO",
  "svc":"registry",
  "msg":"Resolve completed",
  "trace_id":"d1f0...",
  "span_id":"a21e...",
  "corr_id":"c-77f2...",
  "tenant.id":"t-***5e",
  "env.name":"prod",
  "route":"/api/v1/resolve",
  "duration_ms":42,
  "cache.hit":true,
  "etag":"9oM1hQ...",
  "user.sub_hash":"u-***9a",
  "code":"OK",
  "extra":{"coalesced":3}
}

Redaction & hygiene

  • Never log config values or secrets; keys hashed when necessary.
  • PII hashed/salted; toggle dev sampling only in non-prod.
  • Enforce log schema via collector processors; drop fields not in schema.

Dashboards (role-focused)

1) SRE — “ECS Golden Signals”

  • Edge: request rate, error %, p95 latency by route, 429 rate.
  • Resolve: hit ratio (memory/redis), resolve p95, waiter size/coalesce ratio.
  • Propagation: publish→first resolve lag p95/p99, WS connections, DLQ size.
  • DB/Cache: CRDB restarts & raft latency, Redis latency & evictions.
  • Burn-rate tiles for SLOs (see below) with auto-links to exemplar traces.

2) Product/Platform — “Tenant Health”

  • Per-tenant overview: usage vs quota, top endpoints, rate-limit hits, policy denials.
  • Drill-down: deployment success rate, change lead time (draft→publish).

3) Security — “Policy & Access”

  • PDP decision mix, denied reasons, break-glass actions, secret-ref validation failures.
  • Geo & residency conformance (data access by region).

4) Adapters & Eventing

  • Provider op success/latency by adapter, throttle events, watch gaps.
  • Event bus throughput, consumer lag, DLQ trends, replay outcomes.

SLOs, SLIs & Alerts

Service SLOs (initial targets)

| Service/Path | SLI | SLO (rolling 30d) | Error budget |
|---|---|---|---|
| Resolve (SDK) | Availability = 1−(5xx+network errors)/all | 99.95% | 21m/mo |
| Resolve (SDK) | Latency p95 (in-region) | ≤ 250 ms | N/A |
| Publish (Studio/API) | Success rate for publish operations | 99.5% | 3.6h/mo |
| Publish (Studio/API) | Publish→first client refresh p95 | ≤ 5 s (Pro/Ent), ≤ 15 s (Starter) | N/A |
| PDP | Decision p95 | ≤ 5 ms | N/A |
| WS Bridge | Connection stability (drop rate per hour) | ≤ 1% | N/A |

Definitions

  • Propagation lag = time from ConfigPublished accepted to first successful client Resolve(200/304) with new ETag per tenant/env.

Multi-window burn-rate alerts (error-budget based)

  • Page: BR ≥ 14 over 5m and ≥ 7 over 1h
  • Warn: BR ≥ 2 over 6h and ≥ 1 over 24h
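
Burn rate is the observed error rate divided by the rate the SLO budget allows. A minimal sketch of the paging condition above against the 99.95% Resolve SLO:

// errorRatio: failed/total over the window; slo: e.g. 0.9995 for 99.95%
const burnRate = (errorRatio: number, slo: number) => errorRatio / (1 - slo);

function shouldPage(errorRatio5m: number, errorRatio1h: number, slo = 0.9995): boolean {
  return burnRate(errorRatio5m, slo) >= 14 && burnRate(errorRatio1h, slo) >= 7;
}

// shouldPage(0.01, 0.005) → burn rates 20 and 10 → both thresholds met → page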

Symptom alerts (key signals)

  • Resolve p95 > 300 ms (10m)
  • Rate-limit 429% > 2% (5m) for any tenant
  • Propagation lag p95 > 8 s (15m) or DLQ size > 100 for 10m
  • PDP decision p99 > 20 ms (10m)
  • Redis evictions > 0 (5m) sustained

All alerts annotate current incidents with links to exemplar traces and last deployments.


Collector & Policy (reference snippets)

Tail-based sampling (OTEL Collector)

receivers:
  otlp: { protocols: { http: {}, grpc: {} } }

processors:
  attributes:
    actions:
      - key: tenant.id
        action: hash
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: long-latency
        type: latency
        latency: {threshold_ms: 250}
      - name: important-routes
        type: string_attribute
        string_attribute: {key: http.target, values: ["/deployments", "/snapshots", "/refresh"]}
exporters:
  otlphttp/traces: { endpoint: "https://trace.example/otlp" }
  prometheus: { endpoint: "0.0.0.0:9464" }
  loki: { endpoint: "https://logs.example/loki/api/v1/push" }

service:
  pipelines:
    traces: { receivers: [otlp], processors: [attributes, tail_sampling], exporters: [otlphttp/traces] }
    metrics: { receivers: [otlp], processors: [], exporters: [prometheus] }
    logs:    { receivers: [otlp], processors: [attributes], exporters: [loki] }

Correlation & Troubleshooting Playbooks

  • Start from SLO tiles → click exemplar to open failing trace.
  • From trace: check policy.decide span (effect & rules), cache layer tags, db/sql timings.
  • If 429s spike: inspect ratelimit_decisions_total by tenant; apply temporary override or identify hot route.
  • Propagation issues: check DLQ and refresh.fanout spans; verify Redis latency; perform targeted replay.
  • Regional anomalies: compare leaseholder distribution and raft commit latency.

Cardinality & Cost Guardrails

  • Disallow high-cardinality labels: raw key/path, user IDs, stack traces in metrics.
  • Hash tenant IDs and user subjects; limit unique operation names.
  • Log retention: 14d (non-audit) default; 30–365d for audit in dedicated store.
  • Metrics retention: 13 months downsampled; traces 7–14d with error traces kept longer.

Acceptance Criteria (engineering hand-off)

  • OTEL SDKs enabled in all services and client SDKs (.NET/JS/Mobile) with required attributes.
  • Collector pipelines deployed (daemonset + gateway) with tail-based sampling and attribute hashing.
  • Metrics emitted per catalog; histograms with exemplars enabled.
  • Structured JSON log schema enforced; PII/secret redaction verified.
  • Dashboards delivered for SRE, Tenant Health, Security, Adapters/Eventing.
  • SLOs encoded in monitoring system with burn-rate alert policies and on-call rotations.
  • Runbooks published for 429 spikes, propagation lag, DLQ growth, PDP degradation.

Solution Architect Notes

  • Consider exemplars+RUM in Studio to tie user actions to backend traces.
  • Add feature-flagged debug sampling per tenant (temporary, auto-expires) to diagnose issues without global sampling changes.
  • Revisit metrics cardinality quarterly; adopt RED/USE dashboards for each microservice as standard templates.

Performance & Capacity — read-heavy scaling plan, Redis sizing, p99 targets, load profiles, KEDA/HPA

Objectives

Design a read-heavy, latency-bounded capacity plan that meets ECS SLOs under bursty, multi-tenant loads. This section specifies p99 targets, traffic models, Redis sizing, and autoscaling policies (KEDA/HPA) for Registry/Resolve APIs, Refresh Orchestrator, WS/Long-Poll bridges, and Adapter Hub.


Targets & Guardrails

| Path | Region | p95 | p99 | Notes |
|---|---|---|---|---|
| Resolve (cache hit) | in-region | ≤ 50 ms | ≤ 150 ms | CPU-bound; Redis round-trip avoided |
| Resolve (cache miss) | in-region | ≤ 200 ms | ≤ 400 ms | Redis hit + CRDB read-through |
| Long-poll wake → 200/304 | in-region | ≤ 1.5 s | ≤ 3 s | From publish to client response |
| WS push → client revalidate | in-region | ≤ 2.5 s | ≤ 5 s | Aligns with propagation SLOs |
| Publish (snapshot → accepted) | in-region | ≤ 1.0 s | ≤ 2.0 s | Excludes canaries/approvals |

Error budgets inherit from the Observability section: Resolve availability ≥ 99.95%.


Traffic Modeling (read-heavy)

Let:

  • N_t = tenants, A = apps/tenant, E = envs/app, S = sets/env, K = keys/set
  • H = cache hit ratio at SDK (L1) → target ≥ 0.85
  • R_sdk = average client read rate per instance (req/s) when polling/long-poll refreshes
  • C = active client instances

Resolve QPS to Gateway/Registry

  • Periodic pull: QPS = C × R_sdk × (1 − H)
  • Long-poll (recommended): steady-state QPS ≈ C × (timeout⁻¹) × (1 − H), with held connections ≈ C. Example: C=20k, timeout=30 s, H=0.9 → QPS ≈ 20k × (1/30) × 0.1 ≈ 66.7 RPS.
  • WS push + conditional fetch: QPS ≈ events × fanout_selectivity × (1 − H_etag); typical << long-poll.
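
A minimal calculator for the long-poll formula, reproducing the worked example (inputs: active clients, long-poll timeout, SDK hit ratio):

// Steady-state resolve QPS reaching Gateway/Registry under long-poll refresh.
function longPollQps(clients: number, timeoutSeconds: number, hitRatio: number): number {
  return clients * (1 / timeoutSeconds) * (1 - hitRatio);
}

console.log(longPollQps(20_000, 30, 0.9).toFixed(1)); // "66.7" RPS, matching the example above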

Memory for Long-Poll Waiters

  • Per waiter ~ 1–3 KB (request ctx + selector). With C=20k waiters → 20–60 MB of heap headroom per region (well within a small pool of pods).

Redis Sizing & Topology

Key formulas

  • Hot value size B (compressed on wire): target 8–32 KB (avg take 16 KB).
  • Cache entry ≈ B × overhead_factor, where overhead (key + object + allocator) ≈ 1.3–1.5; use 1.4.
  • Working set per region WS ≈ active_keys × B × 1.4.

Shard plan

  • Aim per shard ≤ 25 GB resident (for 64 GB nodes → memory headroom & AOF/RDB).
  • Shards per region = ceil(WS / 25 GB), then ×2 for replicas.

Example (Pro/Ent region)

  • Tenants actively reading: 1,000
  • Active keys/tenant hot: 800
  • active_keys = 1,000 × 800 = 800,000
  • WS ≈ 800,000 × 16 KB × 1.4 ≈ 17.9 GB
  • Shards needed: 1 (primary) + 1 (replica) → deploy 3 shards to allow headroom and growth, then rebalance.
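
A minimal sketch of the shard math, using decimal GB and the 1.4 overhead factor from the worked example; headroom shards beyond the computed count are added per the note above:

const GB = 1e9;

// Working set and primary shard count per region (25 GB resident cap per shard).
function redisShards(activeKeys: number, avgValueBytes: number, overhead = 1.4, shardCapGB = 25) {
  const workingSetBytes = activeKeys * avgValueBytes * overhead;
  const primaries = Math.ceil(workingSetBytes / (shardCapGB * GB));
  return { workingSetGB: +(workingSetBytes / GB).toFixed(1), primaries, withReplicas: primaries * 2 };
}

console.log(redisShards(800_000, 16_000)); // { workingSetGB: 17.9, primaries: 1, withReplicas: 2 }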

Throughput

  • Single shard (modern VM, network-optimized) sustains 80–120 k ops/s p99 < 2 ms for GET/SET under pipeline.
  • With hit ratio ≥ 0.9, Registry avoids CRDB on ≥ 90% of reads.

Policies

  • Eviction: volatile-lru (Enterprise) / allkeys-lru (Starter/Pro).
  • SWR ≤ 2 s to bound staleness and prevent stampedes.
  • Hash-tagging keys by {tenant} to isolate hotspots and ease resharding.

Pod Right-Sizing & Concurrency

| Component | CPU req/lim | Mem req/lim | Concurrency (target) | Notes |
|---|---|---|---|---|
| Registry/Resolve (.NET) | 250m / 1.5c | 512 Mi / 1.5 Gi | 400–600 in-flight | Kestrel + async IO, thread-pool min tuned |
| WS Bridge | 200m / 1c | 256 Mi / 1 Gi | 5k conns/pod | uWSGI/Kestrel; heartbeat @ 30 s |
| Refresh Orchestrator | 200m / 1c | 256 Mi / 1 Gi | 200 msgs/s | CPU-light, IO-heavy |
| Adapter Hub | 300m / 1.5c | 512 Mi / 1.5 Gi | 1k ops/s | Batch writes; backpressure aware |

Autoscaling (HPA v2 + KEDA)

Registry/Resolve — HPA with custom metrics

  • Primary: in-flight request gauge http_server_active_requests (target ≤ 400/pod)
  • Secondary: CPU target 60%
  • Tertiary: p95 latency guardrail (scale-out if > 200 ms for 5 m)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: ecs-registry }
spec:
  minReplicas: 4
  maxReplicas: 48
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_server_active_requests
        target:
          type: AverageValue
          averageValue: "400"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60

WS Bridge — HPA on connection count

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: ecs-ws-bridge }
spec:
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric: { name: ws_active_connections }
        target: { type: AverageValue, averageValue: "4000" }  # keep <4k/pod

Refresh Orchestrator — KEDA on bus depth/throttle

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: ecs-refresh-orchestrator }
spec:
  scaleTargetRef: { name: ecs-refresh-orchestrator }
  pollingInterval: 5
  cooldownPeriod: 60
  minReplicaCount: 2
  maxReplicaCount: 40
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: ecs.refresh.events
        namespace: sb-ecs
        messageCount: "500"          # 1 replica per 500 pending
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: propagation_lag_ms
        # Query is illustrative; point it at the propagation-lag histogram actually emitted.
        query: histogram_quantile(0.95, sum(rate(propagation_lag_ms_bucket[5m])) by (le))
        threshold: "3000"            # scale if lag > 3s

Adapter Hub — KEDA on ASB/Rabbit and provider throttles

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: ecs-adapter-hub }
spec:
  scaleTargetRef: { name: ecs-adapter-hub }
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: adapter_throttle_total
        # Illustrative query: any throttling observed in the last 5 minutes scales the Hub up.
        query: sum(increase(adapter_throttle_total[5m]))
        threshold: "0"               # scale up if throttling detected
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: adapter_ops_backlog
        query: sum(rate(adapter_ops_queued_total[1m]))
        threshold: "200"

Load Profiles & Playbooks

LP-1 Baseline Steady State

  • Long-poll connections equal to active clients; low QPS due to conditional 304s.
  • SLO focus: latency p99 and WS stability.
  • Action: monitor waiter pool size, Redis hit ratio ≥ 0.9.

LP-2 Publish Wave (canary→regional)

  • Burst of RefreshEvents, clients revalidate.
  • Action: Orchestrator coalesces events; Registry scales on in-flight requests; Redis lock singleflight enabled.

LP-3 Tenant Hotspot

  • One tenant updates many keys; selectivity high.
  • Action: Per-tenant rate limit clamps; Redis hash-tag isolates shard; temporary override if approved.

LP-4 Cold Start / Region Failover

  • Rapid connection re-establish; cache cold.
  • Action: Pre-warm Redis with last heads; disable aggressive scale-down; surge HPA caps increased for 30 min.

LP-5 Adapter Backpressure

  • Provider throttles (Azure/AWS).
  • Action: KEDA scales Hub, adjusts batch sizes; DLQ monitored; replay after throttle clears.

Capacity Examples (per region)

Scenario A — 20k clients, long-poll 30 s, H=0.9

  • Waiters: 20k; Gateway pods with 4k conns/pod → 5 pods minimum.
  • Resolve QPS: ~67 RPS; 4 Registry pods (400 in-flight each) → ample headroom.
  • Redis: working set ≈ 10–20 GB; 3-shard cluster (1 primary, 1 replica each) for headroom.

Scenario B — 100k clients, mixed WS/poll

  • 70% WS, 30% long-poll; H=0.85.
  • Long-poll QPS ≈ 30k × (1/30) × 0.15 ≈ 150 RPS.
  • WS Bridge: 100k conns total → 25 pods (4k/pod).
  • Registry: 8–12 pods depending on p95.
  • Redis working set ≈ 40–60 GB; 3–4 primaries + replicas.

Scale numbers must be validated in perf tests; start with headroom over 30-day peak.
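A what-if sketch for the capacity workbook mentioned in the acceptance criteria, using the same assumptions as the scenarios above (≈4k connections per gateway/WS pod, ≈400 in-flight resolves per Registry pod). Latency headroom is not modeled, so treat results as lower bounds to validate in perf tests.

# Capacity what-if sketch (assumptions mirror this section: ~4k connections
# per gateway/WS pod, ~400 in-flight resolves per Registry pod; latency
# headroom is NOT modeled, so treat results as lower bounds).
from math import ceil

def plan(clients: int, poll_interval_s: float = 30.0,
         hit_ratio: float = 0.9, long_poll_share: float = 1.0,
         conns_per_pod: int = 4000, inflight_per_registry: int = 400) -> dict:
    # Only long-poll revalidations that miss the cache reach the Registry.
    resolve_qps = clients * long_poll_share / poll_interval_s * (1 - hit_ratio)
    return {
        "resolve_qps": round(resolve_qps, 1),
        "connection_pods": ceil(clients / conns_per_pod),
        "registry_pods_min": max(4, ceil(resolve_qps / inflight_per_registry)),
    }

if __name__ == "__main__":
    print(plan(20_000, hit_ratio=0.90))                        # ≈ Scenario A: ~67 RPS, 5 connection pods
    print(plan(100_000, hit_ratio=0.85, long_poll_share=0.3))  # ≈ Scenario B: ~150 RPS, 25 connection pods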


CRDB Considerations (read-through misses)

  • Keep leaseholders local to region (REGIONAL BY ROW).
  • Target miss rate ≤ 10%; CRDB nodes sized for 2–5k QPS aggregated reads with p99 < 20 ms KV.
  • Monitor txn_restarts_total; if > 2%, adjust limits and slow publishers.

Performance Testing Plan

| Tool | Purpose | Profile |
|------|---------|---------|
| k6 / Locust | Resolve & long-poll | VUs ramp to 100k conns; p95/p99 latency |
| Vegeta | Publish bursts | 10–50 RPS publish; measure propagation lag |
| Custom WS harness | Push & reconnect storms | 100k connections, 1% churn/min |
| Redis benchmark | Cache ops | pipelined GET/SET @ 128B–64KB values |

Exit criteria

  • Sustained p99 Resolve within targets at 1.5× projected peak.
  • No DLQ growth during publish wave; propagation p95 ≤ 5 s.
  • HPA/KEDA converges within < 90 s of surge onset.

Cost & Efficiency Levers

  • Prefer long-poll default; enable WS for high-value tenants only.
  • SWR and TTL jitter reduce cache churn by 20–40% under waves.
  • Coalescing ratio target ≥ 3× (one backend fetch serves ≥ 3 waiters; see the sketch after this list).
  • Scale-to-zero for low-traffic Adapters via KEDA cooldown.
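
To illustrate the coalescing lever, a minimal asyncio singleflight sketch in which concurrent resolves for the same key share one backend fetch; fetch_from_backend is a hypothetical stand-in for the Registry/CRDB read path.

# Singleflight / request-coalescing sketch: concurrent resolves for the same
# cache key share one backend fetch, which is what drives the coalescing ratio.
import asyncio

class SingleFlight:
    def __init__(self):
        self._inflight: dict[str, asyncio.Future] = {}

    async def do(self, key: str, fetch):
        if key in self._inflight:                  # another waiter already fetching
            return await self._inflight[key]
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await fetch(key)
            fut.set_result(result)
            return result
        except Exception as exc:
            fut.set_exception(exc)
            raise
        finally:
            del self._inflight[key]                # next miss triggers a fresh fetch

async def fetch_from_backend(key: str) -> str:     # hypothetical backend call
    await asyncio.sleep(0.05)                      # simulate Registry/CRDB latency
    return f"value-for-{key}"

async def main():
    sf = SingleFlight()
    results = await asyncio.gather(*[sf.do("tenant-a/billing", fetch_from_backend) for _ in range(5)])
    print(results)   # five waiters, one backend fetch -> coalescing ratio 5x

if __name__ == "__main__":
    asyncio.run(main())

The coalescing ratio is simply waiters served divided by backend fetches performed over a window.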

Runbooks (snippets)

Latency p99 regression

  1. Check coalesce_ratio and waiter_pool_size; if low, enable singleflight tuning.
  2. Inspect Redis latency_ms and evictions; add shard if > 3 ms p95 or any evictions.
  3. Verify HPA not pinned; temporarily raise maxReplicas.

Propagation lag > 8 s

  1. Validate bus consumer lag; KEDA scale orchestrator.
  2. Check WS connections; if drops > 1%/h, investigate bridge GC/ephemeral port exhaustion.
  3. Redis lock contention > 10 ms → increase lock TTL and reduce batch size.

Acceptance Criteria (engineering hand-off)

  • HPA manifests for Registry and WS Bridge; KEDA ScaledObjects for Orchestrator and Adapter Hub committed.
  • Redis cluster charts with shard math and hash-tag strategy documented; alerts for evictions>0.
  • Perf harnesses & CI jobs executing weekly; publish p95/p99 & coalescing ratio to dashboards.
  • Capacity workbook with what-if calculators (tenants, clients, TTL, H) shared with SRE.
  • Runbooks for failover, surge, hot tenant, adapter throttle scenarios.

Solution Architect Notes

  • Maintain two autoscaling lanes: fast (in-flight, latency) and slow (CPU). Tune stabilization to avoid oscillation.
  • Revisit connection per pod limits after GC tuning; consider SO_REUSEPORT listeners for WS scale.
  • For very large tenants, consider enterprise per-tenant Redis or keyspace quotas to cap blast radius.
  • Evaluate async Resolve (server push of materialized blobs to edge caches) if hit ratio < 0.8 at scale.

Resiliency & Chaos — timeouts/retries/backoff, bulkheads, circuit breakers, chaos experiments & runbooks

Objectives

Engineer ECS to degrade gracefully under dependency faults, traffic spikes, and regional impairments. This section codifies timeouts/retries/backoff, bulkheads, circuit breakers, and a chaos program with experiments and runbooks. Defaults are opinionated yet overridable via per-service policy.


Resiliency Profiles (standardized)

# Resiliency profiles are shipped as config + code defaults; services load on boot.
profiles:
  standard:
    timeouts:
      http_connect_ms: 2000
      http_overall_ms: 4000
      grpc_unary_deadline_ms: 250          # Resolve path (in-region)
      grpc_batch_deadline_ms: 2000
      grpc_stream_idle_min: 60
      redis_ms: 200
      sql_read_ms: 500
      sql_write_ms: 1500
      bus_send_ms: 2000
    retries:
      strategy: "decorrelated_jitter"      # FullJitter/EqualJitter allowed
      max_attempts: 3
      base_delay_ms: 100
      max_delay_ms: 800
      retry_on: ["UNAVAILABLE","DEADLINE_EXCEEDED","RESOURCE_EXHAUSTED","5xx"]
      idempotency_required: true           # enforced by client for POST/PUT/DELETE
    circuit_breaker:
      sliding_window: "rolling"
      window_size_sec: 30
      failure_rate_threshold_pct: 50
      min_throughput: 20
      consecutive_failures_to_open: 5
      open_state_duration_sec: 30
      half_open_max_concurrent: 5
    bulkheads:
      max_inflight_per_pod: 600            # Registry
      per_tenant_max_inflight: 80
      redis_pool_size: 256
      sql_pool_size: 64
      http_pool_per_host: 128
    fallback:
      serve_stale_while_revalidate_ms: 2000
      long_poll_timeout_s: 30
      downgrade_to_poll_on_ws_fail: true
  conservative:                             # for cross-region or under incident
    timeouts:
      grpc_unary_deadline_ms: 400
      redis_ms: 300
    retries:
      max_attempts: 2
      base_delay_ms: 200
      max_delay_ms: 1200
    circuit_breaker:
      open_state_duration_sec: 60

Decorrelated jitter backoff (pseudo):

sleep = min(max_delay, random(b, sleep * 3))   # b = base_delay; start with sleep=b
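
The same policy as a runnable sketch using the standard-profile budgets (3 attempts, 100 ms base, 800 ms cap); op and the transient predicate are placeholders supplied by the caller.

# Decorrelated-jitter retry sketch matching the pseudo-code above:
# sleep = min(max_delay, random(base, previous_sleep * 3)), starting at base.
import random
import time

def retry_with_decorrelated_jitter(op, *, max_attempts=3,
                                   base_delay_ms=100, max_delay_ms=800,
                                   transient=lambda exc: True):
    sleep_ms = base_delay_ms
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception as exc:
            if attempt == max_attempts or not transient(exc):
                raise                              # budget spent or non-retryable error
            sleep_ms = min(max_delay_ms, random.uniform(base_delay_ms, sleep_ms * 3))
            time.sleep(sleep_ms / 1000.0)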

Timeouts, Retries & Backoff — per dependency

| Caller → Callee | Timeout | Retries (strategy) | Preconditions |
|-----------------|---------|--------------------|---------------|
| SDK → Registry (Resolve unary) | 250–400 ms deadline | 2–3 (jitter) | Always send If-None-Match ETag; idempotent |
| SDK ↔ WS Bridge (push) | Heartbeat 15–30 s, close after 3 misses | Reconnect with backoff 0.5–5 s | Resume token supported |
| Registry → Redis | 200 ms | 2 (jitter 50–250 ms) | Use singleflight lock on fill |
| Registry → CRDB (read) | 500 ms | 1 (if RETRY_SERIALIZABLE) | Read-only; bounded scans |
| Registry → CRDB (write) | 1.5 s | 2 (txn restart only) | Idempotency key on publish |
| Orchestrator → Bus | 2 s | 5 (exp jitter up to 60 s) | Outbox pattern ensures atomicity |
| Hub → Adapters | 1–3 s per op | 3 (provider-aware backoff) | Idempotency key mandatory |
| Adapter → Provider (watch) | Stream idle 60 min | Auto-resume with bookmark | Backpressure aware |
| Gateway → PDP | 5–20 ms | 0 (timeout only) | Fallback deny on timeout (safe default) |

Rules

  • Retries only on transient errors. Never retry on 4xx (except 429 with Retry-After).
  • Mutations require Idempotency-Key; otherwise retries are disabled.
  • Budget: total attempt duration ≤ caller timeout; avoid retry storms.

Bulkheads & Isolation

Concurrency bulkheads

  • Per-pod: cap in-flight requests (Registry ≤ 600), queue excess briefly (≤ 50 ms) then 429.
  • Per-tenant: token bucket at Gateway; defaults by edition (Starter 50 RPS, Pro 200, Ent 1000; burst ×2).
  • Per-dependency pools: separated HttpClient pools per host; Redis/SQL dedicated pools with backpressure.

Resource bulkheads

  • Thread pools: pre-warm worker count; limit sync over async.
  • Queue isolation: separate DLQ/parking lots per consumer to avoid global blockage.
  • Waiter map (long-poll): bounded dictionary; spill to 304 on exhaustion.

Blast-radius controls

  • Hash-tag keys {tenant} in Redis; circuit break per tenant first, then globally only if necessary.
  • Rate-limit publish waves (canary → region) via Orchestrator.

Circuit Breakers

State machine (per dependency, per tenant when applicable)

stateDiagram-v2
  [*] --> Closed
  Closed --> Open: failure rate > threshold OR 5 consecutive failures
  Open --> HalfOpen: after open_duration
  HalfOpen --> Closed: probe successes (>=3/5)
  HalfOpen --> Open: any failure
Hold "Alt" / "Option" to enable pan & zoom

Telemetry & policy

  • Emit ecs.circuit.state gauge {service, dependency, tenant?, state}.
  • Closed: normal.
  • Open: SDKs and services fail fast with fallback (serve stale, downgrade to poll, deny writes).
  • Half-open: limit concurrent probes to ≤5.
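
A compact sketch of the breaker state machine above (consecutive-failure trip, timed open state, capped half-open probes); the rolling failure-rate window and per-tenant keying are omitted for brevity.

# Circuit-breaker sketch: Closed -> Open on consecutive failures,
# Open -> HalfOpen after the open duration, HalfOpen -> Closed on probe successes.
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open; callers should use their fallback."""

class CircuitBreaker:
    def __init__(self, failures_to_open=5, open_duration_s=30.0,
                 half_open_max_concurrent=5, probe_successes_to_close=3):
        self.failures_to_open = failures_to_open
        self.open_duration_s = open_duration_s
        self.half_open_max = half_open_max_concurrent
        self.probes_needed = probe_successes_to_close
        self.state = "closed"
        self.consecutive_failures = 0
        self.probe_successes = 0
        self.half_open_inflight = 0
        self.opened_at = 0.0

    def call(self, op):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.open_duration_s:
                raise CircuitOpenError("fail fast; serve stale / downgrade to poll")
            self.state, self.probe_successes = "half_open", 0      # start probing
        probing = self.state == "half_open"
        if probing:
            if self.half_open_inflight >= self.half_open_max:
                raise CircuitOpenError("probe concurrency cap reached")
            self.half_open_inflight += 1
        try:
            result = op()
        except Exception:
            self.consecutive_failures += 1
            if probing or self.consecutive_failures >= self.failures_to_open:
                self.state, self.opened_at = "open", time.monotonic()   # trip
            raise
        else:
            if probing:
                self.probe_successes += 1
                if self.probe_successes >= self.probes_needed:
                    self.state = "closed"                               # recovered
            self.consecutive_failures = 0
            return result
        finally:
            if probing:
                self.half_open_inflight -= 1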

Degrade Modes & Fallbacks

| Fault | Degrade Mode | Fallback |
|-------|--------------|----------|
| WS Bridge down | Switch clients to long-poll; disable optional features (live diff) | Increase poll timeout to 45 s, jitter |
| Redis unavailable / high latency | Bypass Redis; SWR serve stale ≤ 2 s; coalesce requests | Tighten per-tenant concurrency; warm cache after recovery |
| CRDB read slowness | Raise Resolve deadlines to 400 ms; prefer cache; shed low-priority traffic | 429 on hot tenants; backpressure publishes |
| Bus throttling | Slow fan-out, batch invalidations | Replay after throttle; keep SDK long-poll working |
| Adapter throttling | Reduce batch size; backoff; serialize writes | Queue to outbox; dry-run validation only |
| PDP degraded | Gateway ext_authz TTL uses last good (≤ 60 s) | Fail-closed on admin ops; show banner in Studio |

Chaos Engineering Program

Steady-state hypothesis: Under defined faults, SLOs hold, or the system degrades predictably (bounded latency, clear errors), and recovers automatically without manual intervention.

Experiments Matrix

| ID | Fault Injection | Scope | Hypothesis | Success Criteria |
|----|-----------------|-------|------------|------------------|
| C-1 | Add 200 ms latency to Redis | Single region, 15 min | p99 Resolve ≤ 400 ms; cache hit ratio dips < 10% | No SLO breach; circuits stay closed; SWR < 2 s |
| C-2 | Redis node kill/failover | 1 shard primary | Operations continue; brief p99 blip | No data loss; recovery < 30 s; 0 evictions |
| C-3 | CRDB leaseholder move | Hot range | Miss path p99 ≤ 500 ms | Txn restarts < 3%; no timeouts |
| C-4 | Bus 429/ServerBusy | Orchestrator consumers | Propagation p95 ≤ 8 s | DLQ stable; replay after clear |
| C-5 | WS Bridge crash loop | Region | Clients auto-downgrade to long-poll | Connections restore; no error budget burn |
| C-6 | DNS blackhole to Adapter provider | Single binding | Hub circuits open per binding only | Other tenants unaffected; queued ops replay |
| C-7 | Partial partition (drop 10% packets) | Gateway↔Registry | Retries/backoff prevent storms | No cascading failure; HPA scales sanely |
| C-8 | Token/JWKS rollover mid-traffic | Edge | Zero auth errors beyond overlap | Authn failure rate < 0.1% spike |
| C-9 | Time skew 2 min on 10% pods | Mixed services | Token validation & TTLs tolerate the skew | No systemic 401/304 anomalies |
| C-10 | Region failover drill | Full region | RTO within playbook; SLOs in surviving regions | No cross-tenant leakage; clear comms |

Tooling

  • Layer 4/7 faults: Envoy fault filters, ToxiProxy.
  • Platform: chaos mesh or k6 chaos; K8s PDBs validated.
  • Data: CRDB workload generator, leaseholder move via ALTER TABLE ... EXPERIMENTAL_RELOCATE.

Schedule

  • GameDays quarterly, rotating owners; pre-approved windows.
  • Results recorded with hypotheses, evidence, fixes.

Runbooks (actionable)

RB-1 Resolve p99 > target (sustained 10 min)

  1. Dashboards: check cache_hits_total, redis.latency_ms, coalesce_ratio, waiter_pool_size.
  2. If Redis latency > 3 ms p95 → add shard, enable client-side SWR (2 s), increase Redis pool.
  3. If coalesce_ratio < 2 → raise waiter debounce to 300–500 ms.
  4. HPA: verify Registry replicas not capped; raise maxReplicas temporarily.

RB-2 429 spikes for a tenant

  1. Confirm token bucket at Gateway; inspect tenant QPS.
  2. If legitimate spike, increase burst temporarily; otherwise enable cooldown and inform tenant.
  3. Enable feature flag to reduce poll frequency for that tenant.

RB-3 DLQ growth in refresh pipeline

  1. Inspect latest DLQ messages (reason); if transient → bulk replay.
  2. If schema/contract mismatch → parking lot and open incident; restrict publish to canary.
  3. Scale Orchestrator via KEDA; check bus quotas.

RB-4 Adapter throttling

  1. Reduce batch size 50%, increase backoff to max 60 s.
  2. Mark binding degraded; notify tenant (Studio banner).
  3. After clear, replay pending ops; compare provider vs desired.

RB-5 WS instability (drop rate > 1%/h)

  1. Check ephemeral ports, GC pauses on pods; rotate nodes if needed.
  2. Switch affected tenants to long-poll via feature flag.
  3. Audit heartbeats & missed count; tune keep-alive timeouts.

RB-6 CRDB restart spikes / txn restarts > 3%

  1. Inspect hot ranges; consider index tweak or split.
  2. Increase SQL pool temporarily; ensure queries are parameterized and short.
  3. Scale Registry reads; verify leaseholder locality.

Implementation Guidance (services & SDKs)

.NET services

  • Use Polly (or built-in resilience pipeline in .NET 8) for retry/circuit/bulkhead.
  • One HttpClient per dependency/host; enable HTTP/2 for gRPC.
  • Redis: use pipelining and cancellation tokens; set SyncTimeout=redis_ms.

JS/TS SDK

  • Backoff via fetch wrapper; AbortController for deadlines.
  • WS reconnect with decorrelated jitter; resumeAfter cursor.
  • Persistent cache guarded by ETag; SWR toggled via server hint.

Mobile

  • Network reachability gating; background fetch limit; throttle under battery saver.

Config knobs (server)

  • All limits under resiliency.* with tenant/edition overrides; hot-reload on config change.
  • Emit current effective policy into metrics (resiliency_profile{name="standard"} gauge = 1).

Testing & Verification

  • Unit: retry/circuit behavior with virtual time; idempotency invariants.
  • Contract: simulate 429/503/UNAVAILABLE/DEADLINE_EXCEEDED across clients.
  • Load: soak tests 24h with chaos toggles (C-1..C-7).
  • Failover drills: at least semi-annual region evacuation exercises.

Acceptance Criteria (engineering hand-off)

  • Resiliency profile library packaged; services load standard by default with overrides.
  • Polly/Resilience pipelines configured for Registry, Orchestrator, Hub, WS Bridge with metrics and logs.
  • Circuit breaker telemetry and alerts active (ecs.circuit.state, failure_rate).
  • Chaos runners and manifests for experiments C-1..C-7 in staging; results documented.
  • Runbooks RB-1..RB-6 linked in on-call docs; PagerDuty alerts mapped to owners.

Solution Architect Notes

  • Keep retry budgets tight; retries amplify load—prefer fail fast + fallback.
  • Favor per-tenant breakers to preserve global health during hot-tenant incidents.
  • Extend chaos to client side (SDKs) in sandbox apps to validate downgrade paths (WS→poll, stale serves).
  • Consider adaptive concurrency (AIMD) at Gateway if 429s recur across many tenants.

Migration & Import — bootstrap pathways, bulk import, diff reconcile, blue/green config cutover

Objectives

Provide safe, repeatable pathways to bring existing configurations into ECS, reconcile differences, and execute a zero-downtime cutover to ECS-managed configuration. This section defines bootstrap options, bulk import contracts, three-way diff & reconcile, and blue/green cutover patterns with runbooks and acceptance criteria.


Migration Personas & Starting Points

| Persona | Starting System | Typical Shape | Primary Path |
|---------|-----------------|---------------|--------------|
| Platform Admin | Fresh tenant | No prior config | Bootstrap Empty (starter templates) |
| SRE/DevOps | Azure AppConfig / AWS AppConfig | Hierarchical keys + labels | Provider-Sourced Import via Adapter Hub |
| Backend Dev | Files in Git (JSON/YAML) | Namespaced files per env | File-Based Import (CLI/Studio) |
| Operations | Consul/Redis/SQL | Flat/prefix keys | Provider-Sourced Import + Mapping Rules |

Bootstrap Pathways

1) Bootstrap Empty (Templates)

  • Create tenant, apps, envs, namespaces.
  • Seed starter Config Sets and policy packs (edition overlays).
  • Protect with schema validation out of the gate.
sequenceDiagram
  participant Admin
  participant Studio
  participant Registry
  Admin->>Studio: Create Tenant/Apps/Envs (wizard)
  Studio->>Registry: POST /tenants/... (idempotent)
  Registry-->>Studio: Seed templates + policies
Hold "Alt" / "Option" to enable pan & zoom

2) Provider-Sourced Import (Adapters)

  • Use Adapter Hub to read from source (Azure/AWS/Consul/Redis/SQL).
  • Produce intermediate snapshot in ECS format.
  • Run validate + diff against ECS baseline; reconcile, then publish.

3) File-Based Import (CLI/Studio)

  • Upload ZIP containing manifest.yaml + configs/*.json|yaml.
  • CLI validates schema locally, calculates hash, and performs idempotent batch import.

Import Data Model & Contracts

Canonical Import Manifest (YAML)

apiVersion: ecs.migration/v1
kind: ImportBundle
metadata:
  tenantId: t-123
  source: azure-appconfig://appcfg-prod?label=prod
  changeId: mig-2025-08-25T10:00Z  # idempotency key
spec:
  defaultEnvironment: prod
  mappings:
    - sourcePrefix: "apps/billing"
      targetNamespace: "billing"
      environment: "prod"
      keyTransform: "stripPrefix('apps/billing/')"  # helpers: stripPrefix, toKebab, replace
      contentTypeRules:
        - match: "**/*.json"
          contentType: "application/json"
  items:
    - key: "apps/billing/db/connectionString"
      valueRef: "kvref://vault/billing/db-conn#v3"   # secrets as refs
      meta: { labels: ["prod","blue"] }
    - key: "apps/billing/featureToggles/enableFoo"
      value: true
      meta: { contentType: "application/json" }

Rules

  • Secrets: only valueRef (kvref://…) allowed for sensitive fields; plaintext rejected by policy.
  • Idempotency: metadata.changeId required; server dedupes full bundle and each batch chunk.
  • Size limits: default 5k items/bundle (configurable); chunks of ≤500 items.

REST & CLI

POST /v1/imports (multipart/zip or application/json)
Headers:
  Idempotency-Key: mig-2025-08-25T10:00Z
Response: { importId, statusUrl }

GET  /v1/imports/{importId}/status

CLI

ecsctl import apply ./bundle.zip --tenant t-123 --dry-run
ecsctl import plan ./bundle.zip   # prints diff summary
ecsctl import approve <importId>  # kicks reconcile+publish with policy gates

Three-Way Diff & Reconcile

States

  • Desired: Import bundle (or provider snapshot post-mapping)
  • Current: ECS head (latest published Snapshot/Version)
  • Last-Applied: Previous import’s applied hash (if any) to avoid flip-flop
flowchart LR
  D[Desired] --- R{Reconcile Engine}
  C[Current] --- R
  L[Last-Applied] --- R
  R --> Patch[JsonPatch + Semantic diff]
  Patch --> Plan[Change Plan: upserts/deletes/moves]
Hold "Alt" / "Option" to enable pan & zoom

Algorithm

  1. Normalize keys & canonicalize JSON.
  2. Compute structural diff (RFC 6902) and semantic annotations (breaking/additive).
  3. Build Change Plan:
    • Group by namespace/env.
    • Respect ignore rules (e.g., provider metadata keys).
    • For conflicts (key renamed vs deleted), prefer rename if mapping rule indicates.
  4. Validate with JSON Schema + Policy PDP → may yield obligations (approvals, canary).

Outcomes

  • Dry-run report: counts (add/remove/replace), risk score, policy obligations.
  • Apply mode: create Draft, then Snapshot on success (idempotent by content hash).
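
A simplified, key-level sketch of the three-way reconcile (the real engine also emits RFC 6902 patches, semantic annotations, and policy obligations); desired, current, and last-applied are plain dicts here.

# Three-way reconcile sketch: Desired vs Current vs Last-Applied -> Change Plan.
# Keys present in Last-Applied but now absent from Desired are deletions we own;
# keys we never applied are left untouched (avoids clobbering out-of-band edits).
def reconcile(desired: dict, current: dict, last_applied: dict) -> dict:
    plan = {"upserts": {}, "deletes": [], "unchanged": []}
    for key, value in desired.items():
        if current.get(key) != value:
            plan["upserts"][key] = value           # add or replace
        else:
            plan["unchanged"].append(key)
    for key in last_applied:
        if key not in desired and key in current:
            plan["deletes"].append(key)            # previously ours, now removed
    return plan

if __name__ == "__main__":
    desired = {"billing/db/pool": 50, "billing/featureToggles/enableFoo": True}
    current = {"billing/db/pool": 20, "billing/legacy/key": "x"}
    last_applied = {"billing/db/pool": 20, "billing/legacy/key": "x"}
    print(reconcile(desired, current, last_applied))
    # {'upserts': {'billing/db/pool': 50, 'billing/featureToggles/enableFoo': True},
    #  'deletes': ['billing/legacy/key'], 'unchanged': []}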

Bulk Import from Providers (Adapters)

sequenceDiagram
  participant Admin
  participant Hub as Adapter Hub
  participant A as Provider Adapter
  participant Reg as Registry
  Admin->>Hub: StartImport(tenant/env/ns, source)
  Hub->>A: List(prefix, pageSize=500)
  A-->>Hub: Items(page) + etags
  loop until done
    Hub->>Reg: POST /imports:chunk (batch=≤500)
    Reg-->>Hub: ChunkAccepted (hash)
    Hub->>A: nextPage()
  end
  Hub->>Reg: FinalizeImport(changeId)
  Reg-->>Admin: Plan ready (diff + obligations)
Hold "Alt" / "Option" to enable pan & zoom

Resiliency

  • Each chunk has idempotency key: <changeId>-<pageNo>.
  • Backpressure: throttle to respect provider quotas; exponential backoff on 429/ServerBusy.
  • DLQ for malformed items with replay token.

Blue/Green Config Cutover

Goal: Switch consumers from Blue (current alias) to Green (new snapshot) without downtime and with instant rollback.

Mechanics

  • Publish new snapshot → tag semver (e.g., v1.9.0).
  • Pin environment alias: prod-next → v1.9.0 (Green).
  • Canary rollout (optional): restrict RefreshEvents to a slice of services.
  • Promote: flip prod-current alias from Blue → Green.
  • Rollback: re-point prod-current to prior Blue snapshot (previous alias pointer).
sequenceDiagram
  participant Studio
  participant Registry
  participant Orchestrator
  Studio->>Registry: Tag v1.9.0; Alias prod-next -> v1.9.0
  Orchestrator-->>Services: Refresh(selectors=canary)
  Note right of Services: validate metrics & errors
  Studio->>Registry: Alias prod-current -> v1.9.0 (cutover)
  Orchestrator-->>All Services: Refresh(all)
  Studio->>Registry: Rollback (if needed) -> prod-current -> v1.8.3
Hold "Alt" / "Option" to enable pan & zoom

Guarantees

  • Cutover changes only alias pointers (constant-time).
  • SDKs always revalidate via ETag; if unchanged, 304 path is cheap.
  • Rollback emits ConfigPublished for the restored alias, preserving at-least-once semantics.
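
A tiny sketch of the alias mechanics showing why cutover and rollback are constant-time pointer flips; the emit hook is a hypothetical stand-in for the ConfigPublished event.

# Blue/green alias sketch: cutover and rollback only re-point prod-current.
class AliasStore:
    def __init__(self):
        self.aliases: dict[str, str] = {}       # alias name -> snapshot version
        self.history: list[tuple[str, str]] = []

    def set_alias(self, alias: str, version: str, emit=print):
        previous = self.aliases.get(alias)
        self.aliases[alias] = version                        # O(1) pointer flip
        self.history.append((alias, version))
        emit({"type": "ConfigPublished", "alias": alias,     # hypothetical event hook
              "version": version, "previous": previous})

    def rollback(self, alias: str, emit=print):
        prior = [v for a, v in self.history[:-1] if a == alias]
        if prior:
            self.set_alias(alias, prior[-1], emit)           # re-point to prior snapshot

if __name__ == "__main__":
    store = AliasStore()
    store.set_alias("prod-next", "v1.9.0")      # canary target
    store.set_alias("prod-current", "v1.8.3")   # blue
    store.set_alias("prod-current", "v1.9.0")   # cutover (green)
    store.rollback("prod-current")              # instant rollback to v1.8.3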

Workflows

A) File-Based Import → Plan → Approve → Cutover

  1. Prepare bundle (ecsctl import plan).
  2. Dry-run in Studio: policy results + risk score; attach change window if prod.
  3. Approve (SoD, required approvers).
  4. Apply: create Draft → Snapshot; tag vX.Y.Z.
  5. Canary (optional): prod-next → vX.Y.Z; verify metrics.
  6. Cutover: prod-current → vX.Y.Z.
  7. Audit: export plan, approvals, result.

B) Provider-Sourced Live Sync (One-time Migration)

  1. Use Adapter to List source keys; apply mappings.
  2. Run Reconcile; fix schema violations (add secret refs).
  3. Freeze writes at source (change window).
  4. Apply plan; cutover aliases.
  5. Lock ECS as source of truth (Optional: disable reverse sync).

Mapping & Transformation Rules

| Capability | Options |
|------------|---------|
| Key transforms | stripPrefix, replace(pattern,repl), toKebab, toCamel, lowercase |
| Label/env mapping | Provider labels → ECS environment or meta.labels[] |
| Content types | Infer from extension or explicit contentTypeRules |
| Secret detection | Regex + schema hints → require kvref:// |
| Ignore sets | Drop provider control keys (e.g., _meta, __system/*) |

Validation

  • Mapping rules tested in Preview panel with sampled keys.
  • Rules stored with ImportBundle to ensure reproducibility.

Safety & Idempotency

  • Idempotency-Key on bundle + chunks; server returns 200/AlreadyApplied if identical.
  • Snapshot creation idempotent by content hash; repeat imports do not create duplicates.
  • Write guards: In prod, publish requires approvals & change window policy pass.
  • Quotas: import throughput rate-limited per tenant; defaults 200 items/s.

Observability

  • Spans: ecs.migration.import.plan, ecs.migration.import.apply, ecs.migration.reconcile, ecs.migration.cutover.
  • Metrics:
    • import_items_total{result=applied|skipped|invalid}
    • reconcile_conflicts_total{type=rename|delete|typeMismatch}
    • cutover_duration_ms
    • rollback_invocations_total
  • Logs: structured records with changeId, bundleHash, planHash; no values logged.

Failure Modes & Recovery

| Failure | Symptom | Action |
|---------|---------|--------|
| Schema violation | Plan shows blocking errors | Fix mapping or schema; re-plan |
| Secret plaintext detected | Policy DENY | Convert to kvref://; re-plan |
| Provider throttling | Slow import | Hub backs off; resumes; no data loss |
| Partial apply due to timeout | Some chunks pending | Re-POST with same changeId; idempotent |
| Bad cutover | Error spikes | Flip alias back to Blue; open incident; analyze diff |

Runbooks

RB-M1 Plan & Dry-Run

  1. ecsctl import plan → review counts & policy summary.
  2. If risk ≥ threshold, escalate to Approver/Tenant Admin.

RB-M2 Approval & Apply

  1. Ensure change window active for prod.
  2. Approve in Studio; monitor pdp_decisions_total for obligations.
  3. Apply; verify publish_total{result="success"}.

RB-M3 Canary & Cutover

  1. Point prod-next to new version; watch propagation_lag_ms and service error %.
  2. If stable for N minutes, flip prod-current.
  3. If regression, rollback immediately (alias revert).

RB-M4 Provider Freeze & Final Sync

  1. Freeze writes on source; take final snapshot via adapter.
  2. Re-plan; apply minimal delta.
  3. Mark ECS authoritative; decommission source path.

Acceptance Criteria (engineering hand-off)

  • Import API + CLI support ZIP & JSON; enforces Idempotency-Key and size limits.
  • Reconcile engine implements three-way diff with semantic annotations and ignore sets.
  • Adapter Hub path supports paged list, chunked import, backoff, DLQ & replay.
  • Studio provides Plan view (diff + policy), Approval wiring, Cutover and Rollback buttons with audit.
  • Aliasing supports prod-next/prod-current conventions; cutover & rollback are O(1) pointer flips with events emitted.
  • End-to-end tests: file import, provider import, conflict resolution, blue/green cutover, rollback.

Solution Architect Notes

  • Prefer one-time import then lock ECS as source of truth to avoid dual-write drift; if bi-directional is unavoidable, enforce adapter watch + reconcile with clear owner.
  • Keep mapping rules versioned with import artifacts; they are part of compliance evidence.
  • For very large tenants, stage import by namespace and use canary cutovers per namespace to reduce risk.
  • Consider a read-only preview environment wired to prod-next for smoke testing with synthetic traffic before global cutover.

Compliance & Auditability — audit schema, retention, export APIs, SOC2/ISO hooks, PII posture

Objectives

Establish a tamper-evident, privacy-aware audit layer with clear retention policies, export/eDiscovery APIs, and baked-in hooks for SOC 2 and ISO 27001 evidence. Guarantee multi-tenant isolation, cryptographic integrity, and least-PII practices across all audit data.


Audit Model & Tamper Evidence

Event domains

  • Config Lifecycle: draft edits, validations, snapshots, tags/aliases, deployments, rollbacks.
  • Policy & Governance: PDP decisions, risk scores, obligations, approval requests/grants/rejects, change windows.
  • Access & Security: logins (Studio), token failures, role/permission changes, break-glass usage.
  • Adapters & Refresh: provider sync start/finish, drift detected, cache invalidations, replay actions.
  • Administrative: retention changes, exports, legal holds, backup/restore actions.

Canonical event (JSON Schema 2020-12 excerpt)

{
  "$id": "https://schemas.connectsoft.io/ecs/audit-event.json",
  "type": "object",
  "required": ["eventId","tenantId","time","actor","action","resource","result","prevHash","hash"],
  "properties": {
    "eventId": { "type": "string", "description": "UUIDv7" },
    "tenantId": { "type": "string" },
    "time": { "type": "string", "format": "date-time" },
    "actor": {
      "type": "object",
      "properties": {
        "subHash": { "type": "string" },          // salted hash, not raw subject
        "iss": { "type": "string" },
        "type": { "enum": ["user","service","admin","platform-admin"] },
        "ip": { "type": "string" }                // truncated / anonymized per policy
      }
    },
    "action": { "type": "string", "enum": [
      "Config.DraftEdited","Config.SnapshotCreated","Config.TagUpdated","Config.AliasUpdated",
      "Config.Published","Config.RolledBack","Policy.Decision","Policy.Updated",
      "Approval.Requested","Approval.Granted","Approval.Rejected",
      "Access.RoleChanged","Access.BreakGlass","Adapter.SyncCompleted","Refresh.Invalidate",
      "Export.Started","Export.Completed","Retention.Updated","Backup.Restore"
    ]},
    "resource": {
      "type": "object",
      "properties": {
        "type": { "enum": ["ConfigSet","Snapshot","Policy","Approval","Role","Adapter","Export","Tenant"] },
        "id": { "type": "string" },
        "path": { "type": "string" }             // never secret values; path hashed if sensitive
      }
    },
    "result": { "type": "string", "enum": ["success","denied","error"] },
    "diffSummary": { "type": "object", "properties": { "breaking": {"type":"integer"}, "additive":{"type":"integer"}, "neutral":{"type":"integer"} } },
    "policy": { "type":"object", "properties": { "effect":{"enum":["allow","deny","obligate"]}, "rulesMatched":{"type":"array","items":{"type":"string"}} } },
    "approvals": { "type":"array", "items": { "type":"object", "properties": { "bySubHash":{"type":"string"}, "at":{"type":"string","format":"date-time"}, "result":{"enum":["granted","rejected"]} } } },
    "etag": { "type": "string" },
    "version": { "type": "string" },
    "prevHash": { "type": "string" },
    "hash": { "type": "string" },                // SHA-256(prevHash || canonicalBody)
    "signature": { "type": "string" }            // optional JWS for daily manifests
  }
}

Hash chain & manifests

  • Per-tenant, per-day hash chain: every event stores prevHash and hash=SHA-256(prevHash||canonicalBody).
  • Daily manifest per tenant: { day, firstEventId, lastEventId, rootHash, count }, signed (JWS) with KMS key (Enterprise).
  • Verification: ecsctl audit verify --tenant t --from 2025-08-01 --to 2025-08-31 reconstructs chains and validates signatures.
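
A minimal sketch of the chain computation and verification; sorted-key JSON stands in for the production canonical form, and JWS manifest signing is out of scope.

# Audit hash-chain sketch: hash = SHA-256(prevHash || canonicalBody), per tenant/day.
import hashlib
import json

def canonical(body: dict) -> bytes:
    # Stand-in canonical form: compact JSON with sorted keys.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()

def chain(events: list[dict], genesis: str = "0" * 64) -> list[dict]:
    prev, out = genesis, []
    for body in events:
        h = hashlib.sha256(prev.encode() + canonical(body)).hexdigest()
        out.append({**body, "prevHash": prev, "hash": h})
        prev = h
    return out

def verify(chained: list[dict], genesis: str = "0" * 64) -> bool:
    prev = genesis
    for ev in chained:
        body = {k: v for k, v in ev.items() if k not in ("prevHash", "hash")}
        if ev["prevHash"] != prev or \
           ev["hash"] != hashlib.sha256(prev.encode() + canonical(body)).hexdigest():
            return False
        prev = ev["hash"]
    return True   # the final hash is the candidate rootHash for the daily manifest

events = [{"action": "Config.Published", "tenantId": "t-123"},
          {"action": "Config.RolledBack", "tenantId": "t-123"}]
assert verify(chain(events))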

Storage Tiers & Retention

| Tier | Store | Purpose | Default Retention (Starter / Pro / Enterprise) |
|------|-------|---------|-----------------------------------------------|
| Hot | CockroachDB (event_audit) | Recent queries, Studio timelines | 90d / 180d / 365d |
| Warm | Object store (Parquet, partitioned by tenantId/day) | eDiscovery, analytics | — / 12m / 36m |
| Cold/Archive | Object archive (WORM optional) | Long-term compliance | — / — / 7y |

Mechanics

  • Hot tier uses row-level TTL (ttl_expires_at) with hourly jobs.
  • Nightly compaction/export → Parquet (snappy), plus signed manifest & rootHash.
  • Legal hold flag on tenant prevents TTL purge and export deletion; records linked ticket id & actor.

Export & eDiscovery APIs

Filters & Query DSL

  • Filterable fields: time range, tenantId, environment, action, resource.type/id, actor.type, result, correlationId, policy.effect.
  • Simple DSL (AND by default, OR with |): time >= 2025-08-01 AND action:Config.Published AND result:success AND env:prod
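
An illustrative parse of the AND-only DSL path into filter tuples (the OR operator and value typing are left out):

# Minimal audit-query DSL sketch: split on AND, accept "field OP value" or "field:value".
import re

_TERM = re.compile(r"^\s*(\S+)\s*(>=|<=|>|<|=|:)\s*(.+?)\s*$")

def parse_query(dsl: str) -> list[tuple[str, str, str]]:
    filters = []
    for term in re.split(r"\s+AND\s+", dsl.strip()):
        m = _TERM.match(term)
        if not m:
            raise ValueError(f"unparseable term: {term!r}")
        field, op, value = m.groups()
        filters.append((field, "=" if op == ":" else op, value))
    return filters

print(parse_query("time >= 2025-08-01 AND action:Config.Published AND result:success AND env:prod"))
# [('time', '>=', '2025-08-01'), ('action', '=', 'Config.Published'),
#  ('result', '=', 'success'), ('env', '=', 'prod')]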

Endpoints

POST /v1/audit/exports
Body:
{
  "tenantId": "t-123",
  "query": "time >= 2025-08-01 AND action:Config.Published",
  "format": "parquet|csv|ndjson",
  "redaction": { "hashSubjects": true, "truncateIp": true, "dropFields": ["resource.path"] },
  "sign": true,                      // Enterprise: attach JWS over manifest
  "encryption": "kms://key-ref",     // optional server-side encryption for export files
  "notify": ["mailto:secops@tenant.com"]
}

GET  /v1/audit/exports/{exportId}/status
GET  /v1/audit/exports/{exportId}/download   // presigned URL, time-limited
DELETE /v1/audit/exports/{exportId}          // marks for deletion if not on legal hold

Streaming/listing

GET /v1/audit/events?from=...&to=...&action=...&pageSize=1000&pageToken=...

Export manifest (signed)

{
  "exportId":"e-7c..",
  "tenantId":"t-123",
  "range":{"from":"2025-08-01T00:00:00Z","to":"2025-08-15T23:59:59Z"},
  "count": 12876,
  "files":[{"path":"s3://.../part-0001.snappy.parquet","sha256":"..."}],
  "rootHash":"...", "signature":"eyJhbGciOiJQUzI..."  // optional
}

Audit Access Model

| Role | Capabilities |
|------|--------------|
| Viewer | Read hot timeline for own tenant; no export |
| Security Auditor | Create/download exports with redaction presets; verify chains |
| Tenant Admin | Manage retention (within policy bounds), legal holds, export keys |
| Platform Admin | Cross-tenant export (break-glass only, ticket required) |

Segregation of Duties: authors of changes cannot approve their own changes and cannot delete/alter audit data. All audit access is itself audited.


SOC 2 / ISO 27001 Hooks

Control coverage map (examples)

| Domain | Control Objective | Evidence Source |
|--------|-------------------|-----------------|
| Change Management | Approvals required; rollback procedures | Audit events Approval.*, Config.Published, manifests |
| Logical Access | Least privilege enforced | Role/permission exports, PDP Decision samples |
| Logging & Monitoring | Audit logging with integrity | Hash chains, signed manifests, collector configs |
| Encryption | Data at rest/in transit | KMS key inventory, TLS config exports |
| Backup & Recovery | Regular backups, restores tested | Backup job logs, restore runbooks & attestations |
| Vendor Management | Adapter access controls | Adapter binding audits, credential rotations |

Evidence automation

  • Weekly Evidence Pack job (per tenant, Enterprise): ZIP with
    • Signed audit manifest for the week
    • Access/role assignment CSV
    • Policy bundle digest + PDP cache etag
    • SLO dashboards (PDF exports)
    • Backup status & last restore drill summary
  • Webhooks: ecs.compliance.v1.EvidencePackReady for GRC integration.

PII Posture & Privacy

Principles

  • Minimize: audit stores metadata, never raw config values or secrets.
  • Pseudonymize: user identifiers stored as salted hashes (actor.subHash).
  • Masking: IPs truncated or anonymized per tenant policy; paths containing known PII patterns hashed.
  • Tagging: audit schema includes dataClass labels; exporters drop high-risk fields by default.

Data subject requests (DSAR)

  • DSAR search runs against audit metadata only; personal data is pseudonymized—responses include existence proofs without revealing sensitive content.
  • Erase: where law requires, erase user identifiers by rotating salt/mapper, preserving event integrity. Chain integrity is retained by erasing only derived PII fields, not event core.

Residency

  • Audit data written to region of tenant’s home region; exports enforce regional buckets.

Operational Schema & DDL (illustrative)

CREATE TABLE ecs.event_audit (
  tenant_id UUID NOT NULL,
  event_id UUID NOT NULL DEFAULT gen_random_uuid(),
  time TIMESTAMPTZ NOT NULL DEFAULT now(),
  actor_sub_hash STRING NOT NULL,
  actor_iss STRING NOT NULL,
  actor_type STRING NOT NULL,
  action STRING NOT NULL,
  resource_type STRING NOT NULL,
  resource_id STRING NULL,
  resource_path_hash STRING NULL,
  result STRING NOT NULL,
  diff_summary JSONB NULL,
  policy JSONB NULL,
  etag STRING NULL,
  version STRING NULL,
  prev_hash STRING NOT NULL,
  hash STRING NOT NULL,
  ttl_expires_at TIMESTAMPTZ NULL,
  correlation_id STRING NULL,
  crdb_region crdb_internal_region NOT NULL DEFAULT default_to_database_primary_region(gateway_region()),
  CONSTRAINT pk_event_audit PRIMARY KEY (tenant_id, time, event_id)
) LOCALITY REGIONAL BY ROW
  WITH (ttl = 'on',
        ttl_expiration_expression = 'ttl_expires_at',
        ttl_job_cron = '@hourly');

CREATE INDEX ix_audit_action ON ecs.event_audit (tenant_id, action, time DESC);
CREATE INDEX ix_audit_resource ON ecs.event_audit (tenant_id, resource_type, resource_id, time DESC);

Observability of the Audit Pipeline

  • Metrics: audit_events_total{result}, audit_chain_verify_failures_total, audit_export_jobs_running, audit_export_duration_ms, legal_holds_total.
  • Traces: ecs.audit.record, ecs.audit.export, ecs.audit.verify (linked to x-correlation-id).
  • Alerts:
    • Chain verification failures > 0 in 5m
    • Export job failures > 0 in 15m
    • Hot-tier backlog (TTL job lag) > 30m

Runbooks

RB-C1: Verify chain integrity for a tenant/day

  1. ecsctl audit verify --tenant t-123 --date 2025-08-20
  2. If fail: fetch daily manifest; recompute locally; compare rootHash.
  3. If mismatch persists: open SEV-2, freeze retention, snapshot hot partition, start forensics.

RB-C2: Respond to auditor request (SOC 2)

  1. Create Evidence Pack for period → POST /v1/audit/exports with sign=true.
  2. Attach role/permission export and change management approvals.
  3. Provide verification steps & public cert/JWKS link.

RB-C3: Legal hold

  1. Set tenant legalHold=true with ticket ID.
  2. Confirm TTL job skips held partitions; replicate warm data to hold bucket.
  3. Audit all actions with LegalHoldSet event.

Acceptance Criteria (engineering hand-off)

  • Audit events emitted for all enumerated domains with hash chain fields populated.
  • Row-level TTL and nightly Parquet exports with signed manifests (Ent).
  • Export/eDiscovery API implemented with query DSL, redaction presets, KMS encryption, and presigned downloads.
  • Role-gated access with SoD; all audit access audited.
  • Evidence Pack automation and webhook delivered.
  • ecsctl commands for verify, export, manifest show.
  • DSAR procedures documented; salt rotation mechanism implemented for pseudonymized identifiers.

Solution Architect Notes

  • Keep manifest signing feature-flagged for Pro; enforce for Enterprise tenants handling regulated workloads.
  • Consider Merkle tree roots per hour to enable partial range verification at scale.
  • Align export schemas with common GRC tools to avoid ETL (field names stable, enums documented).
  • Schedule quarterly integrity drills that verify a random sample of tenants and produce a signed attestation.

Cost Model & FinOps — storage/cache costs per tenant, egress, adapter costs, throttling strategies

Objectives

Define a transparent, meter-driven cost model and FinOps practices so ECS can:

  • Attribute platform costs per tenant (showback/chargeback).
  • Forecast and steer spend for storage, cache, egress, adapters, and observability.
  • Enforce edition-aware throttles and autoscaling that protect SLOs and budgets.

Cost Architecture (what we meter & how we allocate)

flowchart LR
  subgraph Meters
    RQ[Resolve Calls]
    WS[WS/LP Connection Hours]
    RBH[Redis Bytes-Hours]
    SBH["Storage Bytes-Hours (CRDB)"]
    EGR[Egress Bytes]
    EVT[Refresh Events]
    ADP["Adapter Ops (read/write/watch)"]
    KMS[KMS/Secrets Ops]
    OBS["Telemetry (traces/metrics/logs)"]
  end

  Meters --> UL["Usage Ledger (per-tenant, per-day)"]
  UL --> CE["Cost Engine (unit price table by region/provider)"]
  CE --> SB[Showback/Chargeback]
  CE --> BDG[Budgets & Alerts]
  CE --> FP["Forecasts (q/q)"]
Hold "Alt" / "Option" to enable pan & zoom

Usage Ledger (authoritative)

Per tenant/day we persist:

  • resolve_calls, resolve_egress_bytes
  • ws_connection_hours, longpoll_connection_hours
  • redis_bytes_hours (avg resident bytes × hours)
  • crdb_storage_bytes_hours (table + indexes)
  • snapshots_created, snapshot_bytes_exported
  • refresh_events_published, dlq_messages
  • adapter_ops_{provider}.{get|put|list|watch}, adapter_egress_bytes
  • kms_ops, secret_resolutions
  • otel_spans_ingested, metrics_series, logs_gb

All meters are derived from production telemetry; aggregation jobs roll up hourly → daily to bound cardinality.


Unit Price Table (configurable per region/provider)

| Meter | Unit | Key drivers | Example field names (in config) |
|-------|------|-------------|---------------------------------|
| CRDB storage | GB-hour | data + indexes | price.crdb.gb_hour[region] |
| Redis cache | GB-hour | memory footprint | price.redis.gb_hour[region] |
| Internet egress | GB | API payloads, WS traffic | price.egress.gb[region] |
| Event bus | 1K ops | publish + consume | price.bus.kops[region] |
| Adapter API | 1K ops | provider calls | price.adapter.{provider}.kops[region] |
| KMS/Secrets | 1K ops | decrypt/get | price.kms.kops[cloud], price.secrets.kops[cloud] |
| Observability | GB / span | logs/metrics/traces | price.obs.logs.gb, price.obs.traces.kspans |

Do not hardcode cloud list prices in code; load them from ops-config, per account/contract.


Per-Tenant Cost Formulas (parametric)

Let P_* be unit prices, and U_* the usage meters for tenant T in period D.

Cost_CRDB(T,D)   = U_storage_gb_hours * P_crdb_gb_hour
Cost_Redis(T,D)  = U_redis_gb_hours   * P_redis_gb_hour
Cost_Egress(T,D) = U_egress_gb        * P_egress_gb
Cost_Bus(T,D)    = (U_bus_publish + U_bus_consume) / 1000 * P_bus_kops
Cost_Adapter(T,D)= Σ_provider (U_adapter_ops_provider / 1000 * P_adapter_provider_kops)
Cost_KMS(T,D)    = U_kms_ops / 1000 * P_kms_kops + U_secret_gets / 1000 * P_secrets_kops
Cost_Obs(T,D)    = U_logs_gb * P_logs_gb + U_spans_k * P_traces_kspans + U_metrics_series * P_metrics_series
Total(T,D)       = Σ all components
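
The formulas transcribed into a sketch; usage and prices are illustrative dicts that would be loaded from the Usage Ledger and ops-config price tables in practice (the observability component is omitted here).

# Per-tenant cost sketch: apply the unit-price table to the Usage Ledger meters.
def tenant_cost(usage: dict, price: dict) -> dict:
    cost = {
        "crdb":   usage["storage_gb_hours"] * price["crdb_gb_hour"],
        "redis":  usage["redis_gb_hours"]   * price["redis_gb_hour"],
        "egress": usage["egress_gb"]        * price["egress_gb"],
        "bus":    (usage["bus_publish"] + usage["bus_consume"]) / 1000 * price["bus_kops"],
        "adapter": sum(ops / 1000 * price["adapter_kops"][provider]
                       for provider, ops in usage["adapter_ops"].items()),
        "kms":    usage["kms_ops"] / 1000 * price["kms_kops"]
                  + usage["secret_gets"] / 1000 * price["secrets_kops"],
    }
    cost["total"] = round(sum(cost.values()), 2)
    return cost

# Pro-tenant example matching the scenario later in this section (observability omitted).
usage = {"storage_gb_hours": 5760, "redis_gb_hours": 2160, "egress_gb": 28.6,
         "bus_publish": 1_000_000, "bus_consume": 1_000_000,
         "adapter_ops": {"azure-appconfig": 500_000}, "kms_ops": 0, "secret_gets": 0}
price = {"crdb_gb_hour": 0.01, "redis_gb_hour": 0.02, "egress_gb": 0.08,
         "bus_kops": 0.15, "adapter_kops": {"azure-appconfig": 0.50},
         "kms_kops": 1.0, "secrets_kops": 1.0}
print(tenant_cost(usage, price))   # total ≈ 653.09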

Attribution rules

  • Redis & CRDB bytes-hours: weighted by tenant_id partition sizes; Redis estimated from on-host key scans sampled each minute × hours.
  • Egress: sum response sizes at Gateway by tenant (compressed bytes on wire).
  • Adapter ops: counted at Adapter Hub (SPI calls); include retries (we also report retry rate to optimize).
  • Observability: charge low cardinality metrics as platform; high-volume logs/spans proportional to tenant action rate.

Unit Economics (starter baselines)

Define internal pricebook for showback and external chargeback (if applicable):

| Edition | Included monthly quota (guardrails) | Overage rate examples |
|---------|-------------------------------------|-----------------------|
| Starter | 5M resolve_calls, 5 GB egress, 1 GB-mo storage, 1 GB-mo Redis | per 1M resolves, per GB egress, per GB-mo storage/cache |
| Pro | 50M resolves, 50 GB egress, 10 GB-mo storage, 5 GB-mo Redis, 5M bus ops | tiered overage discounts |
| Enterprise | Custom commit, premium WS | contract-specific |

Quotas power shaping & alerts; they are not hard limits unless policy enforces.


Cost Dashboards & Budgets

Tenant Cost Overview

  • Cost by component (CRDB, Redis, Egress, Bus, Adapter, KMS, Obs)
  • Unit drivers (resolves, events, bytes) + efficiency KPIs:
    • cache_hit_ratio, avg_payload_bytes, retry_rate, propagation_lag
  • Budget progress bars and anomaly score (see below)

Platform FinOps

  • Cost per region, per edition, per service
  • Redis GB-hours vs hit ratio; CRDB GB-hours vs version growth
  • Egress by route; adapters ops by provider; top 10 hot tenants

Budgets & Alerts

  • Soft budget: 80% warn, 100% alert, 120% cap recommendation
  • Anomaly detection (day-over-day z-score) on: egress, adapter ops, logs GB
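
A minimal day-over-day z-score check of the kind referenced above; the 3-sigma threshold and 14-day window are assumptions.

# Cost-anomaly sketch: flag today's meter value if it is > 3 sigma from the
# trailing-window mean (egress GB, adapter ops, logs GB all use the same check).
from statistics import mean, pstdev

def is_anomalous(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return today != mu           # any deviation from a flat series is suspicious
    return abs(today - mu) / sigma > z_threshold

egress_gb_last_14_days = [27.1, 28.4, 26.9, 27.8, 28.0, 27.5, 28.2,
                          27.9, 28.6, 27.3, 28.1, 27.7, 28.3, 27.6]
print(is_anomalous(egress_gb_last_14_days, today=55.0))   # True -> raise budget alert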

Throttling & Cost-Shaping Strategies (edition-aware)

| Cost Driver | Strategy | Control Point |
|-------------|----------|---------------|
| Egress | ETag everywhere, gzip, gRPC compression, field projection (exclude unchanged) | Gateway/Resolve |
| Resolve QPS | Default long-poll; WS only for eligible tenants; increase poll interval under load | SDK feature flags |
| Redis GB-hours | Cap value size (≤128 KB), SWR ≤ 2 s, LRU policy, hash-tag by tenant | Registry/Redis |
| CRDB growth | TTLs on audit/diffs, archive exports to object store, semantic diffs to reduce payload | Registry/Jobs |
| Adapter ops | Batch list/put, backoff on 429, delta cursors, import windows | Adapter Hub |
| Observability | Sample traces (tail-based), logs sampling on INFO, keep structured metrics only | OTEL Collector |
| Bus ops | Coalesce refresh events per ETag & scope, suppress duplicates within 250 ms | Orchestrator |
| WS costs | Cap connections/tenant, idle timeouts, downgrade to long-poll when idle | WS Bridge |

Policy hooks

  • PDP can obligate tenants to reduce cost (e.g., raise poll interval) when over budget.

Example Tenant Scenarios (illustrative, not a quote)

Assume (for math only): P_redis_gb_hour=$0.02, P_crdb_gb_hour=$0.01, P_egress_gb=$0.08, P_bus_kops=$0.15, P_adapter_azapp_kops=$0.50.

Pro Tenant (monthly)

  • 30M resolves, avg payload 10 KB compressed, cache hit 90% → egress ≈ 3M * 10KB ≈ 28.6 GB
  • Redis avg resident 3 GB → 3 GB * 720 h = 2160 GB-h
  • CRDB storage avg 8 GB → 5760 GB-h
  • Bus ops 2M (publish+consume)
  • Adapter AzureAppConfig ops 500k

Costs

  • Redis: 2160 * 0.02 = $43.20
  • CRDB: 5760 * 0.01 = $57.60
  • Egress: 28.6 * 0.08 ≈ $2.29
  • Bus: 2,000k/1,000 * 0.15 = $300
  • Adapter: 500k/1,000 * 0.50 = $250
  • Subtotal ≈ $653 (+ observability overhead if tenant-heavy)

Actionables to reduce: raise hit ratio to 92%, coalesce refreshes, batch adapter writes.


Optimization Playbook (FinOps levers)

  1. Reduce payload size: prune keys, compress, project; target ≤12 KB p50.
  2. Increase cache hits: ensure SDKs default long-poll/WS + ETag; fix chatty clients.
  3. Right-size Redis: measure redis_bytes_hours; add shards only if latency p95 > 3 ms or evictions > 0.
  4. Trim storage: enforce TTLs on diffs/audit (per edition), export old versions to Parquet.
  5. Adapter efficiency: switch to delta sync; widen polling intervals; schedule imports in off-peak windows.
  6. Observability costs: cap log verbosity, keep histograms with exemplars; drop high-cardinality labels.
  7. Network: prefer in-region access (avoid x-region egress); push WS only for active namespaces.

Governance: Showback/Chargeback

  • Showback: monthly PDF/CSV per tenant with:
    • Usage meters, effective unit prices, total by component, trend chart, anomaly notes.
  • Chargeback (optional): map to SKU:
    • CFG-READ (per 1M resolves), CFG-STOR (per GB-mo), CFG-CACHE (per GB-mo), CFG-EGR (per GB), CFG-ADP-{provider} (per 1K ops).
  • Contract hooks: Enterprise tenants can pre-commit capacity (discounted) with alerts if sustained > 120% for 3 days.

Automation & Data Flow

  • Tagging/Labels: all resources include cs:tenant, cs:service, cs:env, cs:edition.
  • Export jobs push Usage Ledger to the cost platform (per cloud): CUR/BigQuery/Billing export + our enrichment.
  • Forecasting: Holt-Winters on resolves/egress to project 30/90-day spend; include seasonality from release cadence.

Cost-Anomaly Runbooks

RB-F1 Egress Spike

  1. Dashboard → per-route egress; find tenant & path.
  2. Check 304 rate; if low, investigate missing ETags or payload bloat.
  3. Temporarily raise client poll interval; enable field projection.

RB-F2 Adapter Op Surge

  1. Identify provider & binding; check throttle events.
  2. Increase batch size if under cap; otherwise backoff and schedule window.
  3. Notify tenant; lock dual-write if drift source identified.

RB-F3 Redis GB-hours Up

  1. Inspect top key sizes & TTL; enforce value size cap.
  2. Increase SWR window to 2s; review pinning policies.
  3. If still high, move cold namespaces to document aggregation (fewer large values, fewer keys).

RB-F4 Observability Overrun

  1. Lower log sampling on INFO; ensure tail-based tracing active.
  2. Drop unused metrics; dedupe labels; compress logs.
  3. Add budget guard at Collector; alert if exceeded.

HPA/KEDA & FinOps Coupling

  • Scale on business metrics (in-flight requests, lag) not raw CPU only.
  • Scale-down protection when budgets are healthy and latency SLOs are green; avoid oscillation that increases cost via cache churn.
  • Scheduled scale for known peaks (releases) to reduce reactive over-scale.

Acceptance Criteria (engineering hand-off)

  • Usage Ledger schema implemented; daily rollups persisted and queryable per tenant.
  • Cost Engine reads regional price tables; produces per-tenant daily cost with component breakdown.
  • Dashboards: Tenant Cost, Platform FinOps, Anomalies; budgets & alerts wired.
  • Edition quotas and PDP obligations to shape high-cost behaviors (poll interval, WS eligibility).
  • Showback export (CSV/PDF) and API endpoints for /billing/usage.
  • Runbooks RB-F1..RB-F4 published; anomaly jobs (z-score) live.
  • Unit tests validate meters vs traces/metrics; backfills for late-arriving telemetry.

Solution Architect Notes

  • Keep unit prices externalized; treat FinOps math as configuration, not code.
  • Revisit pricebook SKUs quarterly; align with actual cloud invoices.
  • Consider per-tenant Redis only for Enterprise with extreme isolation needs; otherwise hash-tag + ACLs suffice.
  • Evaluate request-level compression dictionaries for large JSON sections if payloads dominate egress.
  • Add cost SLOs (e.g., $/1M resolves target) to drive continuous efficiency without harming latency SLOs.

Deployment Topology — AKS clusters, namespaces, regions, multi-AZ, blue/green & canary patterns

Objectives

Specify how ECS is deployed on Azure Kubernetes Service (AKS) with multi-region, multi-AZ resilience; enforce clean isolation via namespaces; and standardize progressive delivery (blue/green & canary) for services and the Studio UI, aligned to data residency and tenant tiers.


High-Level Topology

flowchart LR
  AFD[Azure Front Door + WAF] --> EG["Envoy Gateway (AKS)"]
  EG --> YARP[YARP BFF]
  EG --> REG[Config Registry]
  EG --> PDP[Policy Engine]
  EG --> ORC[Refresh Orchestrator]
  EG --> WS[WS/LP Bridge]
  EG --> HUB[Provider Adapter Hub]
  REG <--> CRDB[(CockroachDB Multi-Region)]
  REG <--> REDIS[(Redis Cluster)]
  ORC <--> ASB[(Azure Service Bus)]
  HUB <--> Providers[(Azure/AWS/Consul/Redis/SQL)]
  subgraph "AKS Cluster (per region)"
    EG
    YARP
    REG
    PDP
    ORC
    WS
    HUB
  end
Hold "Alt" / "Option" to enable pan & zoom

Edge & DNS

  • Azure Front Door (AFD) + WAF terminates TLS, performs geo-routing, and health probes.
  • Envoy Gateway in AKS handles per-route authn/authz, RLS, and traffic splitting for rollouts.

Environments, Clusters & Namespaces

| Environment | Cluster Strategy | Namespaces (examples) | Notes |
|-------------|------------------|-----------------------|-------|
| dev | 1 AKS per region (cost-optimized) | dev-system, dev-ecs, dev-ops | Lower SLOs; spot where safe |
| staging | 1 AKS per region | stg-ecs, stg-ops | Mirrors prod; synthetic traffic |
| prod | 2–3 AKS clusters per geo (one per region) | prod-ecs, prod-ops, prod-ecs-blue, prod-ecs-green | Blue/green via namespace swap |

Namespace policy

  • NetworkPolicies deny-all by default; allow only service-to-service with SPIFFE IDs.
  • Separate ops namespace for collectors, Argo CD/Rollouts, KEDA, Prometheus.

Regions & Data Residency

  • Minimum 2 regions per geo (e.g., westeurope + northeurope) in active-active for stateless services; CockroachDB spans both with REGIONAL BY ROW (tenant home region pinned).
  • Tenants tagged with homeRegion; AFD routes to nearest allowed region (residency enforced).

Multi-AZ & Scheduling Policy

AKS nodepools

  • np-system (DSv5) for control & gateways (taints: system=true:NoSchedule).
  • np-services (DSv5) for app pods (balanced across zones 1/2/3).
  • np-memq (Memory-optimized) for Redis if self-managed (recommend Azure Cache for Redis).
  • np-bg for blue/green surges (autoscaled on demand).

Workload policies

  • topologySpreadConstraints: zones 1/2/3, max skew 1.
  • PodDisruptionBudget: minAvailable 70% for stateless, 60% for WS Bridge.
  • nodeAffinity: pin CRDB/Redis clients to low-latency pools.
  • zone-aware readiness: Envoy routes only to pods Ready in the same zone by default.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector: { matchLabels: { app: ecs-registry } }
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector: { matchLabels: { app: ecs-registry } }
          topologyKey: kubernetes.io/hostname

Data Plane Deployments

CockroachDB (managed/self-hosted)

  • Multi-region cluster; leaseholders localized to tenant home region; REGIONAL BY ROW tables.
  • Zone redundancy: node groups across AZs 1/2/3 per region; PDB + podAntiAffinity.
  • Backups to region-local blob storage; cross-region async copies nightly.

Redis

  • Prefer Azure Cache for Redis (Premium/Enterprise) with zone redundancy.
  • For self-managed: Redis Cluster with 3 primaries × 3 replicas per region; AOF persistence with appendfsync everysec.
  • Key hash-tagging {tenant} to preserve per-tenant isolation.

Azure Service Bus

  • Premium namespace per geo; consumer groups per service; ForwardTo DLQ and parking-lot queues.

Delivery Patterns — Blue/Green & Canary

Service Blue/Green (namespace-based)

  • Each prod cluster hosts both prod-ecs-blue and prod-ecs-green.
  • AFD → Envoy splits traffic by header/cookie or percentage to Services in the target namespace.
  • Promote by flipping Envoy HTTPRoute/TrafficSplit to Green and preserving Blue for instant rollback.
# Envoy Gateway (Gateway API) HTTPRoute snippet
kind: HTTPRoute
spec:
  rules:
    - matches: [{ path: { type: PathPrefix, value: /api/ }}]
      filters:
        - type: RequestHeaderModifier # inject tenant headers if needed
      backendRefs:
        - name: ecs-registry-svc-green
          weight: 20
        - name: ecs-registry-svc-blue
          weight: 80

Canary (Argo Rollouts)

  • Strategy: 5% → 25% → 50% → 100% with Analysis between steps using Prometheus queries:
    • http_request_duration_ms{route="/resolve",quantile="0.99"} < 400
    • http_requests_error_ratio{route="/resolve"} < 0.5%
    • pdp_decision_ms_p99 < 20
  • Rollbacks on analysis failure; Blue remains untouched.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 3m }
        - analysis: { templates: [{ templateName: ecs-slo-checks }] }
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100

Studio & Static Assets

  • Blue/green buckets (cdn-blue, cdn-green) behind AFD; swap origin on promote; immutable asset hashes prevent cache poison.

Configuration Cutover (Runtime)

  • Use alias flip (prod-current → vX.Y.Z) independent of service rollout (see Migration section).
  • Coordinate service canary with config canary for risky changes.

Autoscaling & Surge Capacity

  • HPA for services (in-flight requests, CPU) with zone-balanced scale out.
  • KEDA for event consumers (Service Bus queue depth; propagation lag).
  • Maintain surge nodepool (np-bg) with cluster autoscaler max surge to absorb blue/green double capacity.

GitOps & Promotions

  • Argo CD per cluster watches environment repos: apps/dev, apps/stg, apps/prod.
  • Promotions are PR-driven:
    1. Build → sign image (cosign) → update Helm chart values in stg.
    2. Run smoke + e2e; Argo Rollouts canary passes gates.
    3. Promote to prod-ecs-green; automated canary.
    4. Flip Envoy weights to 100% Green; archive Blue after bake.

Security & Secrets in Topology

  • Azure Workload Identity for AKS ↔ AAD; pod-level identities fetch secrets via AKV CSI Driver.
  • mTLS mesh (SPIFFE IDs) between services; Envoy ext_authz to PDP.
  • Per-namespace NetworkPolicies and Azure NSGs block lateral movement.

DR & Failover

| Scenario | Action | RTO/RPO |
|----------|--------|---------|
| Single AZ loss | Zone spread + PDB continue service | 0 / 0 |
| Single region impairment | AFD shifts to healthy region; CRDB serves local rows; WS/long-poll reconnect | RTO ≤ 5 min / RPO 0 |
| CRDB region outage | Surviving region serves all read requests; writes for tenants pinned to failed region are throttled unless policy allows re-pin | RTO ≤ 15 min / RPO ≤ 5 min (if re-pin) |

Runbook hooks

  • Toggle read-only per tenant if their home region is down; optional temporary re-pin with audit.

Reference Manifests (snippets)

PodDisruptionBudget — Registry

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: pdb-ecs-registry, namespace: prod-ecs-green }
spec:
  minAvailable: 70%
  selector: { matchLabels: { app: ecs-registry } }

HPA — WS Bridge (connections)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: hpa-ws-bridge, namespace: prod-ecs-green }
spec:
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: ws_active_connections
        target:
          type: AverageValue
          averageValue: "4000"

NetworkPolicy — namespace default deny

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: prod-ecs-green }
spec:
  podSelector: {}
  policyTypes: ["Ingress","Egress"]

Observability for Rollouts

  • Rollout dashboards: error ratio, p95/p99 latency, WS reconnects, policy denials; burn-rate tiles during canary.
  • Exemplars link from latency histograms to canary pod traces.
  • AFD health + per-region SLIs visible to on-call; alerts wired to promotion pipeline.

Acceptance Criteria (engineering hand-off)

  • AKS clusters provisioned in 2+ regions, 3 AZs each; nodepools & taints in place.
  • Namespaces created with deny-all NetworkPolicies, SPIFFE/mTLS configured.
  • Envoy Gateway and Gateway API configured with traffic split; Argo Rollouts installed and integrated with Prometheus SLO checks.
  • Blue/green namespaces (prod-ecs-blue, prod-ecs-green) workable; promotion playbooks done.
  • HPA/KEDA policies committed for Registry, WS Bridge, Orchestrator, Adapter Hub.
  • AFD/WAF routes implemented with region & residency rules; health probes tied to Envoy readiness.
  • DR drills documented: AZ loss, region failover, CRDB re-pin; measured RTO/RPO recorded.

Solution Architect Notes

  • Prefer managed Redis and managed CockroachDB where available to simplify AZ operations.
  • Keep AFD origin groups per region to avoid cross-region hairpin under partial failures.
  • For high-value tenants, offer per-tenant canary labels (AFD header) to scope early traffic safely.
  • Consider Gateway API + Envoy advanced LB for zone-local routing to reduce cross-AZ latency and cost.

CI/CD & IaC — repo layout, pipelines, artifact signing, Helm/Bicep/Pulumi, env promotion policies

Objectives

Provide a secure-by-default, GitOps-first delivery system for ECS that:

  • Standardizes repo layout across services, SDKs, adapters, and infra.
  • Ships reproducible builds with SBOM, signatures, and provenance.
  • Uses Helm (K8s), Bicep (Azure), and optional Pulumi (multi-cloud/app infra).
  • Enforces environment promotion policies (SoD, approvals, change windows) integrated with the Policy Engine.

Repository Strategy

Repos (hybrid)

  • ecs-platform (monorepo): services (Registry, Policy, Orchestrator, WS Bridge, Adapter Hub), Studio BFF/UI, libraries.
  • ecs-charts: Helm charts & shared chart library.
  • ecs-env: GitOps environment manifests (Argo CD app-of-apps), per region/env.
  • ecs-infra: Azure infra (Bicep modules), optional Pulumi program for cross-cloud.
  • ecs-sdks: .NET / JS / Mobile SDKs.
  • ecs-adapters: provider adapters (out-of-process plugins), conformance tests.

Rationale: app code evolves rapidly (monorepo aids refactors); environment state and cloud infra are separated with stricter review controls.

Monorepo layout (ecs-platform)

/services
  /registry
  /policy
  /orchestrator
  /ws-bridge
  /adapter-hub
/libs
  /common (logging, OTEL, resiliency, auth)
  /contracts (gRPC/proto, OpenAPI)
/studio
  /bff
  /ui
/tools
  /build (pipelines templates, scripts)
  /dev (local compose, kind)
/.woodpecker | /.github | /azure-pipelines (CI templates)
/VERSION (semantic source of truth)

Build & Release Pipelines (templatized)

Pipeline stages (per service)

  1. Prepare: detect changes (path filters), restore caches (.NET/Node).
  2. Build: compile, unit tests, lint, license scanning.
  3. Security: SAST, dep scan (Dependabot/Snyk), container scan (Trivy/Grype).
  4. Package: container build (BuildKit), SBOM (Syft), provenance (SLSA Level 3+).
  5. Sign: cosign keyless (OIDC) or KMS-backed; attach attestations.
  6. Push: ACR (or GHCR) by immutable digest only.
  7. Deploy dev: update ecs-env/dev via PR (Argo CD sync).
  8. E2E/Contracts: run k6/gRPC contracts in dev namespace.
  9. Promote: open PRs to staging and then prod with gates (below).
flowchart LR
  A[Commit/PR]-->B[CI Build+Test]
  B-->C[Security Scans]
  C-->D[Image+SBOM+Provenance]
  D-->E[Cosign Sign]
  E-->F[Push to ACR (by digest)]
  F-->G[PR to ecs-env/dev]
  G-->H[ArgoCD Sync + E2E]
  H-->I{Gates Pass?}
  I--Yes-->J[PR to ecs-env/staging -> canary]
  J-->K[PR to ecs-env/prod -> blue/green]
  I--No-->X[Fail + Rollback]
Hold "Alt" / "Option" to enable pan & zoom

Example: GitHub Actions template (service)

name: ci-service
on:
  push: { paths: ["services/registry/**", ".github/workflows/ci-service.yml"] }
  pull_request: { paths: ["services/registry/**"] }
env:
  IMAGE: ghcr.io/connectsoft/ecs/registry
jobs:
  build:
    runs-on: ubuntu-22.04
    permissions:
      id-token: write     # OIDC for cosign keyless
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with: { dotnet-version: "8.0.x" }
      - uses: actions/cache@v4
        with:
          path: ~/.nuget/packages
          key: nuget-${{ runner.os }}-${{ hashFiles('**/*.csproj') }}
      - run: dotnet build services/registry -c Release
      - run: dotnet test services/registry -c Release --collect:"XPlat Code Coverage"
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with: { registry: ghcr.io, username: "${{ github.actor }}", password: "${{ secrets.GITHUB_TOKEN }}" }
      - uses: sigstore/cosign-installer@v3
      - name: Build & push image (digest is authoritative)
        run: |
          docker buildx build -t $IMAGE:$(git rev-parse --short HEAD) \
            --build-arg VERSION=${{ github.ref_name }} \
            -f services/registry/Dockerfile --provenance=true --sbom=true \
            --push --metadata-file build-meta.json .
          echo "DIGEST=$IMAGE@$(jq -r '."containerimage.digest"' build-meta.json)" >> "$GITHUB_ENV"
      - name: SBOM (Syft)
        uses: anchore/sbom-action@v0
        with:
          image: ${{ env.DIGEST }}
          registry-username: ${{ github.actor }}
          registry-password: ${{ secrets.GITHUB_TOKEN }}
          format: spdx-json
          output-file: sbom.json
      - name: Sign & attest (cosign keyless)
        run: |
          # buildx --provenance already attached SLSA provenance; attest the SBOM explicitly
          cosign sign --yes "$DIGEST"
          cosign attest --yes --type spdxjson --predicate sbom.json "$DIGEST"
      - name: Open PR to ecs-env/dev
        run: ./tools/build/update-image.sh ecs-env dev registry "$DIGEST"

Tagging scheme

  • Source: vX.Y.Z (SemVer) in /VERSION
  • Image: digest authoritative; tags for traceability: vX.Y.Z, sha-<7>
  • Chart: appVersion: vX.Y.Z, version: vX.Y.Z+build.<sha7>

Artifact Integrity: SBOM, Signing, Admission

  • SBOM: SPDX JSON produced at build; attached to image and stored in release assets.
  • Signing: cosign keyless (Fulcio) by default; Enterprise can use KMS-backed keys.
  • Provenance: SLSA level 3/3+ with GitHub OIDC or Azure Pipeline OIDC attestations.
  • Cluster admission: Kyverno/Gatekeeper policy requires:
    • image pulled by digest,
    • valid cosign signature from trusted issuer,
    • SBOM label present (org.opencontainers.image.sbom),
    • non-root, read-only FS, signed Helm chart (optional, helm provenance).

Example Kyverno snippet

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: verify-signed-images }
spec:
  rules:
    - name: check-cosign
      match: { resources: { kinds: ["Pod"] } }
      verifyImages:
        - imageReferences: ["ghcr.io/connectsoft/ecs/*"]
          attestors:
            - entries:
                - keyless:
                    issuer: "https://token.actions.githubusercontent.com"
                    subject: "repo:connectsoft/ecs-platform:*"
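
A companion rule (sketch) could enforce the digest-pinning requirement from the list above; it would sit under spec.rules of the same ClusterPolicy:

    - name: require-digest
      match: { resources: { kinds: ["Pod"] } }
      validate:
        message: "Images must be referenced by digest."
        pattern:
          spec:
            containers:
              - image: "*@sha256:*"   # every container image must carry a digest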

Helm Delivery Model

  • ecs-charts contains:
    • charts/ecs-svc (library chart with HPA, PDB, ServiceMonitor, NetworkPolicy, PodSecurity)
    • charts/registry, charts/policy, charts/orchestrator, etc. (wrap library chart)
  • Values layering:
    • values.yaml (defaults)
    • values.dev.yaml / values.stg.yaml / values.prod.yaml (env)
    • values.region.yaml (region overrides)
  • Argo CD renders Helm with env/region values from ecs-env repo; blue/green namespaces are separate Application objects.

Chart values (excerpt)

image:
  repository: ghcr.io/connectsoft/ecs/registry
  digest: "sha256:abcd..."        # immutable
resources:
  requests: { cpu: "250m", memory: "512Mi" }
  limits:   { cpu: "1500m", memory: "1.5Gi" }
hpa:
  enabled: true
  targetInFlight: 400
pdb:
  minAvailable: "70%"
otel:
  enabled: true
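
Argo CD Application (sketch): one possible shape for wiring the layering above as a multi-source Application, where the chart comes from ecs-charts and env/region values come from ecs-env. Repo hosting, project, paths, and branch names are assumptions:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata: { name: registry-prod-green-weu, namespace: argocd }
spec:
  project: ecs
  sources:
    - repoURL: https://github.com/connectsoft/ecs-charts
      path: charts/registry
      targetRevision: main
      helm:
        valueFiles:
          - values.yaml
          - $env/prod/weu/registry/values.prod.yaml      # env overlay from ecs-env
          - $env/prod/weu/registry/values.region.yaml    # region overlay from ecs-env
    - repoURL: https://github.com/connectsoft/ecs-env
      targetRevision: main
      ref: env
  destination:
    server: https://kubernetes.default.svc
    namespace: prod-ecs-green
  syncPolicy:
    automated: { prune: true, selfHeal: true }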

Azure IaC with Bicep (ecs-infra)

  • Modules per domain: network, aks, acr, afd, asb, redis, kv, monitor.
  • Stacks per environment/region: prod-weu, prod-neu, stg-weu, etc.

Bicep structure

/bicep
  /modules
    aks.bicep
    acr.bicep
    afd.bicep
    asb.bicep
    redis.bicep
    kv.bicep
  /stacks
    prod-weu.bicep
    prod-neu.bicep
    stg-weu.bicep

Snippet: AKS with Workload Identity + CSI KeyVault

module aks 'modules/aks.bicep' = {
  name: 'aks-weu'
  params: {
    clusterName: 'ecs-aks-weu'
    location: 'westeurope'
    workloadIdentity: true
    nodePools: [
      { name: 'system', vmSize: 'Standard_D4s_v5', mode: 'System', zones: [1,2,3] }
      { name: 'services', vmSize: 'Standard_D8s_v5', count: 6, zones: [1,2,3] }
    ]
    addons: { omsAgent: true, keyVaultCsi: true }
  }
}

Policy as Code

  • Bicep lint + OPA/Conftest gate: deny public IPs on private services, enforce HTTPS, AKS RBAC, diagnostic settings, encryption.

Drift & Cost

  • Nightly what-if with approval gates.
  • Infracost checks on PR for FinOps visibility.
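
Sketch of the nightly what-if job, assuming a resource-group-scoped stack and OIDC federated credentials configured for the repo; the resource group name is an assumption:

name: infra-drift-check
on:
  schedule:
    - cron: "0 2 * * *"     # nightly
  workflow_dispatch: {}
jobs:
  what-if:
    runs-on: ubuntu-22.04
    permissions: { id-token: write, contents: read }
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - name: What-if on prod-weu stack
        run: |
          az deployment group what-if \
            --resource-group rg-ecs-prod-weu \
            --template-file bicep/stacks/prod-weu.bicep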

Pulumi Option (app/edge infra or multi-cloud)

When multi-cloud or richer composition is needed, Pulumi TS/Go program can orchestrate:

  • AFD routes, DNS, CDN origins,
  • Argo CD apps (via Kubernetes provider),
  • Cross-cloud Secrets and KMS bindings.

Pulumi TS excerpt

import * as k8s from "@pulumi/kubernetes";
const appNs = new k8s.core.v1.Namespace("prod-ecs-green", {
  metadata: { name: "prod-ecs-green" },
});
// helm.v3.Release resolves OCI charts directly from the registry path
new k8s.helm.v3.Release("registry", {
  chart: "oci://ghcr.io/connectsoft/ecs-charts/registry",
  namespace: appNs.metadata.name,
  values: { image: { digest: process.env.IMAGE_DIGEST } },
});

Choice: Bicep for Azure account-level infra; Pulumi for higher-level orchestration (optional).


Environment Promotion Policies (gates)

Policy Integration

  • Stg→Prod promotions require PDP decide(operation=deploy.promote):
    • SoD enforced (author ≠ approver),
    • Risk score below threshold or 2 approvals,
    • Change window active for prod,
    • Evidence: test results, SLO canary checks green.
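
For illustration only, the promotion gate's PDP call might carry a payload shaped like the following; field names and placeholders are assumptions, and the actual contract is owned by the Policy Engine:

# Hypothetical decide(operation=deploy.promote) exchange
request:
  operation: deploy.promote
  subject: { user: release-bot, onBehalfOf: "<author>" }
  resource: { service: registry, from: staging, to: prod, digest: "sha256:abcd..." }
  context:
    author: "<author>"
    approvers: ["<approver>"]          # SoD: must differ from author
    riskScore: 23
    changeWindowActive: true
    evidence: { e2e: pass, sloCanary: pass }
response:
  decision: permit
  obligations:
    - requireSecondApproval: false     # would flip to true above the risk threshold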

Automation

  • PR from bot updates ecs-env staging → triggers Argo Rollouts canary.
  • Upon success, bot opens prod PR with:
    • Helm value digest pin,
    • Rollout strategy set (weights, analysis templates),
    • Policy check job posts PDP outcome in PR status.

Required checks on prod PR

  • E2E suite ✅
  • SLO canary analysis ✅
  • Policy PDP decision ✅
  • Security attestations (cosign verify, SBOM present) ✅
  • Manual approval (Approver role) ✅

Change freeze

  • prod branch protected by freeze label; PDP enforces schedule; override requires break-glass with incident ticket.

Preview Environments

  • On PR, create ephemeral namespace pr-<nr> with limited quotas:
    • Deploy changed services + Studio UI, seeded with masked sample data.
    • TTL controller cleans after merge/close.
    • URL: https://pr-<nr>.dev.ecs.connectsoft.io.
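
Sketch of the ephemeral namespace scaffold; the quota figures, label, and TTL annotation key are assumptions (the TTL controller would read the annotation):

apiVersion: v1
kind: Namespace
metadata:
  name: pr-<nr>                                    # placeholder filled in by the PR automation
  labels: { ecs.connectsoft.io/preview: "true" }
  annotations: { ecs.connectsoft.io/ttl: "72h" }
---
apiVersion: v1
kind: ResourceQuota
metadata: { name: preview-quota, namespace: pr-<nr> }
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    pods: "30"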

Rollback Strategy

  • Service rollback: revert env manifest to previous digest; Argo CD sync; keep blue namespace warm.
  • Config rollback: flip alias to previous snapshot (constant-time).
  • Automated: if canary SLOs fail, Rollouts auto-rollback and block prod PR.
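
A compact Argo Rollouts sketch of the canary-with-analysis behavior described above; the Prometheus address, query, and success condition are assumptions:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: slo-error-ratio, namespace: prod-ecs-green }
spec:
  metrics:
    - name: error-ratio
      interval: 1m
      failureLimit: 1
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="ecs-registry",code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{app="ecs-registry"}[5m]))
---
# Rollout canary strategy excerpt referencing the template
strategy:
  canary:
    steps:
      - setWeight: 5
      - analysis: { templates: [{ templateName: slo-error-ratio }] }
      - setWeight: 25
      - pause: { duration: 10m }
      - setWeight: 50
      - pause: { duration: 10m }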

Secrets & Credentials in CI/CD

  • CI uses OIDC workload identity to:
    • obtain short-lived push token for ACR,
    • request cosign keyless cert,
    • fetch non-production secrets from KV via federated identity.
  • No long-lived secrets in CI; repo has secret scanning enforced.

Observability of Delivery

  • Pipelines emit OTEL spans: ecs.ci.build, ecs.ci.scan, ecs.cd.sync, ecs.cd.promote.
  • Metrics: build duration, success rate, MTTR for rollback, mean lead time.
  • Dashboards: Delivery (DORA) + Supply Chain (signature verification %, SBOM coverage).

Acceptance Criteria (engineering hand-off)

  • Repos created with scaffolds & templates; path-filtered CI in place.
  • CI templates produce signed images, SBOM, SLSA provenance, push by digest.
  • Kyverno/Gatekeeper admission verifying signatures & digest pinning.
  • ecs-charts library and per-service charts published; ecs-env Argo apps wired for dev/stg/prod (blue/green).
  • Bicep modules and environment stacks committed; what-if + Conftest on PR; Infracost enabled.
  • Promotion gates integrated with PDP decisions, SLO canary checks, and manual approvals.
  • Preview environments auto-spawn for PRs; TTL cleanup automated.
  • Runbooks for failed canary, admission rejection, rollback, infra drift.

Solution Architect Notes

  • Keep digest-only deploys non-negotiable; tags for humans, digests for machines.
  • Prefer keyless signing to reduce key mgmt overhead; provide KMS fallback for regulated tenants.
  • Centralize pipeline templates; services import them with minimal YAML.
  • Consider Argo CD Image Updater only for lower environments; prod should remain PR-driven with explicit digests.
  • Reassess scan noise quarterly; failing the build on high CVSS after a grace period keeps the supply chain healthy.

Operational Runbooks — on-call, incident playbooks, hotfix flow, config rollback drill, RTO/RPO

On-Call Model

Coverage & Roles

  • 24×7 follow-the-sun rotations per geo (AMER / EMEA / APAC) with a global Incident Commander (IC) pool.
  • Roles per incident:
    • Incident Commander (IC) — owns timeline/decisions, delegates.
    • Ops Lead — drives technical mitigation (AKS, Envoy, Redis, CRDB).
    • Service SME — Registry/PDP/Orchestrator/Adapters/Studio.
    • Comms Lead — customer/internal comms; status page.
    • Scribe — live notes, artifacts, evidence pack.
    • Security On-Call — joins if policy/access/secrets involved.

Handoff Checklists

Start of Shift

  • Open NOC dashboard, confirm SLO tiles green for Resolve/Publish/PDP.
  • Verify Pager healthy; run “test page” to self (silent).
  • Confirm last handoff notes + open actions.
  • Review scheduled change windows and release plans.

End of Shift

  • Update handoff doc with:
    • Open incidents, mitigations, remaining risk.
    • Any temporary overrides (rate limits, feature flags).
    • Pending hotfix/canary states.

Severity & Response Targets

| SEV | Definition | Examples | Target TTA | Target TTR |
| --- | --- | --- | --- | --- |
| SEV-1 | Major outage / critical SLO breach | Resolve 5xx > 5%, region down, data integrity risk | ≤ 5m | ≤ 60m |
| SEV-2 | Partial degradation / single-tenant severe | 429 rate for top tenant > 10%, propagation lag p95 > 15s | ≤ 10m | ≤ 4h |
| SEV-3 | Minor impact / at risk | Canary failing, rising error trend, DLQ growth | ≤ 30m | ≤ 24h |

IC may escalate/downgrade; Security events can be SEV-1 regardless of scope.


Standard Incident Playbook (applies to all)

  1. Declare severity, assign roles, open incident channel & ticket.
  2. Stabilize (protect SLOs): enable rate-limits, switch to long-poll, open breakers per-tenant before global, serve stale data if allowed.
  3. Diagnose via run-of-show:
    • Check golden signals dashboards (error %, p95/p99 latency, throughput).
    • Jump to exemplar traces from SLO tiles.
    • Inspect recent deploys (Argo/PR timestamps), feature flags, policy changes.
  4. Mitigate using scenario runbooks below.
  5. Communicate:
    • Internal: every 15m or on change of state.
    • External: initial notice ≤ 15m for SEV-1/2; updates every 30–60m.
  6. Recover to steady state; back out temporary overrides.
  7. Close with preliminary impact, customer list, follow-ups.
  8. Post-Incident Review within 3–5 business days (template at end).

Scenario Runbooks (diagnostics ➜ actions ➜ exit)

RB-S1 Resolve Latency / Error Spike

Trigger: p99 Resolve > 400 ms or error% > 1% (5m).

Diagnostics

  • Dashboards: Resolve, Redis, CRDB, Gateway.
  • Check cache_hit_ratio, coalesce_ratio, waiter_pool_size.
  • Compare last deploy & policy changes.

Actions

  • Raise in-flight HPA target temporarily (+25–50% pods).
  • If Redis p95>3 ms or evictions>0: scale shards; enable SWR 2s.
  • If hot tenant: lower tenant RPS; increase poll interval via policy obligation.
  • If WS unstable: downgrade to long-poll.

Exit

  • p99 ≤ 250–300 ms for ≥ 10m; error% ≤ 0.2%; roll back any temporary throttles gradually.
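
As a sketch, the RB-S1 trigger can be encoded as Prometheus alert rules wired to paging; metric names and label conventions are assumptions and may differ from the real SLO recording rules:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: ecs-resolve-slo, namespace: monitoring }
spec:
  groups:
    - name: ecs-resolve
      rules:
        - alert: ResolveLatencyP99High
          expr: histogram_quantile(0.99, sum(rate(ecs_resolve_duration_seconds_bucket[5m])) by (le)) > 0.4
          for: 5m
          labels: { severity: page, runbook: RB-S1 }
          annotations: { summary: "Resolve p99 above 400 ms for 5m" }
        - alert: ResolveErrorRateHigh
          expr: sum(rate(ecs_resolve_errors_total[5m])) / sum(rate(ecs_resolve_requests_total[5m])) > 0.01
          for: 5m
          labels: { severity: page, runbook: RB-S1 }
          annotations: { summary: "Resolve error ratio above 1% for 5m" }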

RB-S2 Propagation Lag / DLQ Growth

Trigger: propagation p95>8 s (15m) or DLQ>100 (10m).

Diagnostics

  • Orchestrator consumer lag, bus quotas, adapter_throttle_total.

Actions

  • Scale KEDA consumers; increase batch sizes.
  • Replay DLQ after confirming transient errors.
  • If adapter throttled (Azure/AWS): lower batch per binding, increase backoff to 60 s.
  • Limit publish waves (canary ➜ regional).

Exit

  • Lag p95 ≤ 5 s; DLQ drained to steady baseline; no re-enqueue loops.
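
For reference, the consumer scaling mentioned above could be expressed as a KEDA ScaledObject; the queue name, thresholds, and TriggerAuthentication name are assumptions:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: orchestrator-consumer, namespace: prod-ecs-green }
spec:
  scaleTargetRef: { name: ecs-orchestrator }
  minReplicaCount: 2
  maxReplicaCount: 40
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: config-changed
        messageCount: "500"          # scale out when backlog per replica exceeds this
      authenticationRef: { name: asb-workload-identity }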

RB-S3 PDP / AuthZ Degradation

Trigger: PDP decision p99>20 ms, timeouts at edge, spike in denied decisions.

Diagnostics

  • PDP latency and cache hits, policy bundle updates, gateway ext_authz errors.

Actions

  • Scale PDP; warm cache by preloading effective policies.
  • If PDP unreachable: edge uses last-good TTL ≤ 60 s; deny admin ops.
  • Roll back recent policy bundle if regression suspected.

Exit

  • p95 ≤ 5 ms; no authn/z timeouts; revert temp TTLs.

RB-S4 Region Impairment

Trigger: AFD probes failing for region; rising cross-zone errors.

Actions

  • Fail traffic to healthy region via AFD; pause deploys.
  • Switch affected tenants to read-only if their CRDB home region is down.
  • Consider temporary re-pin of tenant data (follow CRDB playbook).

Exit

  • Region restored; backroute to local; resume deploys after smoke tests.

RB-S5 Adapter / Provider Throttling

Trigger: adapter_throttle_total>0, provider 429s.

Actions

  • Reduce per-binding concurrency & batch size; exponential backoff (up to 60 s).
  • Mark binding degraded; notify tenant; queue ops for replay post-clear.

Exit

  • No throttles for 30m; backlog cleared; re-enable normal concurrency.

RB-S6 Security Signal (Break-Glass / Suspicious Access)

Trigger: break-glass token used; abnormal policy denials; spike in auth failures.

Actions

  • IC includes Security On-Call; elevate to SEV-1 if needed.
  • Freeze non-essential changes; enable stricter PDP deny-by-default for admin ops.
  • Export evidence pack; verify audit hash chain; rotate credentials if necessary.

Exit

  • Root cause identified, evidence packaged, access reviewed & restored.

Hotfix Flow (code defects)

When to hotfix: Reproducible bug in latest release that materially impacts SLOs and cannot be mitigated by config/policy.

Steps

  1. Branch from last prod tag: release/vX.Y.Z-hotfix-N.
  2. Minimal, targeted change + unit tests; bump patch version.
  3. CI: build ➜ SBOM ➜ sign ➜ push by digest.
  4. Stage to staging with canary (5% ➜ 25% ➜ 50%); auto SLO analysis gates.
  5. Change window: If closed, IC invokes PDP override (audited).
  6. Promote to prod-green via Argo Rollouts canary; bake 10–15m.
  7. Flip traffic to green; keep blue warm for immediate rollback.
  8. Open follow-up PR to main with the same patch; prevent drift.

Backout

  • Rollouts auto-rollback on SLO gate failure.
  • Manual: set weights to 0/100 (green/blue), revert env manifest to previous digest.

Config Rollback Drill (blue/green alias)

Prereqs

  • Current alias prod-current ➜ Blue snapshot v1.8.3.
  • Candidate Green v1.9.0 already validated.

Drill (quarterly, scripted)

  1. Announce start in #ops; open mock incident ticket.
  2. Flip alias: ecsctl alias set --tenant t-123 --env prod --alias prod-current --version v1.9.0
  3. Verify:
    • Registry emits ConfigPublished; Orchestrator fan-out started.
    • Sample SDK resolves show new ETag; propagation p95 ≤ 5 s.
  4. Synthetic checks: canary services run health probes against critical keys.
  5. Rollback:
    • ecsctl alias set --tenant t-123 --env prod --alias prod-current --version v1.8.3
    • Confirm re-propagation; no errors.
  6. Record RTO for both directions; attach evidence (events, traces, timings).
  7. Reset to production state; close ticket with drill metrics.

Success criteria

  • Cutover and rollback each ≤ 2 minutes end-to-end; zero 5xx spikes; no policy denials.

RTO / RPO Objectives

| Scenario | Service/Data | RTO | RPO | Notes |
| --- | --- | --- | --- | --- |
| Single pod/node loss | All stateless | < 2m | 0 | HPA + PDB recover |
| AZ loss | All | < 5m | 0 | Multi-AZ; zone spread |
| Region impairment | Resolve (read) | < 5m | 0 | AFD failover; long-poll reconnect |
| Region impairment | Writes (tenant home region pinned) | < 15m | ≤ 5m* | *If temporary re-pin of tenant rows |
| Redis shard failover | Cache | < 30s | 0 | Serve stale (SWR), bypass Redis |
| PDP degraded | Decisions | < 5m | 0 | Last-good cache ≤ 60s |
| CRDB restore (logical) | Snapshots/audit | ≤ 60m | ≤ 5m | Point-in-time + hourly export |
| Object store loss (warm tier) | Audit exports | ≤ 4h | ≤ 24h | Rebuild from hot + backups |

Communications Templates

Initial External (SEV-1/2)

We’re investigating degraded performance impacting configuration reads for some tenants starting at HH:MM UTC. Mitigations are in progress. Next update in 30 minutes. Reference: INC-####.

Update

Mitigation active (traffic shifted to Region X). Metrics improving; monitoring continues. ETA to resolution N minutes.

Resolved

Incident INC-#### resolved at HH:MM UTC. Root cause: <one-line summary>. We’ll publish a full review within 5 business days.


Post-Incident Review (PIR) Template

  • Summary: what/when/who/impact scope.
  • Customer impact: symptoms, duration, tenants affected.
  • Timeline: detection ➜ declare ➜ mitigation ➜ recovery.
  • Root cause analysis: technical + organizational.
  • What worked / didn’t: detection, runbooks, tooling.
  • Action items: owners, due dates (prevent/mitigate/detect).
  • Evidence pack: dashboards, traces, logs, audit export.
  • Policy updates: SLO/SLA, change windows, guardrails.

Toolbelt (quick refs)

  • Rollouts & traffic
    • kubectl argo rollouts get rollout registry -n prod-ecs-green
    • kubectl argo rollouts promote registry -n prod-ecs-green
  • Envoy/Gateway
    • kubectl get httproute -A | grep registry
    • Adjust weights via Helm values PR or emergency patch (IC approval required).
  • ECSCTL
    • ecsctl alias set … (cutover/rollback)
    • ecsctl refresh broadcast --tenant t-123 --env prod --prefix features/
    • ecsctl audit export --tenant t-123 --from … --to …
  • KEDA & Bus
    • kubectl get scaledobject -A
    • az servicebus queue show … --query messageCount

All emergency patches must be reconciled back to Git within the incident window.


Drills & Readiness

  • Monthly: Config rollback drill (per major tenant).
  • Quarterly: Region failover gameday; Redis shard failover; adapter throttling simulation.
  • Semi-annual: Full disaster recovery restore test (CRDB PITR + audit verify).
  • Track drill RTO/RPO and MTTR trends on Ops dashboard.

Acceptance Criteria (engineering hand-off)

  • On-call rota, playbooks, and comms templates published in the Ops runbook repo.
  • Pager integration wired to SLO burn-rate and key symptom alerts.
  • “Big Red Button” actions scripted: alias flip, WS➜poll downgrade, tenant rate-limit override.
  • Drill automation scripts (ecsctl, Helm helpers) committed and documented.
  • PIR template enforced; incidents cannot be closed without action items and owners.
  • RTO/RPO objectives encoded in DR test plans with last measured values.

Solution Architect Notes

  • Favor per-tenant containment (rate limits, breakers, bindings) to preserve global SLOs.
  • Keep mitigation > diagnosis bias in the first 10 minutes; restore service, then dig deep.
  • Continue enriching runbooks with direct links to dashboards and ready-to-run commands for your platform.
  • Measure runbook MTTA (time-to-action) during drills; shorten with automation and safe defaults.

Business Continuity & DR — geo-replication, failover orchestration, drills, compliance evidence

Objectives

Design and prove a business-continuity strategy that keeps ECS available through AZ/region failures while preserving data integrity, tenant isolation, and regulatory evidence. Define geo-replication, orchestrated failover/failback, regular drills, and auditable proof of meeting RTO/RPO.


Continuity Posture at a Glance

| Layer | Strategy | RTO | RPO | Notes |
| --- | --- | --- | --- | --- |
| API/Gateway | Active-active across ≥2 regions (AFD + Envoy) | ≤ 5 min | 0 | Health-based routing; sticky to nearest allowed region |
| Config Registry data (CRDB) | Multi-region cluster, REGIONAL BY ROW (tenant home region) + PITR | ≤ 15 min (with re-pin) | ≤ 5 min* | *0 if home region up; ≤ 5 min on temporary re-pin |
| Cache (Redis) | Regional clusters, cache as disposable (Ent: geo-replication) | ≤ 30 s | 0 | Serve stale (SWR) on shard fail; warm on failover |
| Events (Service Bus) | Premium namespaces per geo + DR alias pairing | ≤ 10 min | ≤ 1 min | Alias flip; DLQ preserved |
| Studio/Static | Multi-origin CDN (blue/green buckets) | ≤ 5 min | 0 | Immutable assets |
| Audit & Exports | Hot in CRDB + nightly Parquet to regional object store | ≤ 60 min | ≤ 24 h (cold) | Hot data meets app RPO; archive meets compliance |
| Policy Bundles | Multi-region PDP cache + signed bundles | ≤ 5 min | 0 | Edge “last-good” TTL ≤ 60s |

Geo-Replication Topology

flowchart LR
  AFD[Azure Front Door + WAF] --> RG1[Envoy/Gateway - Region A]
  AFD --> RG2[Envoy/Gateway - Region B]

  subgraph Region A
    REG1[Registry]-->CRDB[(CockroachDB)]
    PDP1[PDP]
    ORC1[Orchestrator]-->ASB1[(Service Bus A)]
    WS1[WS/LP Bridge]
    REDIS1[(Redis A)]
  end

  subgraph Region B
    REG2[Registry]-->CRDB
    PDP2[PDP]
    ORC2[Orchestrator]-->ASB2[(Service Bus B)]
    WS2[WS/LP Bridge]
    REDIS2[(Redis B)]
  end

  ASB1 <--DR alias--> ASB2
  CRDB ---|Multi-Region Replication| CRDB
Hold "Alt" / "Option" to enable pan & zoom

Key design points

  • Stateless services are active in all regions; sessionless by design.
  • CRDB hosts a single logical cluster spanning regions; tenant rows are homed to a region for write locality; reads are global.
  • Redis is regional. On failover, caches are rebuilt; no cross-region consistency needed.
  • Service Bus uses Geo-DR alias; producers/consumers bind via alias, not direct namespace name.

Failover Orchestration (Region)

Decision Ladder

  1. Detect: AFD health probes red OR SLO burn-rate breach OR operator declares.
  2. Decide: IC invokes RegionFailoverPlan (automated policy: partial or full).
  3. Act: Route, Data, Events steps (below) in order.
  4. Verify: SLOs green, data path healthy, backlog drained.
  5. Communicate: status page + tenant comms.
  6. Recover/Failback: after root cause fixed and consistency verified.

Orchestration Steps (automated runbook)

sequenceDiagram
  participant Mon as Monitor/SLO
  participant IC as Incident Cmd
  participant AFD as Azure Front Door
  participant GW as Envoy/Gateway
  participant BUS as Service Bus DR
  participant REG as Registry
  participant PDP as Policy
  participant CRDB as CockroachDB

  Mon-->>IC: Region A unhealthy / SLO burn
  IC->>AFD: Disable Region A origins; 100% to Region B
  IC->>BUS: Flip Geo-DR alias to Namespace B
  IC->>REG: Set tenant mode = read-only for homeRegion=A
  alt Extended outage > X min (policy)
    IC->>CRDB: Re-pin affected tenants to Region B (scripted)
    REG-->>PDP: Emit TenantRehomed obligations
  end
  IC->>REG: Trigger warm-up (keyheads) + Refresh broadcast
  Mon-->>IC: SLOs recovered
  IC->>Comms: Resolved update; start failback plan (separate window)
Hold "Alt" / "Option" to enable pan & zoom

Controls

  • Read-only mode: protects writes for tenants whose home region is down; SDKs continue reads via long-poll.
  • Tenant re-pin (optional): migrate leaseholders/zone configs for the tenant partitions to Region B; audited and reversible.

Data Protection & Recovery

  • Backups: CRDB full weekly + incremental hourly; PITR ≥ 7 days.
  • Restore drills: Quarterly logical restores into staging namespace; verify checksums and lineage hashes.
  • Audit chain: hash-chained audit events verified post-restore to prove integrity.
  • Object store: nightly Parquet exports (signed manifest for Ent); regional buckets aligned to residency.

Eventing Continuity (Service Bus)

  • Producer/consumer endpoints use DR alias name.
  • Failover flips alias from Namespace A → B; consumers reconnect automatically.
  • Ordering: not guaranteed across flip; idempotency keys protect replays.
  • DLQ in active namespace is exported pre-flip and re-enqueued post-flip.

Redis Strategy

  • Treat as ephemeral: no cross-region state transfer required.
  • On failover:
    • Serve stale for ≤ 2 s (SWR) during shard promotions.
    • Fire pre-warm job: touch common keyheads (per tenant/app).
    • Observe hit ratio; scale shards if p95 latency > 3 ms or evictions > 0.

Failback (Return to Steady State)

  1. Health gate: region green ≥ 60 min; root cause fixed.
  2. Data: if tenants re-pinned, choose stay or migrate back (off-hours).
  3. Traffic: gradually re-enable AFD weight to original distribution.
  4. Events: DR alias back to primary namespace; drain backlog; compare counts.
  5. Post-ops: remove temporary overrides; finalize incident & PIR.

DR Exercises & Drill Catalog

| Drill | Scope | Frequency | Success Criteria |
| --- | --- | --- | --- |
| AZ Evacuation | Evict a zone; validate PDB, no SLO breach | Quarterly | p95 latency steady; 0 error spikes |
| Region Failover | Full traffic shift + Bus alias flip | Quarterly | RTO ≤ 5 min; propagation p95 ≤ 5 s |
| Tenant Re-Pin | Move sample tenants’ home region | Semi-annual | RPO ≤ 5 min; no cross-tenant impact |
| CRDB PITR Restore | Point-in-time restore of config set | Semi-annual | Data matches lineage hash; audit chain verifies |
| Redis Shard Loss | Kill primary; observe recovery | Quarterly | RTO ≤ 30 s; SWR served |
| Policy/PDP Degrade | Simulate outage | Quarterly | Edge uses last-good; admin ops denied |

Automation

  • ecsctl dr plan <region> prints executable plan with guardrails.
  • Chaos Mesh/Envoy faults wired to rehearsal scripts.
  • Drill evidence exported automatically (metrics, traces, manifests).
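
Illustrative only: a hypothetical ecsctl dr plan weu output, mirroring the failover steps in RB-DR1 below; the plan format itself is an assumption:

plan: region-failover
from: weu
to: neu
guardrails: { dryRun: true, twoPersonRule: true }
steps:
  - afd:     { action: disable-origin-group, region: weu }
  - bus:     { action: dr-flip, alias: ecs-events, to: neu }
  - tenants: { action: set-mode, homeRegion: weu, mode: read-only }
  - repin:   { action: repin-tenants, condition: "outage exceeds policy threshold" }
  - cache:   { action: warmup, region: neu, tenantSample: 100 }
  - verify:  { checks: [resolve-p95, error-rate, propagation-lag, pdp-latency] }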

Runbooks (excerpts)

RB-DR1 Region Failover (operator-driven)

  1. Declare SEV-1, assign roles.
  2. afdctl region disable --name weu (disables the AFD origin group), then ecsctl bus dr-flip --alias ecs-events --to neu
  3. ecsctl tenant set-mode --home weu --mode read-only
  4. If outage > X min: ecsctl tenant repin --tenant t-*
  5. ecsctl cache warmup --region neu --tenant-sample 100
  6. Verify: Resolve p95, error%, propagation, PDP latency.
  7. Comms: status page update; every 30m until resolved.

RB-DR2 Failback

  1. Confirm region healthy 60m; run smoke tests.
  2. If re-pinned: schedule change window; repin back or keep.
  3. Flip AFD back; bus alias flip; validate metrics.
  4. Close incident; attach evidence pack.

Compliance Evidence (SOC 2 / ISO 27001)

Artifacts produced automatically per drill or real event

  • DR Runbook Execution Log: timestamped steps, operator IDs, commands, and results.
  • SLO Evidence: before/after p95/p99, error %, burn-rate graphs with exemplars.
  • Data Integrity Proof: audit hash-chain verification report; CRDB backup manifest & PITR point.
  • Change Records: AFD config deltas, Bus alias flips, tenant mode changes (all audited).
  • Post-Incident Review: root cause, corrective actions, owners & due dates.

Controls mapping

  • A.17 (ISO 27001) / CC7.4 (SOC 2): BC/DR plan, tested with evidence.
  • A.12 / CC3.x: Backups & restorations validated and logged.
  • A.5 / CC6.x: Roles & responsibilities during incidents clearly defined.

Observability for BC/DR

  • Spans: ecs.dr.failover.plan, ecs.dr.failover.execute, ecs.dr.failback.execute.
  • Metrics: dr_failover_duration_ms, dr_events_replayed_total, tenants_rehomed_total, afd_origin_healthy.
  • Alerts:
    • Region health red and p95 > 300 ms → Page.
    • DR alias mismatch vs intended → Warn.
    • audit_chain_verify_failures_total > 0 after restore → Page.

Guardrails & Safety Checks

  • Dry-run for all DR commands (diff/preview).
  • Two-person rule for tenant re-pin and bus alias flips.
  • Change windows enforced for failback; failover exempt under SEV-1.
  • All actions idempotent and audited with correlation IDs.

Acceptance Criteria (engineering hand-off)

  • AFD, Envoy, Service Bus aliasing, and CRDB multi-region configured per topology.
  • ecsctl DR subcommands implemented (disable/enable region, bus DR flip, tenant mode, re-pin, cache warmup) with dry-run.
  • Runbooks RB-DR1/RB-DR2 published; drill automation in CI (staging).
  • Quarterly region failover drill executed with recorded RTO/RPO; evidence pack generated and stored.
  • CRDB PITR enabled; restore procedure validated and documented.
  • Alerts and dashboards for DR state live; incident templates linked to BC/DR plan.

Solution Architect Notes

  • Keep tenant re-pin rare and audited; prefer read-only until primary region stabilizes.
  • Treat Redis as rebuildable; don’t pay cross-region cache tax unless a clear SLA needs it.
  • For high-regulation tenants, offer enhanced continuity (dual-home with stricter quorum) as an Enterprise add-on.
  • Rehearse communications as much as technology—clear, timely updates reduce incident impact for customers.

Readiness & Handover — quality gates, checklists, PRR, cutover plan, training notes for Eng/DevOps/Support

Objectives

Ensure ECS can be safely operated in production by codifying quality gates, pre-flight checklists, a Production Readiness Review (PRR), the cutover run-of-show, and role-specific training for Engineering, DevOps/SRE, and Support. Outputs are actionable, auditable, and aligned to ConnectSoft standards (Security-First, Observability-First, Clean Architecture, SaaS multi-tenant).


Quality Gates (build → deploy → operate)

| Gate | Purpose | Evidence / Automation | Blocker if Fails |
| --- | --- | --- | --- |
| Supply Chain | Provenance & integrity | Cosign signatures verified in admission; SBOM attached; image pulled by digest | ✅ |
| Security | Hardening & secrets | SAST/dep scan clean or accepted; container scan: no High/Critical; mTLS on; secret refs enforced | ✅ |
| Contracts | API/gRPC stability | Backward-compat tests; schema compatibility; policy DSL parser tests | ✅ |
| Performance | Meet p95/p99 targets | Resolve p99 ≤ targets; publish→refresh p95 ≤ SLO; capacity headroom ≥ 2× | ✅ |
| Resiliency | Degrade gracefully | Retry/circuit/bulkhead tests; chaos suite C-1…C-7 green | ✅ |
| Observability | Trace/metrics/logs ready | OTEL spans present; dashboards published; SLO alerts wired; exemplars linkable | ✅ |
| Compliance | Audit, PII, SoD | Audit hash chains; SoD enforced; retention set; evidence pack job runs | ✅ |
| FinOps | Cost guardrails | Usage meters emitting; budgets configured; anomaly jobs active | ⚠ (warn) |
| Ops Runbooks | Operability | Incident & DR runbooks in repo; drills executed within last 90d | ✅ |
| Docs & Support | Handover complete | Knowledge base articles; FAQ; escalation tree | ✅ |

All ✅ gates are mandatory for promotion from staging → production.


Production Readiness Review (PRR)

Owner: Solution Architect (chair) + SRE + Security + Product. Format: single session with artifact walkthrough; outcomes recorded.

PRR Checklist (submit 48h before meeting)

  • Architecture & Risks
    • Service diagrams current; ADRs for key decisions
    • Threat model updated; mitigations tracked
  • Security & Compliance
    • SBOM, signatures, provenance attached to images
    • Pen-test/DAST findings triaged (no High open)
    • Audit pipeline verified; daily manifest signing (if Ent)
    • PII posture reviewed; DSAR playbook linked
  • Performance & Scale
    • Load test report (p99, throughput, coalescing ratio)
    • HPA/KEDA policies reviewed; surge plan validated
    • Redis sizing workbook & eviction alerts validated
  • Resiliency & DR
    • Chaos results (C-1..C-7) with pass/fail & action items
    • Region failover drill RTO/RPO evidence
  • Observability
    • SLOs encoded; burn-rate alerts live
    • Dashboards: SRE, Tenant Health, Security, Adapters
    • Log redaction tests ✅
  • Operations
    • On-call rota & paging configured; escalation tree
    • Runbooks: latency spike, DLQ growth, authZ degrade
    • Hotfix flow rehearsed; rollback scripts (ecsctl) present
  • Change Management
    • Policy packs reviewed (change windows, approvals)
    • CAB sign-off (if required)
  • Documentation & Support
    • Admin/tenant guides, SDK quickstarts, Studio user guide
    • Support KB: top 20 issues + macros
  • Go/No-Go
    • Go-criteria met (see table)
    • Backout plan validated

Go / No-Go Criteria

| Area | Go Criteria | Evidence |
| --- | --- | --- |
| Availability | Resolve availability ≥ 99.95% over the staging week | SLO metric export |
| Latency | Resolve p99 ≤ 150 ms (hit) / ≤ 400 ms (miss) | Perf report + dashboards |
| Security | 0 exploitable High/Critical vulns; mTLS enabled | Scan reports, config |
| DR | Region failover RTO ≤ 5 min; PITR demo ≤ 60 min | Drill logs + evidence pack |
| Obs | All golden dashboards present; alerts firing in test | Screenshots/links |
| Ops | On-call & comms templates finalized | Pager config + docs |
| Cost | Budgets & alerts active; forecast within plan | FinOps dashboard |

Pre-Flight Checklists (by domain)

Platform (once per region)

  • AFD origins healthy; WAF policies applied
  • Envoy routes with ext_authz, RLS, traffic splits
  • AKS nodepools spread across 3 AZs; PDBs applied
  • Argo CD and Rollouts healthy; image digests pinned
  • KEDA & HPA metrics sources ready (Prometheus)

Data & Storage

  • CockroachDB multi-region up; leaseholder locality by tenant
  • PITR window configured; backup jobs green
  • Redis cluster shards steady; no evictions; latency p95 < 3 ms

Eventing & Adapters

  • Service Bus DR alias validated (flip test in staging)
  • DLQs empty; replay tested
  • Adapter credentials & quotas verified; watch cursors in sync

Security

  • JWKS rotation tested; last rotation < 90d
  • Secret ref resolution (kvref://) e2e test passes
  • Admission policies (cosign verify, non-root) enforced

Observability

  • OTEL collector pipelines tail-sampling active
  • SLO burn-rate alerts mapped to PagerDuty
  • Audit chain verification job succeeded in last 24h

Cutover Plan (run-of-show)

Timeline (example anchors)

  • T-7d: Staging canary passes SLO gates; PRR complete; tenant comms drafted
  • T-1d: Freeze non-essential changes; final data validation; backup checkpoint
  • T-0: Production green namespace deployed; 5% → 25% → 50% canary with analysis
  • T+30–60m: Flip traffic 100% to green; keep blue hot for rollback (2–24h window)
  • T+24h: Post-cutover review; decommission blue after sign-off

Roles

  • IC (lead), Ops Lead, Service SME, Comms Lead, Scribe

Steps (T-0 detailed)

  1. Announce start; open bridge & incident room (change record)
  2. Verify prerequisites (see Pre-Flight) and go/no-go check
  3. Argo Rollouts canary start; watch SLO analysis; hold at 50% for 10–15m
  4. Promote to 100%; observe p95/p99, error %, propagation lag
  5. Trigger config canary on limited tenants (prod-next alias) if planned
  6. Flip alias to prod-current; validate refresh events; sample SDK resolve shows new ETag
  7. Hypercare window: monitor, keep blue ready; publish customer update
  8. Close with initial success metrics; schedule 24h review

Backout (any time)

  • Service: set weights to 0/100 (green/blue), sync prev digest
  • Config: revert alias pointer to last known good version
  • Comms: issue backout notice; continue root-cause work
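
The service weight flip can be pictured as a Gateway API HTTPRoute change; service and namespace names are assumptions, and cross-namespace backendRefs assume ReferenceGrants are in place:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata: { name: registry, namespace: ecs-gateway }
spec:
  parentRefs: [{ name: ecs-gateway }]
  rules:
    - backendRefs:
        - { name: registry, namespace: prod-ecs-green, port: 8080, weight: 0 }    # backout: drain green
        - { name: registry, namespace: prod-ecs-blue,  port: 8080, weight: 100 }  # restore blue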

Handover Package (what Engineering delivers to Ops/Support)

| Artifact | Description |
| --- | --- |
| Service Runbook | Start/stop, health checks, dependencies, common faults |
| Resiliency Profile | Timeouts, retries, backoff, bulkheads, circuits (effective values) |
| SLO/SLA Sheet | SLIs, targets, alert policies, escalation paths |
| Dashboards Index | Links to Golden Signals, Tenant Health, Security, Adapters |
| Config Reference | Default values, schema refs, policy overlays per edition |
| Playbooks | Incident scenarios, DR steps, hotfix flow |
| Release Notes | Known issues, feature flags, migration notes |
| Test Evidence | Perf, chaos, DR drills, security scans |
| Compliance Evidence | Audit manifest sample, retention policy, PII posture |
| Contact Matrix | SMEs by component, backup ICs, vendor contacts |

All artifacts live in ops/runbooks, ops/evidence, and tenant-safe copies in Support KB.


Training Notes (role-based)

Engineering

  • Goal: Own code in prod responsibly.
  • Modules
    • Architecture & domain model (2h)
    • Resiliency & chaos toolkit (90m)
    • Observability deep-dive: tracing taxonomies, SLOs (90m)
    • Secure coding + secret refs (60m)
    • Release & hotfix procedures (60m)
  • Labs
    • Break and fix: inject 200 ms Redis latency → verify degrade path
    • Canary failure drill with Argo Rollouts

DevOps / SRE

  • Goal: Operate at SLO; minimize MTTR.
  • Modules
    • DR & failover orchestration (2h)
    • Capacity & autoscaling (HPA/KEDA) (90m)
    • FinOps dashboards & budgets (60m)
    • Compliance evidence generation (45m)
  • Labs
    • Region failover simulation; measure RTO/RPO
    • DLQ growth → replay & recovery

Support / CSE

  • Goal: First-line diagnosis & clear comms.
  • Modules
    • Studio & API basics; common errors (90m)
    • Reading dashboards; when to escalate (60m)
    • Tenant cost & quota coaching (45m)
  • Macros & KB
    • 304 vs 200 ETag explainer, rate-limit guidance, policy denial decoding, rollback steps
  • Escalation
    • SEV thresholds, IC paging, evidence to capture (tenant, route, x-correlation-id)

Day-0 / Day-1 / Day-2 Operations

  • Day-0 (Cutover): execute plan; hypercare (24h); record metrics snapshot
  • Day-1 (Stabilize): finalize blue tear-down; adjust HPA/KEDA; clear temporary overrides
  • Day-2+ (Operate): weekly perf+cost review; monthly rollback drill; quarterly DR/chaos gameday

Checklists (copy-paste ready)

Go/No-Go (final 1h)

  • All SLO tiles green 30m
  • No High/Critical CVEs in diff since staging
  • DR alias & AFD health verified
  • Pager test fired to on-call
  • Backout commands pasted & dry-run output saved

Post-Cutover (T+60m)

  • Error% ≤ baseline; p95/p99 ≤ targets
  • Propagation p95 ≤ 5s; DLQ normal
  • Customer comms sent; status page updated
  • Blue kept hot; timer set for 24h review

Acceptance Criteria (engineering hand-off)

  • PRR completed with Go decision; artifacts archived.
  • All Quality Gates enforced in CI/CD and verified in admission.
  • Pre-flight checklists executed and recorded per region.
  • Cutover plan executed with evidence bundle (metrics, traces, audit exports).
  • Handover package delivered to Ops/Support; training sessions completed and tracked.
  • Backout tested; RTO/RPO measured and logged.

Solution Architect Notes

  • Keep runbooks short and command-first; link deep docs rather than embedding.
  • Treat training as a product: measure retention via quarterly drills and refreshers.
  • Automate evidence packs after PRR, cutover, and drills—compliance should be a by-product of good operations.