📄 External Configuration System - Solution Design Document¶
SDS-MVP Overview – External Configuration System (ECS)¶
🎯 Context¶
The External Configuration System (ECS) is ConnectSoft’s Configuration-as-a-Service (CaaS) solution. It provides a centralized, secure, and tenant-aware configuration backbone for SaaS and microservice ecosystems, built on ConnectSoft principles of Clean Architecture, DDD, cloud-native design, event-driven mindset, and observability-first execution.
ECS addresses the operational complexity of managing configuration at scale, offering:
- Centralized registry with versioning and rollback.
- Tenant and edition-aware overlays.
- Refreshable and event-driven propagation.
- Multi-provider adapters (Azure AppConfig, AWS AppConfig, Redis, SQL/CockroachDB, Consul).
- REST/gRPC APIs, SDKs, and a Config Studio UI.
📦 Scope of MVP¶
The MVP will focus on delivering the core configuration lifecycle:
- Core Services
  - Config Registry: Authoritative store of configuration items, trees, and bundles.
  - Policy Service: Tenant, edition, and environment overlays with validation rules.
  - Refresh Service: Event-driven propagation to SDKs and client services.
  - Adapter Hub: Extensible connectors to external providers.
- Access Channels
  - APIs: REST + gRPC endpoints with OpenAPI/gRPC contracts.
  - SDKs: .NET, JS, and Mobile libraries with caching and refresh subscription.
  - Config Studio: UI for admins, developers, and viewers with approval workflows.
- Foundational Layers
  - Security-first: OIDC/OAuth2, RBAC, edition-aware scoping, secret management.
  - Observability-first: OTEL spans, metrics, structured logs, trace IDs.
  - Cloud-native: Containerized, autoscaling, immutable deployments.
  - Event-driven: CloudEvents-based change propagation.
🧑‍🤝‍🧑 Responsibilities by Service¶
| Service | Responsibility |
|---|---|
| Config Registry | CRUD, versioning, history, rollback of configuration. |
| Policy Service | Edition/tenant overlays, RBAC enforcement, schema validation. |
| Refresh Service | Emits ConfigChanged events, ensures idempotent propagation. |
| Adapter Hub | Provider adapters for Azure AppConfig, AWS AppConfig, Redis, SQL, Consul. |
| Gateway/API | Auth, routing, rate limits, tenant scoping. |
| SDKs | Local caching, refresh subscription, fallback logic. |
| Config Studio | Admin UX, workflow approvals, auditing, policy management. |
🗺️ High-Level Service Map¶
flowchart TD
ClientApp[Client Apps / SDKs]
Studio[Config Studio UI]
Gateway[API Gateway]
Registry[Config Registry]
Policy[Policy Service]
Refresh[Refresh Service]
Adapter[Adapter Hub]
ClientApp -->|REST/gRPC| Gateway
Studio -->|Admin APIs| Gateway
Gateway --> Registry
Gateway --> Policy
Registry --> Refresh
Refresh --> ClientApp
Registry --> Adapter
Adapter --> External[Azure/AWS/Redis/SQL/Consul]
🏢 Tenant and Edition Model¶
- Multi-Tenant Isolation: Every tenant has an isolated configuration namespace, scoped across DB, cache, and API.
- Edition Overlays: Global → Edition → Tenant → Service → Instance inheritance chain. Editions (e.g., Free, Pro, Enterprise) overlay baseline config with restrictions and feature toggles.
- Environments: Dev, Stage, and Prod environments are first-class entities, enforced via approval workflows in Studio.
- RBAC Roles:
  - Admin: Full control of config + secrets.
  - Developer: Can manage configs but not override edition-level rules.
  - Viewer: Read-only access.
📊 Quality Targets¶
| Area | Target |
|---|---|
| Availability | 99.9% SLA (single-region MVP), roadmap for multi-region HA. |
| Latency | < 50ms config retrieval via API or SDK cache. |
| Scalability | 100 tenants, 10k config items per tenant in MVP scope. |
| Security | Zero Trust, tenant isolation, no cross-tenant leaks. |
| Observability | 100% trace coverage with traceId, tenantId, editionId. |
| Resilience | Local SDK caching, retry + at-least-once event delivery. |
| Extensibility | Modular provider adapters and schema-based validation. |
✅ Solution Architect Notes¶
- MVP services must be scaffolded using ConnectSoft.MicroserviceTemplate for uniformity.
- Registry persistence in CockroachDB with tenant-aware schemas.
- Hot-path caching via Redis with per-tenant keyspaces.
- API Gateway secured with OpenIddict / OAuth2 and workload identities.
- Events propagated via MassTransit over Azure Service Bus (default) with DLQ and replay.
- SDKs should include offline-first fallback for resilience.
- Next solution cycles should expand into API contracts, event schemas, and deployment topologies.
Service Decomposition & Boundaries — ECS Core¶
This section defines the solution‑level decomposition and interaction boundaries for ECS MVP services: Config Registry, Policy Engine, Refresh Orchestrator, Adapter Hub, Config Studio UI, Auth, and API Gateway. It is implementation‑ready for Engineering/DevOps teams using ConnectSoft templates.
🧩 Components & Responsibilities¶
| Component | Primary Responsibilities | Owns | External Calls | Exposes |
|---|---|---|---|---|
| API Gateway | Routing, authN/authZ enforcement, rate limiting, tenant routing, request shaping, canary/blue‑green | N/A | Auth, Core services | REST/gRPC north‑south edge |
| Auth (OIDC/OAuth2) | Issuer, scopes, service‑to‑service mTLS, token introspection, role->scope mapping | Identity config | N/A | OIDC endpoints, JWKS |
| Config Registry | CRUD of config items/trees/bundles; immutable versions; diffs; rollback; audit append | Registry schema in CockroachDB; audit log | Redis (read‑through), Event Bus (emit), DB | REST/gRPC: configs, versions, diffs, rollback |
| Policy Engine | Schema validation (JSON Schema), edition/tenant/env overlays, approval policy evaluation | Policy rules, schemas | Registry (read), Event Bus (emit on policy change) | REST: validate, resolve preview; Policy mgmt APIs |
| Refresh Orchestrator | Emits ConfigPublished events; fan‑out invalidations; long‑poll/WebSocket channel mgmt | Delivery ledger (idempotency), consumer offsets | Event Bus (consume/produce), Redis (invalidate) | Event streams; refresh channels |
| Adapter Hub | Pluggable connectors to Azure AppConfig/AWS AppConfig/Consul/Redis/SQL; mirror/sync | Adapter registrations; provider cursors | Providers; Registry (write via API), Bus (emit) | REST: adapter mgmt; background sync jobs |
| Config Studio UI | Admin UX, drafts, reviews, approvals, diffs, audit viewer, policy editor | N/A (UI only) | Gateway, Auth | Browser app; Admin APIs via Gateway |
Tech Baseline (all core services): .NET 9, ConnectSoft.MicroserviceTemplate, OTEL, HealthChecks, MassTransit, Redis, CockroachDB (NHibernate for Registry, Dapper allowed in Adapters).
🧭 Bounded Contexts & Ownership¶
| Bounded Context | Aggregate Roots | Key Invariants | Who Can Change |
|---|---|---|---|
| Registry | ConfigItem, ConfigVersion, Bundle, AuditEntry | Versions are append‑only; rollback creates a new version; paths unique per (tenant, env) | Studio (human), Adapter Hub (automated), CI (automation) |
| Policy | PolicyRule, Schema, EditionOverlay | Resolve = Base ⊕ Edition ⊕ Tenant ⊕ Service ⊕ Instance; schema must validate pre‑publish | Studio (policy admins) |
| Propagation | Delivery, Subscription | At‑least‑once; idempotency by (tenant, path, version, etag) | Refresh Orchestrator |
| Adapters | ProviderRegistration, SyncCursor, Mapping | No direct DB writes; must use Registry APIs; deterministic mapping | Adapter Hub |
🔐 Tenancy & Scope Enforcement¶
| Layer | How Enforced | Notes |
|---|---|---|
| Gateway | JWT claims: tenant_id, scopes[]; per‑tenant rate limits; route constraints | Reject cross‑tenant routes |
| Services | Multi‑tenant keyspace (tenant:{id}:...) for cache; DB row filters; repository guards | Tenancy middleware in template |
| Events | CloudEvents extensions: tenantId, editionId, environment | Dropped if missing/invalid |
🔄 Interaction Rules (Who May Call Whom)¶
- Studio UI → Gateway → {Registry, Policy}
- Adapters → Gateway → Registry (never direct DB access)
- Policy ↔ Registry (read for validation & preview only; Registry calls Policy for validation)
- Registry → Refresh Orchestrator (emit) → Event Bus → SDKs/Consumers
- Orchestrator → Redis (targeted invalidations)
- Gateway ↔ Auth (token validation/JWKS)
Forbidden: Cross‑context writes (e.g., Policy writing Registry tables), Adapters writing DB directly, services bypassing Gateway for north‑south.
📡 Contracts — Endpoints & Events (MVP surface)¶
REST (Gateway‑fronted)
- POST /configs/{path}:save-draft
- POST /configs/{path}:publish
- GET /configs/{path}:resolve?version|latest&env&service&instance (ETag, If‑None‑Match)
- GET /configs/{path}/diff?fromVersion&toVersion
- POST /policies/validate (body: config draft + context)
- PUT /policies/rules/{id} / PUT /policies/editions/{editionId}
- POST /adapters/{providerId}:sync / GET /adapters
gRPC (low‑latency SDK)
- ResolveService.Resolve(ConfigResolveRequest) → unary or stream
- RefreshChannel.Subscribe(Subscription) → server stream
CloudEvents (Event Bus topics)
- ecs.config.v1.ConfigDraftSaved
- ecs.config.v1.ConfigPublished
- ecs.policy.v1.PolicyUpdated
- ecs.adapter.v1.SyncCompleted
- ecs.refresh.v1.CacheInvalidated
Event fields (common): id, source, type, time, dataRef?; ext: tenantId, editionId, environment, path, version, etag.
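For illustration only, a minimal C# sketch of how a producer could populate this envelope, assuming System.Text.Json; the `EcsConfigEvent` record and `EcsEventFactory` helper are hypothetical names, not part of the contract, and only the field names above are taken from this section.

```csharp
using System.Text.Json;

// Minimal CloudEvents-style envelope carrying the common fields and ECS extension
// attributes listed above (tenantId, editionId, environment, path, version, etag).
public sealed record EcsConfigEvent(
    string Id,                 // unique event id (consumers dedupe on this)
    string Source,             // e.g. "/ecs/registry"
    string Type,               // e.g. "ecs.config.v1.ConfigPublished"
    DateTimeOffset Time,
    string TenantId,
    string EditionId,
    string Environment,
    string Path,
    string Version,
    string Etag,
    string? DataRef = null);   // pointer to payload instead of inlining large bodies

public static class EcsEventFactory
{
    public static EcsConfigEvent ConfigPublished(
        string tenantId, string editionId, string environment,
        string path, string version, string etag) =>
        new(
            Id: Guid.NewGuid().ToString("N"),
            Source: "/ecs/registry",
            Type: "ecs.config.v1.ConfigPublished",
            Time: DateTimeOffset.UtcNow,
            TenantId: tenantId,
            EditionId: editionId,
            Environment: environment,
            Path: path,
            Version: version,
            Etag: etag);

    // Serialize with camelCase names as typically expected by JSON event consumers.
    public static string Serialize(EcsConfigEvent evt) =>
        JsonSerializer.Serialize(evt, new JsonSerializerOptions
        {
            PropertyNamingPolicy = JsonNamingPolicy.CamelCase
        });
}
```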
🗺️ Component Diagram¶
flowchart LR
subgraph Edge
GW[API Gateway]
AUTH["Auth (OIDC)"]
end
subgraph Core
REG[Config Registry]
POL[Policy Engine]
REF[Refresh Orchestrator]
ADP[Adapter Hub]
end
SDK[SDKs/.NET JS Mobile]
UI[Config Studio UI]
BUS[(Event Bus)]
REDIS[(Redis)]
CRDB[(CockroachDB)]
EXT[Ext Providers: Azure/AWS/Consul/Redis/SQL]
UI-->GW
SDK-->GW
GW-->REG
GW-->POL
REG-- read/write -->CRDB
REG-- emit -->REF
REG-- hot read -->REDIS
REF-- publish -->BUS
BUS-- notify -->SDK
REF-- targeted del -->REDIS
ADP-- API -->GW
ADP-- sync -->EXT
GW-- JWT/JWKS -->AUTH
📚 Key Sequences¶
1) Resolve (read hot path)¶
sequenceDiagram
participant SDK
participant GW
participant REG as Registry
participant RED as Redis
participant POL as Policy
SDK->>GW: GET /configs/{path}:resolve (If-None-Match: etag)
GW->>RED: GET tenant:path:etag
alt Cache hit & ETag matches
RED-->>GW: 304
GW-->>SDK: 304 Not Modified
else Miss or ETag mismatch
GW->>REG: resolve(path, ctx)
REG->>POL: validate+overlay(draft/version, ctx)
POL-->>REG: resolved value + etag
REG->>RED: SET tenant:path -> value, etag, ttl
REG-->>GW: 200 + value + ETag
GW-->>SDK: 200 + value + ETag
end
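To ground the hot path above from the client side, here is a hedged C# sketch of an SDK caller that caches value + ETag per (tenant, path) and sends If‑None‑Match so an unchanged config costs only a 304. The `ConfigResolveClient` class and its in-memory dictionary cache are illustrative; endpoint and header shapes follow this section.

```csharp
using System.Net;
using System.Net.Http.Headers;

// Client-side view of the resolve hot path: conditional GET with ETag caching.
public sealed class ConfigResolveClient(HttpClient http)
{
    private readonly Dictionary<string, (string Etag, string Value)> _cache = new();

    public async Task<string> ResolveAsync(string tenantId, string path, CancellationToken ct = default)
    {
        var cacheKey = $"{tenantId}:{path}";
        using var request = new HttpRequestMessage(HttpMethod.Get, $"/configs/{path}:resolve");
        request.Headers.Add("X-Tenant-Id", tenantId);

        if (_cache.TryGetValue(cacheKey, out var cached))
            request.Headers.IfNoneMatch.Add(new EntityTagHeaderValue(cached.Etag));

        using var response = await http.SendAsync(request, ct);

        if (response.StatusCode == HttpStatusCode.NotModified)
            return cached.Value;                      // 304: reuse the cached value

        response.EnsureSuccessStatusCode();
        var value = await response.Content.ReadAsStringAsync(ct);
        var etag = response.Headers.ETag?.Tag;
        if (etag is not null)
            _cache[cacheKey] = (etag, value);         // remember value + ETag for next call
        return value;
    }
}
```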
2) Publish (write warm path)¶
sequenceDiagram
participant UI as Studio UI
participant GW
participant REG as Registry
participant REF as Orchestrator
participant BUS as Event Bus
participant RED as Redis
UI->>GW: POST /configs/{path}:publish (Idempotency-Key)
GW->>REG: publish(path, draftId)
REG-->>REG: create immutable version, append audit
REG->>RED: DEL tenant:path:* (scoped)
REG->>REF: emit ConfigPublished(tenant, path, version, etag)
REF->>BUS: publish CloudEvent
BUS-->>SDK: wake/notify
UI-->>UI: 200 Published(version, etag)
3) Adapter Sync (ingest)¶
sequenceDiagram
participant ADP as Adapter Hub
participant EXT as Ext Provider
participant GW
participant REG as Registry
ADP->>EXT: fetch changes(since cursor)
EXT-->>ADP: items delta
ADP-->>ADP: map/transform (deterministic)
ADP->>GW: POST /configs/{path}:save-draft (service identity)
ADP->>GW: POST /configs/{path}:publish
GW->>REG: publish(...)
REG-->>ADP: ack + new cursor
📈 NFRs by Service (Solution Targets)¶
| Service | Latency | Throughput | Scaling | Storage |
|---|---|---|---|---|
| Gateway | < 5ms overhead | 5k RPS | HPA by RPS & p99 | N/A |
| Registry | p99 < 20ms (cached), < 50ms (DB) | 2k RPS read / 200 RPS write | HPA by CPU + DB queue | CockroachDB |
| Policy | p99 < 15ms | 1k RPS | Co‑locate with Registry; cache schemas | Internal store |
| Orchestrator | < 100ms publish | 5k msg/s | Scale by bus lag (KEDA) | Small ledger |
| Adapter Hub | N/A interactive | bursty sync | Per‑adapter workers | Provider cursors |
| Redis | p99 < 2ms | >50k ops/s | Clustered shards | In‑memory |
🚦 Failure Modes & Backpressure¶
| Scenario | Behavior | Operator Signal |
|---|---|---|
| Registry DB degraded | Serve from Redis cache; reject writes with 503 RETRY | Alert: p99 DB latency > threshold |
| Event bus outage | Buffer in Orchestrator (bounded); degrade to polling | Alert: bus lag + DLQ growth |
| Adapter provider slow | Backoff + skip tenant slice; do not block core | Adapter sync error rate |
| Policy validation fail | Block publish; keep draft; return violations | Policy violation dashboard |
| Cache stampede | Single‑flight per (tenant, path); stale‑while‑revalidate | Cache hit ratio dip |
Idempotency for publishes: Idempotency-Key header → (tenant, path, hash(body)) stored for TTL to prevent duplicates.
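A minimal sketch of that dedupe rule, assuming an in-process store for clarity (a real deployment would back this with Redis); the `PublishIdempotencyGuard` name and its API are illustrative.

```csharp
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;

// Dedupe key derived from (tenant, path, hash(body)) plus the caller-supplied
// Idempotency-Key; records expire after the configured TTL.
public sealed class PublishIdempotencyGuard(TimeSpan ttl)
{
    private readonly ConcurrentDictionary<string, DateTimeOffset> _seen = new();

    // Returns true only for the first request with this key within the TTL window.
    public bool TryBegin(string idempotencyKey, string tenantId, string path, string body)
    {
        var bodyHash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(body)));
        var key = $"{tenantId}|{path}|{bodyHash}|{idempotencyKey}";

        // Lazily drop expired entries before checking for a duplicate.
        foreach (var entry in _seen)
            if (entry.Value < DateTimeOffset.UtcNow)
                _seen.TryRemove(entry.Key, out _);

        return _seen.TryAdd(key, DateTimeOffset.UtcNow.Add(ttl));
    }
}
```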
🚀 Deployability & Operability¶
- One container per service; health probes /health/live and /health/ready; startup probe for Registry migrations.
- Config via ECS itself (bootstrapped defaults from environment); secrets via KeyVault/Key Management.
- Dashboards: golden signals per service; propagation lag, cache hits, publish failure rate.
- Policies as code: policy repo with CI validation; signed artifacts promoted with environment gates.
✅ Solution Architect Notes¶
- Prefer NHibernate in Registry for aggregate invariants; use Dapper inside Adapter workers for raw mapping speed.
- Default bus: Azure Service Bus; keep RabbitMQ option in template flags.
- SDKs ship with long‑poll by default; enable WS later behind a feature flag.
- Hard rule: Adapters must use public APIs (no private shortcuts) to keep invariants centralized in Registry.
- Next, solidify OpenAPI/gRPC contracts and CloudEvents schemas to unblock parallel implementation.
API Gateway & AuthN/AuthZ — Edge Design for ECS¶
The ECS edge enforces identity, tenant isolation, and traffic governance before requests reach core services. This section defines the Envoy/YARP gateway design, OIDC flows, scope/role model, tenant routing, and rate limiting used by:
- Config Studio (SPA)
- Public ECS APIs/SDKs (REST/gRPC)
- Service-to-service (internal)
- Adapters & webhooks
Outcomes: Zero‑trust ingress, edition-aware routing, consistent scopes, and predictable throttling per tenant/plan.
Edge Components & Trust Boundaries¶
flowchart LR
subgraph PublicZone[Public Zone]
Client[Browsers/SDKs/CLI]
WAF[WAF/DoS Shield]
Envoy[Envoy Gateway]
end
subgraph ServiceZone[Service Mesh / Internal Zone]
YARP[YARP Internal Gateway]
AuthZ[Ext AuthZ/Policy PDP]
API[Config API]
Policy[Policy Engine]
Audit[Audit Service]
Events[Event Publisher]
end
IdP[(OIDC Provider)]
JWKS[(JWKS Cache)]
RLS[(Rate Limit Service)]
Client --> WAF --> Envoy
Envoy -->|JWT verify| JWKS
Envoy -->|ext_authz| AuthZ
Envoy -->|quota| RLS
Envoy --> YARP
YARP --> API
YARP --> Policy
YARP --> Audit
Envoy <-->|OAuth/OIDC| IdP
Patterns
- Envoy is the internet-facing gateway (JWT, mTLS upstream, rate limit, ext_authz).
- YARP is the east–west/internal router (BFF for Studio, granular routing, canary, sticky reads).
- AuthZ PDP centralizes fine-grained decisions (scope→action→resource→tenant) using PDP/OPA or Policy Engine APIs.
OIDC Flows & Token Types¶
| Client/Use Case | Grant / Flow | Token Audience (aud) | Notes |
|---|---|---|---|
| Config Studio (SPA) | Auth Code + PKCE | ecs.api | Implicit denied; refresh via offline_access (rotating refresh). |
| Machine-to-Machine (services) | Client Credentials | ecs.api | Used by backend jobs, adapters, CI/CD. |
| SDK in customer service | Token Exchange (RFC 8693) or Client Credentials | ecs.api | Exchange SaaS identity → ECS limited token; preserves act chain. |
| Support/Break-glass (time-boxed) | Device Code / Auth Code | ecs.admin | Short TTL, extra policy gates + audit. |
| Webhooks / legacy integrators | HMAC API Key (fallback) | n/a | Signed body + timestamp; scoped to tenant + resources; rotatable. |
Sequence: SPA (Auth Code + PKCE)
sequenceDiagram
participant B as Browser (SPA)
participant G as Envoy
participant I as OIDC Provider
participant Y as YARP
participant A as Config API
B->>G: GET /studio
G-->>B: 302 → /authorize (PKCE)
B->>I: /authorize?code_challenge...
I-->>B: 302 → /callback?code=...
B->>G: /callback?code=...
G->>I: /token (code+verifier)
I-->>G: id_token + access_token(jwt) + refresh_token
G->>Y: /api/configs (Authorization: Bearer ...)
Y->>A: forward (mTLS)
A-->>Y: 200
Y-->>G: 200
G-->>B: 200 + content
Sequence: S2S (Client Credentials)
sequenceDiagram
participant Job as Worker/Adapter
participant I as OIDC Provider
participant G as Envoy
participant P as Policy Engine
participant A as Config API
Job->>I: POST /token (client_credentials, scopes)
I-->>Job: access_token(jwt)
Job->>G: API call + Bearer jwt
G->>P: ext_authz {sub, scopes, tenant_id, action, resource}
P-->>G: ALLOW (obligations: edition, rate_tier)
G->>A: forward (mTLS)
A-->>G: 200
G-->>Job: 200
Scopes, Roles & Permissions¶
Scope Catalogue (prefix ecs.)¶
| Scope | Purpose | Sample Actions |
|---|---|---|
| config.read | Read effective/declared config | GET /configs, /resolve |
| config.write | Modify config (CRUD, rollout) | POST/PUT /configs, /rollback |
| policy.read | Read policy artifacts | GET /policies |
| policy.write | Manage policies, schemas | PUT /policies, /schemas |
| audit.read | Read audit & diffs | GET /audit/* |
| snapshot.manage | Import/export snapshots | POST /snapshots/export |
| adapter.manage | Manage provider connectors | POST /adapters/* |
| tenant.admin | Tenant-level admin ops | Keys, members, editions |
Scopes are granted per tenant, with optional qualifiers: tenant:{id}, env:{name}, app:{id}, region:{code}.
Role→Scope Mapping (default policy)¶
| Role | Scopes |
|---|---|
| tenant-reader | config.read, audit.read |
| tenant-editor | config.read, config.write, audit.read |
| tenant-admin | All of the above + policy.read, policy.write, snapshot.manage, adapter.manage, tenant.admin |
| platform-admin | Cross-tenant (requires x-tenant-admin=true claim + break-glass policy) |
| support-operator | Time-boxed, read-most, write via approval policy |
Claims required on JWT
- sub, iss, aud, exp, iat
- tenant_id (or a tenants array for delegated tools)
- edition_id, env, plan_tier
- scopes (space-separated)
- Optional: act (actor chain), delegated=true, region, app_id
Decision rule: ALLOW = jwt.valid ∧ aud∈{ecs.api,ecs.admin} ∧ scope→action ∧ resource.scope.includes(tenant_id/env/app) ∧ edition.allows(action)
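A direct transcription of that decision rule into C#, as a sketch; the record shapes and `EdgeAuthorization` helper are illustrative, while the claim names and audiences come from the tables above.

```csharp
// ALLOW = jwt.valid ∧ aud ∈ {ecs.api, ecs.admin} ∧ scope→action
//         ∧ resource scope includes tenant/env/app ∧ edition allows the action.
public sealed record EdgeAuthContext(
    bool JwtValid,
    string Audience,
    IReadOnlySet<string> Scopes,
    string TenantId,
    string Environment,
    string AppId);

public sealed record ResourceScope(string TenantId, string? Environment, string? AppId);

public static class EdgeAuthorization
{
    private static readonly HashSet<string> AllowedAudiences = new() { "ecs.api", "ecs.admin" };

    public static bool Allow(
        EdgeAuthContext jwt,
        string requiredScope,           // scope mapped from the requested action
        ResourceScope resource,         // tenant/env/app the resource belongs to
        Func<string, bool> editionAllows)
    {
        return jwt.JwtValid
            && AllowedAudiences.Contains(jwt.Audience)
            && jwt.Scopes.Contains(requiredScope)
            && resource.TenantId == jwt.TenantId
            && (resource.Environment is null || resource.Environment == jwt.Environment)
            && (resource.AppId is null || resource.AppId == jwt.AppId)
            && editionAllows(requiredScope);
    }
}
```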
Tenant & Edition Routing¶
Resolution order (first hit wins):
1. Host-based: {tenant}.ecs.connectsoft.cloud → tenant_id={tenant}
2. Header: X-Tenant-Id: {tenant_id} (required for M2M/SDK if no host binding)
3. Path prefix: /t/{tenant_id}/... (CLI & bulk ops)
Validation
- Verify tenant exists and is active in the Tenant Registry.
- Map edition and plan tier to policy & rate tiers.
- Enforce data residency: route to the regional cluster (EU/US/APAC).
- Attach routing headers: x-tenant-id, x-edition-id, x-plan-tier, x-region.
Envoy route (illustrative)
- match: { prefix: "/api/" }
request_headers_to_add:
- header: { key: x-tenant-id, value: "%REQ(X-Tenant-Id)%" }
typed_per_filter_config:
envoy.filters.http.jwt_authn:
requirement_name: ecs_jwt
route:
cluster: ecs-region-%DYNAMIC_REGION_FROM_TENANT%
YARP internal routes (illustrative)
{
"Routes": [
{ "RouteId": "cfg", "Match": { "Path": "/api/configs/{**catch}" }, "ClusterId": "config-api" },
{ "RouteId": "pol", "Match": { "Path": "/api/policies/{**catch}" }, "ClusterId": "policy-api" }
],
"Clusters": {
"config-api": { "Destinations": { "d1": { "Address": "https://config-api.svc.cluster.local" } } },
"policy-api": { "Destinations": { "d1": { "Address": "https://policy-api.svc.cluster.local" } } }
}
}
Rate Limiting & Quotas¶
Algorithms: Token Bucket at Envoy (global), local leaky-bucket per worker; descriptors include tenant_id, scope, route, plan_tier.
Default Tiers (edition-aware)¶
| Edition | Read RPS (burst) | Write RPS (burst) | Events/min | Webhook deliveries/min | Notes |
|---|---|---|---|---|---|
| Starter | 150 (600) | 10 (30) | 600 | 120 | Best-effort propagation |
| Pro | 600 (2,400) | 40 (120) | 3,000 | 600 | Priority queueing |
| Enterprise | 2,000 (8,000) | 120 (360) | 12,000 | 2,400 | Dedicated partitions, regional failover priority |
429 Behavior
- Retry-After header with bucket ETA.
- Audit a RateLimitExceeded event (tenant-scoped).
- SDKs respect jittered exponential backoff and honor Retry-After.
Envoy RLS descriptors
domain: ecs
descriptors:
- key: tenant_id
descriptors:
- key: scope
rate_limit:
unit: second
requests_per_unit: 600 # Pro read default
Security Controls at Edge¶
- JWT verification at Envoy (per-tenant JWKS cache, kid pinning, TTL ≤ 10m).
- ext_authz callout to PDP for ABAC/RBAC decisions + obligations (edition, plan).
- mTLS between Envoy ↔ YARP ↔ services (SPIFFE/SPIRE or workload identity).
- CORS locked to Studio & allowed origins (configurable per tenant).
- HMAC API keys for webhooks: header X-ECS-Signature: sha256=... over canonical payload + timestamp (see the verification sketch after this list).
- WAF/DoS: IP reputation, geo-fencing by residency, request size caps, JSON schema guard on public write endpoints.
- Key rotation: JWKS rollover, client secret rotation, API key rotation (max 90d).
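A sketch of the webhook signature check referenced above. The exact canonicalization (what is concatenated, which separator, which encoding) is an assumption to be pinned down in the webhook contract; only the header name and sha256= prefix come from this section.

```csharp
using System.Security.Cryptography;
using System.Text;

// HMAC-SHA256 over "timestamp.payload", hex-encoded, compared in constant time.
public static class WebhookSignature
{
    public static bool IsValid(string secret, string timestamp, string payload, string signatureHeader)
    {
        // Header format assumed: "sha256=<hex digest>"
        if (!signatureHeader.StartsWith("sha256=", StringComparison.OrdinalIgnoreCase))
            return false;

        var expectedHex = signatureHeader["sha256=".Length..];
        var message = Encoding.UTF8.GetBytes($"{timestamp}.{payload}");
        var computed = HMACSHA256.HashData(Encoding.UTF8.GetBytes(secret), message);

        try
        {
            return CryptographicOperations.FixedTimeEquals(
                computed, Convert.FromHexString(expectedHex));
        }
        catch (FormatException)
        {
            return false;   // malformed hex in the header
        }
    }
}
```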
Error Model & Observability¶
Errors
- 401 invalid/expired token → WWW-Authenticate: Bearer error="invalid_token".
- 403 valid token but insufficient scope/tenant.
- 429 throttled, with Retry-After.
- 400 schema violations; 415 content-type errors.
Problem Details (RFC 7807): return type, title, detail, traceId, tenant_id.
Telemetry
- OTEL spans at Envoy & YARP; propagate traceparent and x-tenant-id.
- Metrics: requests_total{tenant,route,scope}, authz_denied_total, jwks_refresh_seconds, 429_total.
- Logs: structured and redacted; audit who/what/when/why for authz decisions.
Configuration-as-Code (GitOps)¶
- Envoy & YARP configs templated, per environment & region; changes gated via PR + canary.
- Policy bundles versioned (OPA or Policy Engine exports).
- Rate-limit tiers read from ECS system config (edition-aware) with hot reload.
Reference Implementations¶
Envoy (JWT + ext_authz + RLS excerpt)¶
http_filters:
- name: envoy.filters.http.jwt_authn
typed_config:
providers:
ecs:
issuer: "https://id.connectsoft.cloud"
audiences: ["ecs.api","ecs.admin"]
remote_jwks: { http_uri: { uri: "https://id.connectsoft.cloud/.well-known/jwks.json", cluster: idp }, cache_duration: 600s }
rules:
- match: { prefix: "/api/" }
requires: { provider_name: "ecs" }
- name: envoy.filters.http.ext_authz
typed_config:
grpc_service: { envoy_grpc: { cluster_name: pdp } }
transport_api_version: V3
failure_mode_allow: false
- name: envoy.filters.http.ratelimit
typed_config:
domain: "ecs"
rate_limit_service: { grpc_service: { envoy_grpc: { cluster_name: rls } } }
YARP (authZ pass-through + canary)¶
{
"ReverseProxy": {
"Transforms": [
{ "RequestHeaderOriginalHost": "true" },
{ "X-Forwarded": "Append" }
],
"SessionAffinity": { "Policy": "Cookie", "AffinityKeyName": ".ecs.aff" },
"HealthChecks": { "Passive": { "Enabled": true } }
}
}
SDK & Client Guidance (Edge Contracts)¶
- Send tenant: prefer host-based routing; otherwise send X-Tenant-Id.
- Set audience: aud=ecs.api; request only the necessary scopes.
- Respect 429: back off with jitter and honor Retry-After (see the sketch below).
- Propagate trace: include traceparent for cross-service correlation.
- Cache: use ETag/If-None-Match on read endpoints.
Operational Runbook (Edge)¶
- Rollover JWKS: stage new keys, overlap 24h, monitor jwt_authn_failed.
- Policy hotfix: push PDP bundle; verify no regression in authz_denied_total.
- Canary: 5% traffic to new Envoy/YARP images; roll forward after 30m of steady SLOs.
- Tenant cutover (region move): drain + update registry → immediate routing update.
- Incident: spike in 429 → inspect descriptor stats; temporarily bump burst for the affected tenant via an override CRD.
Solution Architect Notes¶
- Decision pending: PDP choice (OPA sidecar vs. centralized Policy Engine API). Prototype both; target < 5 ms p95 decision time.
- Token exchange: add formal RFC 8693 support to the IdP for SDK delegation chains (act claim preservation).
- Multi‑IdP federation: map external AAD/Okta groups → ECS roles via claims transformation at the IdP.
- Per‑tenant custom domains: support config.{customer}.com via TLS cert automation (ACME) and SNI routing.
- Quota analytics: expose a per‑tenant usage API (current window, limit, projected exhaustion) to Studio.
This design enables secure, tenant-aware, edition-governed access with predictable performance and operational clarity from the very first release.
Public REST API (OpenAPI) – endpoints, DTOs, error model, idempotency, pagination, filtering¶
Scope & positioning¶
This section specifies the public, multi-tenant REST API for ECS v1, including resource model, endpoint surface, DTOs, error envelope, and cross‑cutting behaviors (auth, idempotency, pagination, filtering, ETags). It is implementation‑ready for Engineering/SDK agents and aligns with Gateway/Auth design already defined.
Principles
- Versioned URIs (/api/v1/...) and semantic versions in the x-api-version response header.
- Tenant‑scoped by token (preferred), with optional admin paths that accept {tenantId}.
- OAuth2/OIDC (JWT Bearer) with scopes: ecs.read, ecs.write, ecs.admin.
- Problem Details (RFC 7807) with ECS extensions for all non‑2xx results.
- ETag + If‑Match for optimistic concurrency on mutable resources.
- Idempotency-Key for POSTs that create or trigger side effects.
- Cursor pagination, stable sorting, safe filtering DSL.
- Observability-first: traceparent, x-correlation-id, x-tenant-id (echo), x-request-id.
Core resource model¶
| Resource | Purpose | Key fields |
|---|---|---|
| ConfigSet | Logical configuration package (name, app, env targeting, tags) | id, name, appId, labels[], status, createdAt, updatedAt, etag |
| ConfigItem | Key/value (JSON/YAML) entries within a ConfigSet | key, value, contentType, isSecret, labels[], etag |
| Snapshot | Immutable, signed version of a ConfigSet | id, configSetId, version, createdBy, hash, note |
| Deployment | Promotion of a Snapshot to an environment/segment | id, snapshotId, environment, status, policyEval, startedAt, completedAt |
| Policy | Targeting, validation, transform rules | id, kind, expression, enabled |
| Refresh | Push/notify clients to reload (per set/tenant/app) | id, scope, status |
Endpoint surface (v1)¶
All paths are relative to /api/v1. Admin endpoints are prefixed with /admin/tenants/{tenantId} when cross‑tenant ops are required.
Config sets¶
| Method | Path | Scope | Notes |
|---|---|---|---|
| POST | /config-sets | ecs.write | Create (idempotent via Idempotency-Key) |
| GET | /config-sets | ecs.read | List with cursor pagination, filtering, sorting |
| GET | /config-sets/{setId} | ecs.read | Retrieve |
| PATCH | /config-sets/{setId} | ecs.write | Partial update (If‑Match required) |
| DELETE | /config-sets/{setId} | ecs.write | Soft delete (If‑Match required) |
Items (within a set)¶
| Method | Path | Scope | Notes |
|---|---|---|---|
| PUT | /config-sets/{setId}/items/{key} | ecs.write | Upsert single (If‑Match optional on update) |
| GET | /config-sets/{setId}/items/{key} | ecs.read | Get single |
| DELETE | /config-sets/{setId}/items/{key} | ecs.write | Delete single (If‑Match required) |
| POST | /config-sets/{setId}/items:batch | ecs.write | Bulk upsert/delete; idempotent with key |
| GET | /config-sets/{setId}/items | ecs.read | List/filter keys; supports Accept: application/yaml |
Snapshots & diffs¶
| Method | Path | Scope | Notes |
|---|---|---|---|
| POST | /config-sets/{setId}/snapshots | ecs.write | Create snapshot (idempotent via content hash) |
| GET | /config-sets/{setId}/snapshots | ecs.read | List versions (cursor) |
| GET | /config-sets/{setId}/snapshots/{snapId} | ecs.read | Get snapshot metadata |
| GET | /config-sets/{setId}/snapshots/{snapId}/content | ecs.read | Materialized config (JSON/YAML) |
| POST | /config-sets/{setId}/diff | ecs.read | Compute diff of two snapshots or the working set |
Deployments & refresh¶
| Method | Path | Scope | Notes |
|---|---|---|---|
| POST | /deployments | ecs.write | Deploy a snapshot to a target; idempotent |
| GET | /deployments | ecs.read | List/filter by set, env, status |
| GET | /deployments/{deploymentId} | ecs.read | Status, policy results, logs cursor |
| POST | /refresh | ecs.write | Trigger refresh notifications (idempotent) |
Policies¶
| Method | Path | Scope | Notes |
|---|---|---|---|
| POST | /config-sets/{setId}/policies | ecs.write | Create policy |
| GET | /config-sets/{setId}/policies | ecs.read | List |
| GET | /config-sets/{setId}/policies/{policyId} | ecs.read | Get |
| PUT | /config-sets/{setId}/policies/{policyId} | ecs.write | Replace (If‑Match) |
| DELETE | /config-sets/{setId}/policies/{policyId} | ecs.write | Delete (If‑Match) |
Cross‑cutting headers & behaviors¶
| Header | Direction | Purpose |
|---|---|---|
| Authorization: Bearer <jwt> | In | OIDC; contains sub, tid (tenant), scp (scopes) |
| Idempotency-Key | In | UUIDv4 per create/trigger; 24h window, per tenant+route+body hash |
| If-Match / ETag | In/Out | Concurrency for updates/deletes; strong ETags on representations |
| traceparent | In | W3C tracing; echoed to spans |
| x-correlation-id | In/Out | Client-provided; echoed on response and in logs |
| x-api-version | Out | Semantic API implementation version |
| RateLimit-* | Out | From gateway (limits & reset) |
DTOs (canonical schemas)¶
openapi: 3.1.0
info:
title: ConnectSoft ECS Public API
version: 1.0.0
servers:
- url: https://api.connectsoft.com/ecs/api/v1
components:
securitySchemes:
oauth2:
type: oauth2
flows:
clientCredentials:
tokenUrl: https://auth.connectsoft.com/oauth2/token
scopes:
ecs.read: Read ECS resources
ecs.write: Create/update ECS resources
ecs.admin: Cross-tenant administration
schemas:
ConfigSet:
type: object
required: [id, name, appId, status, createdAt, updatedAt, etag]
properties:
id: { type: string, format: uuid }
name: { type: string, maxLength: 120 }
appId: { type: string }
labels: { type: array, items: { type: string, maxLength: 64 } }
status: { type: string, enum: [Active, Archived] }
description: { type: string, maxLength: 1024 }
createdAt: { type: string, format: date-time }
updatedAt: { type: string, format: date-time }
etag: { type: string }
ConfigItem:
type: object
required: [key, value, contentType]
properties:
key: { type: string, maxLength: 256, pattern: "^[A-Za-z0-9:\\._\\-/]+$" }
value: { oneOf: [ { type: object }, { type: array }, { type: string }, { type: number }, { type: boolean }, { type: "null" } ] }
contentType: { type: string, enum: [application/json, application/yaml, text/plain] }
isSecret: { type: boolean, default: false }
labels: { type: array, items: { type: string } }
etag: { type: string }
Snapshot:
type: object
required: [id, configSetId, version, hash, createdAt]
properties:
id: { type: string, format: uuid }
configSetId: { type: string, format: uuid }
version: { type: string, pattern: "^v\\d+\\.\\d+\\.\\d+$" }
note: { type: string, maxLength: 512 }
hash: { type: string, description: "SHA-256 of materialized content" }
createdBy: { type: string }
createdAt: { type: string, format: date-time }
Deployment:
type: object
required: [id, snapshotId, environment, status, startedAt]
properties:
id: { type: string, format: uuid }
snapshotId: { type: string, format: uuid }
environment: { type: string, enum: [dev, test, staging, prod] }
status: { type: string, enum: [Queued, InProgress, Succeeded, Failed, Cancelled] }
policyEval: { type: object, additionalProperties: true }
startedAt: { type: string, format: date-time }
completedAt: { type: string, format: date-time, nullable: true }
Problem:
type: object
description: RFC 7807 with ECS extensions
required: [type, title, status, traceId, code]
properties:
type: { type: string, format: uri }
title: { type: string }
status: { type: integer, minimum: 100, maximum: 599 }
detail: { type: string }
instance: { type: string }
code: { type: string, description: "Stable machine error code" }
traceId: { type: string }
tenantId: { type: string }
violations:
type: array
items: { type: object, properties: { field: {type: string}, message: {type: string}, code: {type: string} } }
ListResponse:
type: object
properties:
items: { type: array, items: { } } # overridden per path via allOf
nextCursor: { type: string, nullable: true }
total: { type: integer, description: "Optional total when cheap" }
Representative paths (excerpt)¶
paths:
/config-sets:
get:
security: [{ oauth2: [ecs.read] }]
parameters:
- { in: query, name: cursor, schema: { type: string } }
- { in: query, name: limit, schema: { type: integer, minimum: 1, maximum: 200, default: 50 } }
- { in: query, name: sort, schema: { type: string, enum: [name, createdAt, updatedAt] } }
- { in: query, name: order, schema: { type: string, enum: [asc, desc], default: asc } }
- { in: query, name: filter, schema: { type: string, description: "See Filtering DSL" } }
responses:
"200":
description: OK
content:
application/json:
schema:
allOf:
- $ref: "#/components/schemas/ListResponse"
- type: object
properties:
items:
type: array
items: { $ref: "#/components/schemas/ConfigSet" }
default:
description: Error
content: { application/problem+json: { schema: { $ref: "#/components/schemas/Problem" } } }
post:
security: [{ oauth2: [ecs.write] }]
requestBody:
required: true
content:
application/json:
schema:
type: object
required: [name, appId]
properties:
name: { type: string }
appId: { type: string }
description: { type: string }
labels: { type: array, items: { type: string } }
parameters:
- { in: header, name: Idempotency-Key, required: true, schema: { type: string, format: uuid } }
responses:
"201": { description: Created, headers: { ETag: { schema: { type: string } } }, content: { application/json: { schema: { $ref: "#/components/schemas/ConfigSet" } } } }
"409": { description: Conflict, content: { application/problem+json: { schema: { $ref: "#/components/schemas/Problem" } } } }
/config-sets/{setId}:
get:
security: [{ oauth2: [ecs.read] }]
parameters: [ { in: path, name: setId, required: true, schema: { type: string, format: uuid } } ]
responses: { "200": { content: { application/json: { schema: { $ref: "#/components/schemas/ConfigSet" } } } }, "404": { content: { application/problem+json: { schema: { $ref: "#/components/schemas/Problem" } } } } }
patch:
security: [{ oauth2: [ecs.write] }]
parameters:
- { in: path, name: setId, required: true, schema: { type: string, format: uuid } }
- { in: header, name: If-Match, required: true, schema: { type: string } }
requestBody:
content:
application/merge-patch+json:
schema: { type: object, properties: { description: {type: string}, labels: { type: array, items: {type: string} }, status: {type: string, enum: [Active, Archived]} } }
responses:
"200": { headers: { ETag: { schema: { type: string } } }, content: { application/json: { schema: { $ref: "#/components/schemas/ConfigSet" } } } }
"412": { description: Precondition Failed, content: { application/problem+json: { schema: { $ref: "#/components/schemas/Problem" } } } }
/config-sets/{setId}/items:batch:
post:
security: [{ oauth2: [ecs.write] }]
parameters:
- { in: path, name: setId, required: true, schema: { type: string, format: uuid } }
- { in: header, name: Idempotency-Key, required: true, schema: { type: string, format: uuid } }
requestBody:
content:
application/json:
schema:
type: object
properties:
upserts: { type: array, items: { $ref: "#/components/schemas/ConfigItem" } }
deletes: { type: array, items: { type: string, description: "keys to delete" } }
responses:
"200": { description: Applied, content: { application/json: { schema: { type: object, properties: { upserted: {type: integer}, deleted: {type: integer} } } } } }
Error model (Problem Details)¶
Media type: application/problem+json
Base fields: type, title, status, detail, instance
ECS extensions:
- code – stable machine code (e.g., ECS.CONFLICT.ETAG_MISMATCH, ECS.VALIDATION.FAILED)
- traceId – correlates with OTEL traces
- tenantId
- violations[] – field‑level errors (field, message, code)
Examples
- 409 Conflict (duplicate name): code=ECS.CONFLICT.DUPLICATE_NAME
- 412 Precondition Failed (ETag): code=ECS.CONFLICT.ETAG_MISMATCH
- 429 Too Many Requests (rate limit): code=ECS.RATE_LIMITED with RateLimit-* headers
Idempotency¶
- Required on POSTs that create (/config-sets, /deployments, /refresh) or batch‑modify (items:batch) resources.
- Key scope: {tenantId}|{route}|SHA256(body); window: 24 hours (configurable).
- Behavior (a client-side sketch follows):
  - The first request persists an idempotency record with the response body, status, and headers (including ETag, Location).
  - Subsequent identical requests (same key) return the cached response with Idempotent-Replay: true.
  - A key collision with a different body hash → 409 with code=ECS.IDEMPOTENCY.BODY_MISMATCH.
- Safe methods (GET/HEAD) must not accept Idempotency-Key.
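A hedged client-side sketch of this contract: one UUID key per logical create, reused on retry, with detection of the Idempotent-Replay response header. The `ConfigSetClient` wrapper and generic DTO handling are illustrative; only the header names come from this section.

```csharp
using System.Net.Http.Json;

public sealed class ConfigSetClient(HttpClient http)
{
    public async Task<(T? Body, bool WasReplay)> CreateAsync<T>(
        string route, object payload, Guid idempotencyKey, CancellationToken ct = default)
    {
        using var request = new HttpRequestMessage(HttpMethod.Post, route)
        {
            Content = JsonContent.Create(payload)
        };
        // Reuse the same key when retrying the same logical operation.
        request.Headers.Add("Idempotency-Key", idempotencyKey.ToString());

        using var response = await http.SendAsync(request, ct);
        response.EnsureSuccessStatusCode();

        // Server marks replays of a previously completed request.
        var wasReplay = response.Headers.TryGetValues("Idempotent-Replay", out var values)
                        && values.Contains("true");
        return (await response.Content.ReadFromJsonAsync<T>(cancellationToken: ct), wasReplay);
    }
}
```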
Pagination, sorting & filtering¶
Cursor pagination (default)¶
- Query: cursor, limit (1–200; default 50).
- Response: nextCursor (opaque); null marks the last page.
- Optional total when cheap to compute; otherwise omitted.
Sorting¶
- sort from a per‑resource whitelist (e.g., name | createdAt | updatedAt).
- order: asc | desc (default asc).
Filtering DSL (simple, safe)¶
- Single filter parameter using a constrained expression language:
  - Grammar: expr := field op value; op := eq|ne|gt|lt|ge|le|in|like
  - Conjunctions with and, or; parentheses supported.
  - Values URL‑encoded; strings in single quotes.
- Examples:
  - filter=appId eq 'billing-service' and status eq 'Active'
  - filter=labels in ('prod','blue')
  - filter=updatedAt gt '2025-08-01T00:00:00Z'
- Field allow‑list per resource to prevent injection (see the validation sketch below).
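A minimal sketch of allow-list enforcement for this DSL, under the assumption that each clause has the shape "field op value". The `FilterGuard` helper is illustrative; a production implementation would build a proper AST and handle parentheses and and/or precedence rather than stripping them.

```csharp
using System.Text.RegularExpressions;

public static class FilterGuard
{
    private static readonly HashSet<string> Operators =
        new(StringComparer.OrdinalIgnoreCase) { "eq", "ne", "gt", "lt", "ge", "le", "in", "like" };

    private static readonly Regex Clause = new(
        @"^\s*(?<field>[A-Za-z][A-Za-z0-9_.]*)\s+(?<op>\w+)\s+(?<value>.+?)\s*$",
        RegexOptions.Compiled);

    public static bool IsAllowed(string filter, IReadOnlySet<string> allowedFields)
    {
        // Split on and/or; parentheses are stripped for this simplified check.
        var clauses = Regex.Split(filter.Replace("(", "").Replace(")", ""),
                                  @"\s+(?:and|or)\s+", RegexOptions.IgnoreCase);

        foreach (var clause in clauses)
        {
            var match = Clause.Match(clause);
            if (!match.Success) return false;
            if (!allowedFields.Contains(match.Groups["field"].Value)) return false;
            if (!Operators.Contains(match.Groups["op"].Value)) return false;
        }
        return true;
    }
}
```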
Typical client flow (sequence)¶
sequenceDiagram
participant Cli as Client
participant GW as API Gateway
participant API as ECS REST API
participant PE as Policy Engine
participant RO as Refresh Orchestrator
Cli->>GW: POST /config-sets (Idempotency-Key)
GW->>API: JWT(scopes=ecs.write, tid=...) + headers
API-->>Cli: 201 ConfigSet (ETag)
Cli->>GW: POST /config-sets/{setId}/items:batch (Idempotency-Key)
GW->>API: Upserts & deletes
API-->>Cli: 200 summary
Cli->>GW: POST /config-sets/{setId}/snapshots (Idempotency-Key)
GW->>API: create snapshot
API-->>Cli: 201 Snapshot
Cli->>GW: POST /deployments (Idempotency-Key)
GW->>API: deploy snapshot to prod
API->>PE: Evaluate policies
API-->>Cli: 202 Deployment accepted
Cli->>GW: POST /refresh (Idempotency-Key)
GW->>RO: emit refresh signal
RO-->>Cli: 202 Accepted
Security & scopes quick map¶
| Operation | Scope(s) |
|---|---|
| Read any resource | ecs.read |
| Create/update/delete set/items/snapshots/deployments/policies | ecs.write |
| Cross‑tenant admin (admin paths) | ecs.admin |
Tenant is resolved from the JWT tid claim; the server rejects any cross‑tenant attempts on non‑admin routes.
SDK alignment (high‑level)¶
- .NET/TypeScript SDKs will be generated from this OpenAPI, exposing:
  - EcsClient with typed methods (createConfigSetAsync, listConfigSetsAsync, …)
  - Optional resilience policies (retries on 409/412 with ETag refresh).
  - A built‑in Idempotency-Key generator and replay handling.
  - Paged iterators (for await (const set of client.configSets.listAll(filter))).
Solution Architect Notes¶
- Filtering DSL: accept this simple grammar now; revisit OData/RSQL compatibility later if needed.
- Idempotency window: 24h proposed; confirm compliance requirements (auditability vs. storage pressure).
- Max page size: 200 proposed; validate against the expected cardinality of items per set.
- Secrets: isSecret=true values are write‑only; reads must return masked values; align with provider capabilities.
- Multi‑format content: support YAML via Accept on content endpoints; default to JSON elsewhere.
- Admin paths: include /admin/tenants/{tenantId}/... only for ECS Ops/partners; hide from standard tenants.
Ready‑to‑build artifacts¶
- OpenAPI 3.1 file (expand from excerpt).
- Contract tests (SDK acceptance):
- Idempotency replay
- ETag/If‑Match conflict
- Cursor traversal
- Filter grammar parsing & allow‑list enforcement
- Gateway route + scope policy generation from tags/paths.
gRPC Contracts — low‑latency reads & streaming refresh channels¶
Scope & goals¶
This section defines gRPC service contracts for low‑latency SDK/agent scenarios and push‑based refresh. It complements the REST API by optimizing the read hot path and near‑real‑time change delivery with backpressure, resumability, and idempotent semantics.
Outcomes
- Unary Resolve with sub‑10 ms in‑cluster latency (cached).
- Server/bidi Refresh streams with at‑least‑once delivery and client ACKs.
- Error details via google.rpc.* for actionable retries/backoff.
- Strong multi‑tenant isolation via metadata + request fields.
Transport & security assumptions¶
- gRPC over HTTP/2, TLS required; mTLS for service‑to‑service.
- OIDC JWT in authorization: Bearer <jwt> metadata; audience ecs.api.
- Per‑call deadline required from clients; server enforces sane maxima (e.g., 5 s for unary, 60 min for streams).
- Tenant context required via x-tenant-id metadata (validated against the JWT) and echoed in responses.
- OTEL propagation via grpc-trace-bin / traceparent metadata.
Packages & options¶
syntax = "proto3";
package connectsoft.ecs.v1;
option csharp_namespace = "ConnectSoft.Ecs.V1";
option go_package = "github.com/connectsoft/ecs/api/v1;ecsv1";
import "google/protobuf/any.proto";
import "google/protobuf/struct.proto";
import "google/protobuf/timestamp.proto";
import "google/rpc/error_details.proto";
Services & RPCs (IDL)¶
// Low-latency, cached resolution of effective configuration.
service ResolveService {
// Resolve a single key or a set path; supports conditional fetch with ETag.
rpc Resolve(ResolveRequest) returns (ResolveResponse);
// Batch resolve multiple keys/paths in one RPC (atomic per key).
rpc ResolveBatch(ResolveBatchRequest) returns (ResolveBatchResponse);
// Enumerate keys under a path with server-side paging.
rpc ListKeys(ListKeysRequest) returns (ListKeysResponse);
}
// Push notifications for changes; supports server-stream and bidi with ACKs.
service RefreshChannel {
// Simple server-stream subscription; ACKs are implicit by flow control.
rpc Subscribe(Subscription) returns (stream RefreshEvent);
// Resumable, at-least-once delivery with explicit ACKs on the same stream.
rpc SubscribeWithAck(stream RefreshClientMessage)
returns (stream RefreshServerMessage);
}
// (Optional/internal) admin operations for agents and adapters.
service InternalAdmin {
rpc SaveDraft(SaveDraftRequest) returns (SaveDraftResponse);
rpc Publish(PublishRequest) returns (PublishResponse);
rpc Rollback(RollbackRequest) returns (RollbackResponse);
}
Messages (core)¶
// Common context for resolution.
message Context {
string environment = 1; // dev|test|staging|prod
string app_id = 2; // logical application id
string service = 3; // optional microservice id
string instance = 4; // optional instance id
string edition_id = 5; // optional, overrides tenant default
map<string,string> labels = 10;// optional targeting labels
}
message ResolveRequest {
string path = 1; // e.g., "apps/billing/db/connectionString"
Context context = 2;
string if_none_match_etag = 3; // conditional fetch
// If version unset, server uses "latest".
string version = 4; // semantic or snapshot id
}
message ResolvedValue {
google.protobuf.Value value = 1; // JSON value
string content_type = 2; // application/json|yaml|text/plain
}
message ResolveResponse {
bool not_modified = 1; // true when ETag matches
string etag = 2; // strong ETag of resolved content
ResolvedValue resolved = 3; // omitted when not_modified=true
string provenance_version = 4; // snapshot or semantic version
google.protobuf.Timestamp expires_at = 5; // cache hint
map<string,string> meta = 9; // server hints (e.g., "stale-while-revalidate":"2s")
}
message ResolveBatchRequest {
repeated ResolveRequest requests = 1;
}
message ResolveBatchResponse {
repeated ResolveResponse responses = 1; // 1:1 with requests (same order)
}
message ListKeysRequest {
string path = 1; // list under this prefix
Context context = 2;
int32 page_size = 3; // 1..500, default 100
string page_token = 4;// opaque
// Optional filtering on labels and name
string filter = 5; // e.g., "name like 'conn%'" or "label in ('prod','blue')"
string order_by = 6; // "name asc|desc", "updated_at desc"
}
message KeyEntry {
string key = 1;
string etag = 2;
google.protobuf.Timestamp updated_at = 3;
repeated string labels = 4;
}
message ListKeysResponse {
repeated KeyEntry items = 1;
string next_page_token = 2; // empty => end
}
Streaming refresh contracts¶
// What to watch.
message Subscription {
// If omitted, server infers tenant from metadata and defaults to latest/any env.
Context context = 1;
// Paths to watch. Supports prefix semantics.
repeated PathSelector selectors = 2;
// Receive only specific event types.
repeated EventType event_types = 3;
// Resumability: provide last committed cursor to continue after reconnect.
string resume_after_cursor = 4;
// Heartbeat period requested by client (server may clamp).
int32 heartbeat_seconds = 5;
}
message PathSelector {
string value = 1; // e.g., "apps/billing/**" or exact key
SelectorType type = 2; // EXACT|PREFIX|GLOB
}
enum SelectorType { SELECTOR_TYPE_UNSPECIFIED = 0; EXACT = 1; PREFIX = 2; GLOB = 3; }
enum EventType {
EVENT_TYPE_UNSPECIFIED = 0;
CONFIG_PUBLISHED = 1; // new immutable version published
CACHE_INVALIDATED = 2; // targeted cache purge
POLICY_UPDATED = 3; // policy/edition overlay changed
}
message RefreshEvent {
string event_id = 1; // unique id for idempotency
string cursor = 2; // monotonically increasing per-tenant offset
EventType type = 3;
string path = 4; // affected key/prefix
string version = 5; // snapshot/semantic version
string etag = 6; // new etag for resolve
google.protobuf.Timestamp time = 7;
map<string,string> meta = 8; // e.g., reason, actor, correlation id
}
message RefreshClientMessage {
oneof message {
Subscription subscribe = 1;
Ack ack = 2;
Heartbeat heartbeat = 3;
}
}
message RefreshServerMessage {
oneof message {
RefreshEvent event = 1;
Heartbeat heartbeat = 2;
Nack nack = 3; // e.g., invalid subscription; includes retry info
}
}
message Ack {
string cursor = 1; // last successfully processed cursor
}
message Nack {
string code = 1; // stable machine code, e.g., "ECS.SUBSCRIPTION.INVALID_SELECTOR"
string human_message = 2;
google.rpc.RetryInfo retry = 3; // backoff guidance
}
message Heartbeat {
google.protobuf.Timestamp time = 1;
string server_id = 2;
}
Delivery guarantees
- At-least-once: events may repeat; deduplicate by event_id or cursor (see the consumer sketch below).
- Ordering: forward ordering per tenant and selector is preserved best-effort; cross‑selector ordering is not guaranteed.
- Resumption: provide resume_after_cursor or Ack.cursor to continue after a disconnect.
- Flow control: gRPC backpressure applies. The server accommodates slow consumers with bounded buffers, then nacks with RESOURCE_EXHAUSTED when limits are exceeded.
Error model & rich details¶
- gRPC status codes + google.rpc.* details:
  - INVALID_ARGUMENT + BadRequest (schema/filter/selector errors)
  - UNAUTHENTICATED, PERMISSION_DENIED
  - NOT_FOUND, FAILED_PRECONDITION (ETag mismatch on writes in InternalAdmin)
  - RESOURCE_EXHAUSTED + QuotaFailure (rate/tenant quotas)
  - ABORTED (concurrency conflict)
  - UNAVAILABLE + RetryInfo (transient)
- Common ErrorInfo fields:
  - reason: stable code such as ECS.IDEMPOTENCY.BODY_MISMATCH
  - domain: "ecs.connectsoft.cloud"
  - metadata: tenantId, path, cursor, retryAfter
Metadata (headers/trailers)¶
Incoming (from client)
- authorization: Bearer <jwt> (required)
- x-tenant-id: <id> (required if not derivable from the host; must match the JWT)
- x-idempotency-key: <uuid> (for InternalAdmin mutating RPCs)
- traceparent / grpc-trace-bin (optional, recommended)
Outgoing (from server)
- x-tenant-id, x-edition-id, x-plan-tier (echo)
- x-cache: hit|miss|revalidated (Resolve*)
- x-etag: <etag> (Resolve*)
- Trailers may include grpc-status-details-bin for rich error details
Semantics & performance notes¶
Resolve / ResolveBatch¶
- Reads are served from Redis (read‑through) with ETag; the DB is hit only on a miss.
- not_modified=true when if_none_match_etag matches the current value.
- expires_at provides a soft TTL hint for the SDK cache; SDKs should also listen to the RefreshChannel for proactive reloads.
ListKeys¶
- Server‑validated filter (a subset of the REST DSL) with allow‑listed fields only.
- Stable ordering with opaque cursor tokens.
RefreshChannel¶
- Subscribe: the simplest server stream; rely on TCP/HTTP2 flow control, and reconnect with resume on disconnects.
- SubscribeWithAck: recommended for agents and high‑value consumers; the server commits delivery when it receives an Ack with the last cursor.
- Heartbeats are sent every 15–60 s (configurable). Clients should close and reopen the stream after 3 missed heartbeats.
Versioning & compatibility¶
- Package versioned as connectsoft.ecs.v1.
- Additive changes only (new fields with higher tags; default behavior preserved).
- SDKs must ignore unknown event types by default; reserve numeric ranges per family.
Optional .NET server (code‑first) sketch¶
[ServiceContract(Name = "ResolveService", Namespace = "connectsoft.ecs.v1")]
public interface IResolveService
{
ValueTask<ResolveResponse> Resolve(ResolveRequest request, CallContext ctx = default);
ValueTask<ResolveBatchResponse> ResolveBatch(ResolveBatchRequest request, CallContext ctx = default);
ValueTask<ListKeysResponse> ListKeys(ListKeysRequest request, CallContext ctx = default);
}
[ServiceContract(Name = "RefreshChannel", Namespace = "connectsoft.ecs.v1")]
public interface IRefreshChannel
{
IAsyncEnumerable<RefreshEvent> Subscribe(Subscription sub, CallContext ctx = default);
IAsyncEnumerable<RefreshServerMessage> SubscribeWithAck(
IAsyncEnumerable<RefreshClientMessage> messages, CallContext ctx = default);
}
SDK guidance (client side)¶
- Always set a deadline (e.g., 250–500 ms for Resolve, 2–5 s for ResolveBatch); see the retry sketch below.
- Respect RetryInfo and gRPC status codes; use exponential backoff with jitter for UNAVAILABLE / RESOURCE_EXHAUSTED.
- Maintain a local cache keyed by (tenant, context, path) with ETag; prefer the refresh stream over polling.
- Deduplicate refresh events by event_id/cursor; checkpoint the last committed cursor.
- Use per‑tenant channels for isolation and to preserve ordering.
Solution Architect Notes¶
- The ACKed bidi stream should be the default for SDKs; the simple server stream fits lightweight/mobile clients.
- Implement cursor compaction (periodic checkpoints) to bound replay windows.
- Consider compressed payloads (grpc-encoding: gzip) for large batch resolves; set max message size caps.
- Add a feature flag to emit a CloudEvents envelope on refresh events when integrating with external buses.
- Validate multi‑region behavior: the cursor is region‑scoped; cross‑region failover resets to the last replicated cursor.
This contract enables high‑performance, low‑latency config reads and robust, resumable change propagation—ready for codegen and template scaffolding.
Config Versioning & Lineage — semantic versioning, snapshotting, diffs, provenance, rollback, immutability guarantees¶
Objectives¶
Define how configuration is versioned, snapshotted, diffed, traced, and rolled back in ECS with cryptographic immutability and auditable lineage—so engineering can implement storage, APIs, and SDK behaviors consistently.
Versioning Model (SemVer + Content Addressing)¶
Concepts¶
- Draft – mutable workspace of a ConfigSet (unpublished).
- Snapshot – immutable point‑in‑time materialization of a ConfigSet after policy validation.
- Version (SemVer) – human‑friendly tag (e.g., v1.4.2) aliased to a concrete Snapshot.
- Alias – semantic pointers like latest, stable, lts, and environment pins (prod-current).
- Content Hash – SHA-256(canonical-json) of the materialized set; primary ETag and immutability anchor.
Rules¶
- SemVer (MAJOR.MINOR.PATCH):
  - MAJOR for breaking schema changes or key removal, MINOR for additive keys, PATCH for value-only changes.
  - Pre‑release allowed for staged rollouts: v1.2.0-rc.1.
- ETag = base64url(SHA-256(canonical-json)); strong validator for reads and updates.
- Canonicalization (for hashing/diff), sketched in code after this list:
  - Deterministic key ordering (UTF‑8, lexicographic).
  - No insignificant whitespace; numbers normalized; booleans/literals preserved.
  - Secrets are represented by references (not raw values) when computing the hash, to avoid exposure and to allow secret rotation without content drift (see “Secrets & Immutability”).
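A minimal sketch of the canonicalization and ETag rule, assuming System.Text.Json nodes: recursively sort object keys, serialize without insignificant whitespace, then ETag = base64url(SHA-256(bytes)). Number normalization and secret-reference substitution are omitted here for brevity; the `CanonicalJson` name is illustrative.

```csharp
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;
using System.Text.Json.Nodes;

public static class CanonicalJson
{
    public static string ComputeEtag(JsonNode materialized)
    {
        var canonical = Canonicalize(materialized)?.ToJsonString() ?? "null";
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        // base64url, no padding
        return Convert.ToBase64String(hash)
            .TrimEnd('=').Replace('+', '-').Replace('/', '_');
    }

    private static JsonNode? Canonicalize(JsonNode? node) => node switch
    {
        // Rebuild objects with keys in ordinal (lexicographic) order.
        JsonObject obj => new JsonObject(
            obj.OrderBy(p => p.Key, StringComparer.Ordinal)
               .Select(p => KeyValuePair.Create(p.Key, Canonicalize(p.Value)))),
        // Arrays keep element order; elements are canonicalized recursively.
        JsonArray arr => new JsonArray(arr.Select(Canonicalize).ToArray()),
        // Leaf values are cloned so they can be re-parented into the new tree.
        JsonValue value => JsonNode.Parse(value.ToJsonString()),
        _ => null
    };
}
```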
Snapshotting Lifecycle¶
sequenceDiagram
participant UI as Studio UI
participant API as Registry API
participant POL as Policy Engine
participant AUD as Audit Store
participant ORC as Refresh Orchestrator
UI->>API: Save Draft (mutations…)
UI->>API: POST /config-sets/{id}/snapshots (Idempotency-Key)
API->>POL: Validate schema + policy overlays
POL-->>API: OK (violations=[])
API-->>API: Materialize canonical JSON (ordered), compute SHA-256
API->>AUD: Append SnapshotCreated (hash, size, actor, source)
API-->>UI: 201 Snapshot {snapshotId, etag, materializedHash}
UI->>API: POST /deployments (optional)
API->>ORC: Emit ConfigPublished (tenant, set, version/etag)
Snapshot Invariants¶
- Write‑once: Snapshot rows are append‑only; no UPDATE/DELETE.
- Hash‑stable: identical content → identical hash → idempotent creation (the API returns 200 with the existing Snapshot).
- Schema‑valid: publish fails if JSON Schema or policy guardrails fail.
- Provenance‑rich: who/when/why/source captured (see below).
Lineage & Provenance¶
Data captured per Snapshot¶
| Field | Description |
|---|---|
| snapshotId (GUID) | Unique identity, primary key. |
| configSetId | Parent set. |
| semver | Optional tag (can move between snapshots); stored in a separate mapping. |
| hash | SHA‑256 of the canonical materialization. |
| parents[] | Array of parent snapshot IDs (for merges); single parent by default for linear history. |
| createdAt/By | UTC timestamp + normalized actor id (sub@iss). |
| source | `Studio \| API \| Adapter:{provider}` |
| reason | Human message explaining “why” (commit message). |
| policySummary | Validation/approval references. |
| sbom (optional) | Content provenance (template/rule versions) for compliance. |
Lineage graph (DAG)¶
- Default: linear chain per ConfigSet.
- Merge: allowed via “import & reconcile” workflows → multi‑parent Snapshot; the computed diff is 3‑way.
graph LR
A((S-001))-->B((S-002))
B-->C((S-003))
C-->D((S-004))
X((Feature-Branch S-00X))-->M((S-005 merge))
D-->M
Engineering may begin with linear history; merge support can be introduced as an additive feature flag.
Diff Strategy¶
Outputs¶
- Structural diff (RFC 6902 JSON Patch) for machine application.
- Human diff (grouped change list) for Studio UI, with highlights by key/area.
- Semantic diff (optional) using Schema annotations to mark breaking/additive changes.
Calculation¶
- Materialize both snapshots into canonical JSON.
- Run a structural diff producing operations: add, remove, replace, move (rare), copy (rare); test is not emitted.
- Annotate operations with:
  - Severity (Breaking/Additive/Neutral) via schema.
  - Scope (security‑sensitive vs. safe).
  - Blast radius (estimated by affected services).
API¶
- POST /config-sets/{id}/diff — body: { from: snapshotId|semver, to: snapshotId|semver }
- Returns: { patch: JsonPatch[], summary: { breaking: int, additive: int, neutral: int } }
Rollback Semantics¶
- Rollback = create a new Snapshot from historical Snapshot content; past entries are never mutated.
- Metadata:
  - reason = "rollback to S-00A (v1.3.0)"
  - rollbackOf = "S-00A"
- Events:
  - ConfigRollbackInitiated (audit)
  - ConfigPublished (normal propagation)
- Safety checks:
  - Policy re‑validation of the rollback content (schemas may have evolved).
  - Optional "force" flag to bypass non‑breaking warnings (requires tenant.admin).
- Environment pins (aliases) are updated to point to the new rollback Snapshot.
Immutability Guarantees¶
| Area | Guarantee | Mechanism |
|---|---|---|
| Snapshot content | Write‑once, tamper‑evident | Append‑only tables, content hash (ETag), optional signing (JWS) |
| Audit trail | Non‑repudiable | Append‑only log, sequence IDs, hash‑chain (optional) |
| SemVer tag | Mutable pointer | Separate mapping table; changes audited |
| Aliases/pins | Mutable pointer | Versioned alias map with audit |
| Secret values | Not embedded in materialized content | References (KMS/KeyVault path) substituted at resolve time |
Secrets & Immutability: to avoid hash drift and data exposure, Snapshots store secret references (e.g., kvref://vault/secret#version) rather than plaintext. The resolved value is injected at read time by the SDK/Resolver based on the caller’s credentials, as sketched below.
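A sketch of this secret-reference substitution at resolve time. The `ISecretStore` abstraction and `SecretReferenceResolver` class are illustrative; concrete implementations would wrap Key Vault / KMS clients, and only the kvref:// scheme comes from this section.

```csharp
public interface ISecretStore
{
    Task<string> GetSecretAsync(string vault, string name, string? version, CancellationToken ct);
}

public sealed class SecretReferenceResolver(ISecretStore secrets)
{
    public async Task<string> ResolveAsync(string value, CancellationToken ct = default)
    {
        const string prefix = "kvref://";
        if (!value.StartsWith(prefix, StringComparison.Ordinal))
            return value;                               // plain value: pass through

        // kvref://vault/secret#version  ->  vault, secret, optional version
        var reference = value[prefix.Length..];
        var hashIndex = reference.IndexOf('#');
        var version = hashIndex >= 0 ? reference[(hashIndex + 1)..] : null;
        var pathPart = hashIndex >= 0 ? reference[..hashIndex] : reference;

        var slashIndex = pathPart.IndexOf('/');
        if (slashIndex < 0)
            throw new FormatException($"Malformed secret reference: {value}");

        var vault = pathPart[..slashIndex];
        var secretName = pathPart[(slashIndex + 1)..];

        return await secrets.GetSecretAsync(vault, secretName, version, ct);
    }
}
```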
Storage & Indices (solution‑level)¶
Tables (logical):
- ConfigSet(id, tenantId, name, appId, status, createdAt, updatedAt, etagLatest)
- Snapshot(id, configSetId, hash, createdAt, createdBy, source, reason, policySummaryJson, parentsJson)
- SnapshotContent(snapshotId, contentJson, sizeBytes) — optional columnar or compressed storage
- VersionTag(configSetId, semver, snapshotId, taggedAt, taggedBy) — unique (configSetId, semver)
- Alias(configSetId, alias, snapshotId, updatedAt, updatedBy) — alias in {latest, stable, lts, dev-current, qa-current, prod-current, …}
- Audit(id, tenantId, setId?, snapshotId?, action, actor, time, attrsJson)
Indexes (suggested):
- Snapshot(configSetId, createdAt desc) for latest queries
- VersionTag(configSetId, semver) and Alias(configSetId, alias)
- Audit(tenantId, time desc); Audit(snapshotId)
Retention:
- Keep all Snapshots by default.
- Optional policy: retain N latest per set (e.g., 200) excluding those pinned by VersionTag/Alias or referenced by Deployments.
Concurrency, ETags & Idempotency¶
- Draft mutations: optimistic writes with `If-Match: <ETagDraft>`.
- Snapshot creation: idempotent by `(tenantId|configSetId|hash)`; the server returns the existing Snapshot if the hash matches (see the sketch below).
- Version tags: set/update with `If-Match` on the current mapping row; conflict → `412 Precondition Failed`.
- Deployments/Refresh: require `Idempotency-Key` (as specified in the REST cycle).
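A minimal sketch of the idempotent snapshot-creation check described above; `ISnapshotStore` and the `Snapshot` record are illustrative abstractions over the append-only tables, not the actual Registry types.
// Sketch: idempotent snapshot creation keyed by (tenantId, configSetId, contentHash).
using System;
using System.Threading.Tasks;

public record Snapshot(string Id, string TenantId, string ConfigSetId, string ContentHash, string ContentJson);

public interface ISnapshotStore
{
    Task<Snapshot?> FindByHashAsync(string tenantId, string configSetId, string contentHash);
    Task<Snapshot> AppendAsync(Snapshot snapshot); // append-only; never updates existing rows
}

public sealed class SnapshotService
{
    private readonly ISnapshotStore _store;
    public SnapshotService(ISnapshotStore store) => _store = store;

    public async Task<Snapshot> CreateAsync(string tenantId, string configSetId, string contentJson, string contentHash)
    {
        // Idempotency: replays with the same content hash return the existing Snapshot.
        var existing = await _store.FindByHashAsync(tenantId, configSetId, contentHash);
        if (existing is not null)
            return existing;

        var snapshot = new Snapshot(
            Id: $"S-{Guid.NewGuid():N}", TenantId: tenantId, ConfigSetId: configSetId,
            ContentHash: contentHash, ContentJson: contentJson);
        return await _store.AppendAsync(snapshot);
    }
}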
SDK Alignment¶
- Read Path: SDK caches by `(tenant, setId, context, path)`, using `Resolve` (gRPC) with `if_none_match_etag`. On `RefreshEvent(etag)`, the SDK fetches only when the ETag changed.
- Pinning: `Resolve` accepts `version|semver|alias`. When an alias is used, the response includes the actual snapshotId/semver for telemetry.
- Diff Tooling: SDK helper to produce a human diff from JsonPatch with path collapsing and schema annotations.
Operational Runbooks¶
Tagging & Promotion
- Tag new Snapshot: `POST /config-sets/{id}/tags { semver: "v1.4.0" }`
- Update environment pin: `PUT /config-sets/{id}/aliases/prod-current -> S-00D`
- Audit both steps and verify `ConfigPublished` propagation.
Rollback
- Locate target Snapshot in Studio (diff view vs current).
- Trigger `POST /config-sets/{id}:rollback { to: "S-00B" }`
- Monitor policy re‑validation; confirm alias pins and deployments are updated.
Forensics
- Export lineage: `GET /config-sets/{id}/lineage?format=graphml|json`
- Verify the hash chain matches SBOM and policy versions.
Example: Materialization & Hash¶
// Canonical JSON (excerpt)
{
"apps": {
"billing": {
"db": {
"connectionString": "kvref://vault/billing-db#v3",
"poolSize": 50
}
}
},
"_meta": {
"schemaVersion": "2025-08-01",
"labels": ["prod","blue"]
}
}
// ETag = base64url(SHA-256(bytes(canonical-json)))
// Secrets will be resolved by SDK/Resolver at runtime using kvref.
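A minimal sketch of the ETag computation shown in the comment above, assuming the canonical JSON string comes from the shared canonicalizer library; `SnapshotHasher` is an illustrative name.
// ETag = base64url(SHA-256(bytes(canonical-json)))
using System;
using System.Security.Cryptography;
using System.Text;

public static class SnapshotHasher
{
    public static string ComputeEtag(string canonicalJson)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonicalJson));

        // base64url: '+' -> '-', '/' -> '_', strip '=' padding.
        return Convert.ToBase64String(hash)
            .Replace('+', '-')
            .Replace('/', '_')
            .TrimEnd('=');
    }
}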
Studio UI: Version Tree & Diff UX¶
- Tree view of snapshots with badges (semver, aliases, deployments).
- Diff panel: switch between structural/semantic views, filter by severity (`breaking|additive|neutral`).
- Rollback CTA gated by policy result; requires approval if breaking changes are detected.
Solution Architect Notes¶
- Start with linear lineage (single parent); design tables to allow future DAG merges.
- Implement canonicalizer as a shared library (used by Registry & SDKs) to avoid hashing inconsistencies.
- Add optional snapshot signing (JWS) in a future cycle to strengthen tamper evidence and supply chain stories.
- Confirm secret reference design with Security and Provider Adapter teams to ensure consistent resolution across clouds.
- Define policy‐driven SemVer bumping: server can recommend version bump level based on semantic diff.
Storage Strategy — CockroachDB (MT‑Aware) & Redis Cache Topology¶
Objectives¶
- Provide a multi‑tenant, multi‑region storage design that preserves immutability for configuration versions and snapshots.
- Optimize read latency for SDKs/agents via Redis + regional replicas while keeping CockroachDB (CRDB) as the system of record.
- Enforce edition‑aware retention, partitioning, and operational recoverability (PITR, incremental backups, table‑level restore).
Logical Data Model (solution view)¶
erDiagram
TENANT ||--o{ APP : owns
TENANT ||--o{ ENVIRONMENT : scopes
APP ||--o{ NAMESPACE : groups
NAMESPACE ||--o{ CONFIG_SET : defines
CONFIG_SET ||--o{ CONFIG_VERSION : immutably_versions
CONFIG_VERSION ||--o{ SNAPSHOT : captures
CONFIG_VERSION ||--o{ DIFF : compares
CONFIG_SET ||--o{ POLICY_BINDING : governed_by
CONFIG_SET ||--o{ ROLLOUT : deployed_via
ROLLOUT ||--o{ ROLLOUT_STEP : stages
EVENT_AUDIT }o--|| TENANT : scoped
TENANT {
uuid tenant_id PK
text slug UNIQUE
text edition // "Free","Pro","Enterprise"
string crdb_region_home
}
APP {
uuid app_id PK
uuid tenant_id FK
text key // unique per tenant
text display_name
}
ENVIRONMENT {
uuid env_id PK
uuid tenant_id FK
text name // "dev","test","prod"
}
NAMESPACE {
uuid ns_id PK
uuid tenant_id FK
uuid app_id FK
text path // e.g. "payments/api"
}
CONFIG_SET {
uuid set_id PK
uuid tenant_id FK
uuid ns_id FK
uuid env_id FK
text set_key
bool is_composite
text content_hash // current head
}
CONFIG_VERSION {
uuid version_id PK
uuid tenant_id FK
uuid set_id FK
string semver
timestamptz created_at
text author
text change_summary
jsonb content // canonical compiled payload
text content_hash UNIQUE
bool is_head // convenience flag
text provenance // URI/commit ref
bool signed
bytea signature
}
SNAPSHOT {
uuid snapshot_id PK
uuid tenant_id FK
uuid version_id FK
timestamptz captured_at
jsonb content
text source // "manual","pre-rollout","post-rollout"
}
DIFF {
uuid diff_id PK
uuid tenant_id FK
uuid from_version_id FK
uuid to_version_id FK
jsonb diff_json // RFC6902/semantic diff
text strategy // "jsonpatch","semantic"
}
POLICY_BINDING {
uuid policy_id PK
uuid tenant_id FK
uuid set_id FK
jsonb rules
}
ROLLOUT {
uuid rollout_id PK
uuid tenant_id FK
uuid set_id FK
uuid target_env_id FK
text status // "planned","in-progress","complete","failed","rolled-back"
text strategy // "all-at-once","canary","wave"
}
ROLLOUT_STEP {
uuid step_id PK
uuid tenant_id FK
uuid rollout_id FK
int step_no
jsonb selector // services/pods/regions
jsonb outcome
}
EVENT_AUDIT {
uuid event_id PK
uuid tenant_id FK
text type
jsonb payload
timestamptz at
text actor
text trace_id
}
Notes
- `tenant_id` scopes every row. DTOs and queries must supply `tenant_id` (and often `env_id`) for isolation and index selectivity.
- `CONFIG_VERSION.content` is the immutable compiled document delivered to SDKs (policy already applied). Raw fragments (if used) live in internal tables or object storage and are referenced via `provenance`.
Physical Design in CockroachDB¶
Multi‑Region & Locality¶
- Database locality: `ALTER DATABASE ecs SET PRIMARY REGION <home>; ADD REGION <others>; SURVIVE REGION FAILURE;`
- Table locality:
  - Control/lookup tables: `REGIONAL BY ROW` with column `crdb_region` (derived from the tenant’s home or the set/rollout target).
  - Global reference tables (rare): `GLOBAL` (e.g., editions, feature flags) to avoid cross‑region fan‑out.
- Write routing: SDK/Studio via Gateway injects `crdb_region`/`tenant_id`; CRDB routes to the nearest leaseholder.
Keys, Sharding & Indexes¶
- Primary keys: hash‑sharded to avoid hot ranges.
  - Example: `PRIMARY KEY (tenant_id, set_id, semver) USING HASH WITH BUCKET_COUNT = 16`
- Surrogate IDs: prefer UUIDv7 for time‑ordered locality; also store `semver` for human lookup.
- Secondary indexes (all with `STORING` where helpful):
  - `CONFIG_VERSION(tenant_id, set_id, is_head DESC, created_at DESC)`
  - `CONFIG_VERSION(tenant_id, content_hash)` for fast idempotency checks.
  - `ROLLOUT(tenant_id, target_env_id, status)` for orchestration queries.
- JSONB columns (`content`, `diff_json`) gain targeted GIN indexes for frequently filtered paths (e.g., `/featureToggles/*`).
Concurrency, Immutability & Idempotency¶
- New version creation:
  - Compute `content_hash`; reject duplicates (idempotent PUT).
  - Mark the prior `is_head = false` atomically.
  - Use SERIALIZABLE transactions with small write sets (CRDB default).
- No in‑place edits of `CONFIG_VERSION.content`. Rollback = new version with `change_summary = "revert to X"` and a pointer flip.
Sample DDL (illustrative)¶
-- Multi-region setup (performed once)
ALTER DATABASE ecs SET PRIMARY REGION eu-central;
ALTER DATABASE ecs ADD REGION eu-west;
ALTER DATABASE ecs ADD REGION us-east;
ALTER DATABASE ecs SURVIVE REGION FAILURE;
-- Config versions table
CREATE TABLE ecs.config_version (
tenant_id UUID NOT NULL,
set_id UUID NOT NULL,
version_id UUID NOT NULL DEFAULT gen_random_uuid(),
semver STRING NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
author STRING NULL,
change_summary STRING NULL,
content JSONB NOT NULL,
content_hash STRING NOT NULL,
provenance STRING NULL,
is_head BOOL NOT NULL DEFAULT false,
signed BOOL NOT NULL DEFAULT false,
signature BYTES NULL,
crdb_region crdb_internal_region NOT NULL DEFAULT default_to_database_primary_region(),
CONSTRAINT pk_config_version PRIMARY KEY (tenant_id, set_id, semver) USING HASH WITH BUCKET_COUNT = 16,
UNIQUE (tenant_id, content_hash),
UNIQUE (tenant_id, set_id, version_id)
) LOCALITY REGIONAL BY ROW;
CREATE INDEX ix_config_version_head ON ecs.config_version (tenant_id, set_id, is_head DESC, created_at DESC) STORING (content_hash, version_id);
-- TTL example for audit events (see Retention section)
CREATE TABLE ecs.event_audit (
tenant_id UUID NOT NULL,
event_id UUID NOT NULL DEFAULT gen_random_uuid(),
type STRING NOT NULL,
payload JSONB NOT NULL,
at TIMESTAMPTZ NOT NULL DEFAULT now() ON UPDATE now(),
actor STRING NULL,
trace_id STRING NULL,
crdb_region crdb_internal_region NOT NULL DEFAULT default_to_database_primary_region(),
ttl_expires_at TIMESTAMPTZ NULL,
CONSTRAINT pk_event_audit PRIMARY KEY (tenant_id, at, event_id)
) LOCALITY REGIONAL BY ROW
WITH (ttl = 'on', ttl_expiration_expression = 'ttl_expires_at', ttl_job_cron = '@hourly');
Redis Cache Topology (read path acceleration)¶
Topology¶
- Redis Cluster (3+ shards per region, replicas ×1) deployed per cloud region close to ECS Gateway.
- Tenant‑pinned key hashing using hash‑tags to keep hot tenants together and enable selective scaling: `ecs:{tenantId}:{env}:{ns}:{setKey}:v{semver}` → value = compact binary (MessagePack) or gzip JSON (see the key-builder sketch below).
- SDK Near‑Cache (optional) + soft TTL to bound staleness without thundering herds.
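A small sketch of the key layout above, assuming the braces around the tenant id act as a Redis Cluster hash‑tag (so all of a tenant’s keys map to the same slot); the helper names are illustrative.
// Illustrative key builders for the cache and the invalidation channel.
public static class RedisKeys
{
    // Produces e.g. "ecs:{tnt-7d2a}:prod:billing:runtime:v1.8.0".
    public static string ConfigKey(string tenantId, string env, string ns, string setKey, string semver) =>
        $"ecs:{{{tenantId}}}:{env}:{ns}:{setKey}:v{semver}";

    // Pub/Sub channel used by the Refresh Orchestrator for scoped invalidations.
    public static string InvalidationChannel(string tenantId) =>
        $"ecs:invalidate:{tenantId}";
}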
Cache Patterns¶
| Concern | Pattern |
|---|---|
| Freshness | Pub/Sub channel ecs:invalidate:{tenantId} from Refresh Orchestrator; SDKs subscribe (websocket) or poll backoff. |
| Warm‑up | On new head: background warmers push into Redis (read‑through fallback to CRDB if miss). |
| Stampede control | Singleflight with Redis SET key lock NX PX=<ms>; losers await pub/sub invalidate. |
| Multi‑region | Writes publish region‑scoped invalidations; cross‑region mirror via replication stream (or event bus) for global tenants. |
| Security | Values include HMAC(content_hash); SDKs verify before use. |
TTL & Sizing¶
- Default TTL = 5–30s (edition‑dependent), max TTL for offline resilience (e.g., 10m).
- Memory policy: `allkeys-lru` (Free/Pro), `volatile-lru` (Enterprise, with pinning for critical keys).
- Per‑tenant quotas enforced by keyspace cardinality + prefix scanning in ops tools.
Partitioning Strategy¶
By Tenant & Region¶
- Every table keyed by `tenant_id`; the hash‑sharded PK prevents hot ranges.
- `REGIONAL BY ROW` + `crdb_region` places data near compute. Rollouts targeting `us-east` can place derived rows there.
By Time (high‑volume tables)¶
- `EVENT_AUDIT` and rollout telemetry: composite PK `(tenant_id, at, event_id)` enables time‑bounded range scans and row‑level TTL.
- Optional monthly export/compaction to object storage for long‑term archives.
Large Payloads¶
- Keep operational content (≤ 128 KB) in CRDB JSONB; store oversized artifacts (e.g., generated diffs > 1 MB) in object storage and reference them by URI in `provenance`.
Retention & Data Lifecycle¶
| Data class | Default | Edition overrides | Mechanism |
|---|---|---|---|
| Head versions (`is_head = true`) | Keep indefinitely | — | none |
| Historical config versions | 365 days | Enterprise: infinite / policy based | soft policy + archival export |
| Snapshots (pre/post rollout) | 180 days | Enterprise: 730 days | CRDB row‑level TTL on ttl_expires_at |
| Audit events, access logs | 90 days | Enterprise: 365 days | TTL + periodic export (Parquet) |
| Diff artifacts | 90 days | Pro/Ent: 180 days | TTL |
Implementation
- Set `ttl_expires_at = now() + INTERVAL 'N days'` by class & edition.
- Nightly export of expiring partitions to object storage (Parquet) before purge.
Backup & Restore (RPO/RTO)¶
Objectives¶
- RPO ≤ 5 minutes (Enterprise), ≤ 15 minutes (Pro), ≤ 24 hours (Free).
- RTO ≤ 30 minutes for tenant‑scoped table restore; ≤ 2 hours full cluster DR (Enterprise).
Strategy¶
- Cluster‑wide scheduled backups:
- Weekly full + hourly incremental to cloud bucket (regional replica buckets for locality).
- Encryption with cloud KMS; rotate keys quarterly.
- Changefeeds (optional) for external archival/analytics sinks.
- PITR: CRDB protected timestamps on the backup schedule to enable point‑in‑time restore.
- Tenant‑scoped restore: use export/restore with a `tenant_id` predicate via `AS OF SYSTEM TIME` + `SELECT INTO` staging + validated merge (playbook below).
Operational Runbook (excerpt)¶
- Incident triage: identify tenant/environment, time window, affected tables.
- Quarantine writes: toggle tenant read‑only via Policy Engine; invalidate Redis keys.
- Staging restore: create temporary DB from the latest full+incremental or PITR to the timestamp.
- Diff & verify: compare `CONFIG_VERSION` by `content_hash`; verify signatures/HMAC; rehearse on staging.
- Merge: upsert corrected rows by `(tenant_id, set_id, semver)` into prod; rebuild `is_head` where needed (transaction).
- Warm cache: repopulate Redis; emit `RefreshRequested` events.
- Close incident: re‑enable writes; attach audit and post‑mortem.
Observability & Capacity Planning¶
Key Metrics¶
- CRDB: `sql_bytes`, `ranges`, `qps`, `txn_restarts`, `kv.raft.process.logcommit.latency`, per‑region leaseholder distribution.
- Redis: `hits/misses`, `evictions`, `blocked_clients`, `latency`, `keyspace_per_tenant`.
- ECS: cache hit ratio (SDK), p95 config fetch latency, rollout convergence time.
Back‑of‑Envelope Sizing (initial)¶
- Avg config payload: 8–32 KB (compressed on wire).
- Tenant cardinality: `N_tenants`; each with ~`A` apps × `E` envs × `S` sets × `V` versions.
- Storage ≈ `N * A * E * S * V * 24 KB` + indexes (~1.6×); a worked example follows below. Start with a 3‑region, 9‑node CRDB cluster (3 per region, `n2-standard-8` class) and 3× Redis shards per region.
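A worked example of the sizing formula with illustrative counts (not targets): 500 tenants × 4 apps × 3 envs × 10 sets × 50 versions at a 24 KB average payload and a 1.6× index factor comes to roughly 110 GiB before replication.
// Back-of-envelope calculation matching the formula above; all inputs are illustrative.
public static class CapacityEstimate
{
    public static double EstimateGiB(long tenants, long apps, long envs, long sets, long versions,
                                     double avgPayloadKb = 24, double indexFactor = 1.6)
    {
        double rawKb = tenants * apps * envs * sets * versions * avgPayloadKb;
        return rawKb * indexFactor / (1024.0 * 1024.0); // KB -> GiB
    }

    public static void Main()
    {
        // 500 * 4 * 3 * 10 * 50 * 24 KB * 1.6 ≈ 110 GiB before replication.
        System.Console.WriteLine($"{EstimateGiB(500, 4, 3, 10, 50):F0} GiB");
    }
}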
Integration with Refresh Orchestrator¶
- On `CONFIG_VERSION` commit (`is_head = true` flip), emit a `ConfigHeadChanged` event.
- Orchestrator:
  - Upsert the compiled payload to the Redis `{tenant}` shard in the write region.
  - Publish `ecs:invalidate:{tenantId}` with `(setKey, semver, content_hash)`.
  - For canary/waves, publish scoped invalidations using `{selector}`.
Security Considerations¶
- Row‑level scoping by `tenant_id` enforced in all data access paths; the Gateway injects claims → the service verifies.
- At‑rest encryption (CRDB/TDE if available) + backup encryption with KMS.
- Redis: TLS, AUTH/ACLs, key prefix isolation, and a value MAC (HMAC of `content_hash` + tenant secret) validated by SDKs, as sketched below.
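A minimal sketch of the value MAC check; the exact MAC input and the per-tenant key distribution mechanism are assumptions here, not the shipped design.
// HMAC-SHA256 over the content hash with a per-tenant secret; SDKs verify before use.
using System;
using System.Security.Cryptography;
using System.Text;

public static class CacheValueMac
{
    public static string Compute(string contentHash, byte[] tenantSecret)
    {
        using var hmac = new HMACSHA256(tenantSecret);
        return Convert.ToHexString(hmac.ComputeHash(Encoding.UTF8.GetBytes(contentHash)));
    }

    // Reject cached values whose MAC does not match (constant-time comparison).
    public static bool Verify(string contentHash, string mac, byte[] tenantSecret) =>
        CryptographicOperations.FixedTimeEquals(
            Convert.FromHexString(mac),
            Convert.FromHexString(Compute(contentHash, tenantSecret)));
}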
Solution Architect Notes¶
- If config content > 128 KB per set becomes common, move content to object storage and store pointers + hashes in CRDB; keep small materialized views for hot paths in Redis.
- Validate whether per‑tenant PITR is required for Free/Pro; if not, restrict to Enterprise to control storage cost.
- Decide the default diff strategy (`jsonpatch` vs semantic) per SDK needs; semantic tends to be friendlier for rollout previews.
Acceptance Criteria (engineering hand‑off)¶
- DDL migrations emitted with multi‑region locality, hash‑sharded PKs, and required indexes.
- Row‑level TTL configured for audit / diff tables with edition‑aware durations.
- Redis cluster charts (Helm) with per‑region topology, metrics, and alerting; key naming spec published to SDK teams.
- Backup schedules, KMS bindings, and restore playbook present in runbooks; table‑level rehearsal completed in staging.
- Telemetry dashboards showing p95 CRUD, cache hit %, restore drill duration, and rollout convergence.
Eventing & Change Propagation — CloudEvents, topics, delivery semantics, DLQs, replay; MassTransit conventions¶
Objectives¶
Define the event model and propagation pipeline used by ECS to notify SDKs/services of configuration changes with predictable semantics:
- CloudEvents 1.0 envelopes (JSON structured-mode).
- Clear topic map & naming.
- At‑least‑once delivery with idempotency, ordering keys, DLQs, and replay.
- MassTransit conventions for Azure Service Bus (default) and RabbitMQ (alt).
Event Envelope (CloudEvents)¶
Transport: JSON (structured mode) over AMQP (ASB) or AMQP 0‑9‑1 (Rabbit); HTTP/gRPC use binary mode when needed.
Required attributes
- `specversion: "1.0"`
- `id: string` — globally unique per event
- `source: "ecs://{region}/{service}/{resource}"` — e.g., `ecs://eu-central/config-registry/config-sets/4b...`
- `type: "ecs.{domain}.v1.{EventName}"` — e.g., `ecs.config.v1.ConfigPublished`
- `subject: "{tenantId}/{pathOrSetId}"` — routing hint for consumers
- `time`: RFC3339
Extensions (multi‑tenant & lineage)
- `tenantid: string`
- `editionid: string`
- `environment: string` (dev|test|staging|prod)
- `etag: string`
- `version: string` (snapshot or semver)
- `correlationid: string` (propagated from the API)
- `actor: string` (sub@iss)
- `region: string` (emit region)
Example (structured)
{
"specversion": "1.0",
"id": "e-9f1b2f7f-b49e-4f5a-90d0-93a7a6a4b0ef",
"source": "ecs://eu-central/config-registry/config-sets/4b1c",
"type": "ecs.config.v1.ConfigPublished",
"subject": "tnt-7d2a/apps/billing/sets/runtime",
"time": "2025-08-24T09:12:55Z",
"tenantid": "tnt-7d2a",
"editionid": "enterprise",
"environment": "prod",
"version": "v1.8.0",
"etag": "9oM1hQ...",
"correlationid": "c-77f2...",
"region": "eu-central",
"data": {
"setId": "4b1c...",
"path": "apps/billing/**",
"changes": { "breaking": 0, "additive": 3, "neutral": 1 }
}
}
Topic Map & Naming¶
Exchange/Topic names (canonical, kebab-case):
- `ecs.config.events` — lifecycle of config sets & versions
  - `ecs.config.v1.ConfigDraftSaved`
  - `ecs.config.v1.ConfigPublished`
  - `ecs.config.v1.ConfigRolledBack`
- `ecs.policy.events` — policy/edition/schema changes
  - `ecs.policy.v1.PolicyUpdated`
- `ecs.refresh.events` — cache targets & client refresh signals
  - `ecs.refresh.v1.CacheInvalidated`
  - `ecs.refresh.v1.RefreshRequested`
- `ecs.adapter.events` — external provider sync results
  - `ecs.adapter.v1.SyncCompleted`
  - `ecs.adapter.v1.SyncFailed`
Routing/partition keys
- Primary key: `tenantid`
- Secondary (optional): `pathPrefix` or `setId`
- Ensures per‑tenant ordering and hot-tenant isolation.
Queue/Subscription naming (MassTransit)
- Consumer queues: `ecs-{service}-{consumer}-{env}` (e.g., `ecs-refresh-orchestrator-configpublished-prod`)
- Error/DLQ: `{queue}_error`, `{queue}_skipped` (parking lot)
- Scheduler (delayed): `ecs-scheduler` (uses the ASB/Rabbit delayed delivery plugin where available)
Delivery Semantics¶
| Property | Choice | Rationale |
|---|---|---|
| Delivery | At‑least‑once | Simpler guarantees; consumers must be idempotent |
| Ordering | Best‑effort global, ordered per `(tenantid, key)` | Partitioning by tenant keeps most flows ordered |
| Idempotency | Required at consumers | Dedupe by id (event store) or (tenantid, path, etag) |
| Visibility | Structured CloudEvents | Uniform across bus/HTTP |
| Fan‑out | Topic → consumer queues | Decoupled consumers; backpressure per queue |
Consumer idempotency keys
- Config state changes: `(tenantid, setId, version|etag)`
- Cache invalidation: `(tenantid, path, etag)`
- Adapter sync: `(tenantid, provider, cursor)`
Retries, DLQs, and Parking Lots¶
Retry policy (MassTransit middleware)
- Immediate retries: 3
- Exponential retries: 5 attempts, 2s → 1m (jitter)
- Circuit‑break: open after 50 failures/60s; half-open after 30s
DLQ
- Poison messages route to `{queue}_error` with headers: `mt-fault-message`, `mt-reason`, `mt-host`, `mt-exception-type`, `stacktrace`
- CloudEvents attributes echoed for triage (`tenantid`, `type`, `id`)
- Parking lot `{queue}_skipped` for known non‑actionable events (e.g., stale versions), enabling manual replay later.
Operational actions
- Retry single message: move from DLQ → main queue
- Bulk reprocess: export DLQ to blob, filter, re‑enqueue via Replay Worker
Replay Strategy¶
Sources
- Outbox/Audit log (authoritative): every emitted event recorded with `status = Emitted|Pending`.
- Broker retention (short): not guaranteed for long windows → rely on the outbox.
Replay worker
- Input: time/tenant filter or cursor range
- Reads the outbox, re‑emits CloudEvents with a new `id` and `data.replayOf = "<original-id>"` (to avoid the consumer dedupe drop)
- Throttled per tenant; writes operator annotations in `meta` (`reason: "replay"`, `ticket: ...`)
Consumer expectations
- Treat `replayOf` as informational; still dedupe on the current `id`
- Business logic must tolerate duplicate state transitions
Outbox & Inbox Patterns¶
Outbox (publisher side, e.g., Config Registry)
- Within the same transaction as the state change:
  - Append `Outbox(tenantid, type, subject, data, eventId, status='Pending')`
- Background dispatcher (MassTransit) reads `Pending`, publishes, marks `Emitted`
Inbox (consumer side, optional)
- Table `Inbox(eventId, consumer, processedAt)`
- Before handling, check presence; after success, insert → ensures exactly‑once processing per consumer (see the consumer sketch below)
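A sketch of the inbox check inside a MassTransit consumer; `IInboxStore` and the `ConfigPublished` payload shape are illustrative — only `IConsumer<T>`/`ConsumeContext<T>` are MassTransit’s own API.
// Inbox pattern: check before handling, record after success, so redeliveries are no-ops.
using System.Threading.Tasks;
using MassTransit;

public record ConfigPublished(string TenantId, string SetId, string Version, string Etag, string EventId);

public interface IInboxStore
{
    Task<bool> ExistsAsync(string eventId, string consumer);
    Task MarkProcessedAsync(string eventId, string consumer);
}

public sealed class ConfigPublishedConsumer : IConsumer<ConfigPublished>
{
    private readonly IInboxStore _inbox;
    public ConfigPublishedConsumer(IInboxStore inbox) => _inbox = inbox;

    public async Task Consume(ConsumeContext<ConfigPublished> context)
    {
        const string consumer = nameof(ConfigPublishedConsumer);

        // Before handling: skip events this consumer has already processed.
        if (await _inbox.ExistsAsync(context.Message.EventId, consumer))
            return;

        // Idempotent handling here (e.g., scoped cache invalidation keyed by tenant/set/etag).

        // After success: record the event id so at-least-once redeliveries are dropped.
        await _inbox.MarkProcessedAsync(context.Message.EventId, consumer);
    }
}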
MassTransit Conventions (ASB default, Rabbit alt)¶
Common
- EntityNameFormatter = kebab-case
- Endpoint per message type disabled; use explicit endpoints per service
- Prefetch tuned per consumer: start = 32–128, adjust by p99
- ConcurrentMessageLimit per handler: start = CPU cores × 2
- Observability: OpenTelemetry enabled; activity name = `ecs.{messageType}`; tags: `tenantid`, `editionid`, `environment`, `etype`
Azure Service Bus
cfg.UsingAzureServiceBus((context, sb) =>
{
sb.Host(connStr);
sb.Message<ConfigPublished>(m => m.SetEntityName("ecs.config.events"));
sb.SubscriptionEndpoint<ConfigPublished>("ecs-refresh-orchestrator-configpublished-prod", e =>
{
e.ConfigureConsumer<ConfigPublishedConsumer>(context);
e.PrefetchCount = 128;
e.MaxAutoRenewDuration = TimeSpan.FromMinutes(5);
e.EnableDeadLetteringOnMessageExpiration = true;
e.UseMessageRetry(r => r.Exponential(5, TimeSpan.FromSeconds(2), TimeSpan.FromMinutes(1), TimeSpan.FromSeconds(3)));
e.UseCircuitBreaker(cb => cb.ResetInterval = TimeSpan.FromSeconds(30));
e.UseInMemoryOutbox(); // plus persistent Outbox at publisher
});
});
RabbitMQ
cfg.UsingRabbitMq((context, rmq) =>
{
rmq.Host(host, "/", h => { h.Username(user); h.Password(pass); });
rmq.Message<ConfigPublished>(m => m.SetEntityName("ecs.config.events"));
rmq.ReceiveEndpoint("ecs-refresh-orchestrator-configpublished-prod", e =>
{
e.Bind("ecs.config.events", x => { x.RoutingKey = "ecs.config.v1.ConfigPublished"; x.ExchangeType = ExchangeType.Topic; });
e.PrefetchCount = 64;
e.UseMessageRetry(r => r.Exponential(5, TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(60), TimeSpan.FromSeconds(3)));
e.UseDelayedRedelivery(r => r.Intervals(TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(30)));
e.UseInMemoryOutbox();
});
});
Standard Event Contracts (data payloads)¶
ecs.config.v1.ConfigPublished¶
{
"setId": "4b1c...",
"path": "apps/billing/**",
"snapshotId": "S-00F",
"version": "v1.8.0",
"etag": "9oM1hQ...",
"policy": { "breaking": false, "violations": [] }
}
ecs.refresh.v1.CacheInvalidated¶
{
"scope": { "path": "apps/billing/**", "environment": "prod" },
"targets": ["redis", "sdk"],
"reason": "publish",
"etag": "9oM1hQ..."
}
ecs.adapter.v1.SyncCompleted¶
{
"provider": "azure-appconfig",
"cursor": "w/1724490000",
"items": 128,
"changes": { "upserts": 120, "deletes": 8 }
}
End‑to‑End Flow (publish → refresh)¶
sequenceDiagram
participant UI as Studio
participant API as Registry
participant OB as Outbox
participant BUS as Event Bus
participant ORC as Refresh Orchestrator
participant RED as Redis
participant SDK as Client SDK
UI->>API: POST /config-sets/{id}/snapshots (publish)
API-->>OB: tx write + outbox pending(ConfigPublished)
API-->>UI: 201 Snapshot/Version
OB-->>BUS: publish CloudEvent (ConfigPublished)
BUS-->>ORC: deliver
ORC->>RED: scoped invalidation
ORC-->>BUS: publish CacheInvalidated
BUS-->>SDK: deliver (via WS bridge) / notify
SDK->>API: Resolve(etag) → 200/304
Backpressure & Throttling¶
- Per‑tenant quotas (see Gateway cycle); event consumers must:
  - Pause on `RESOURCE_EXHAUSTED` (MassTransit retry with backoff)
  - Limit in‑flight resolves on refresh storms (coalesce by `(tenantid, path)` within a 500 ms window)
- Orchestrator maintains bounded buffers; on overflow, it drops to the parking lot and raises a `RefreshDegradation` alert.
Security¶
- Bus credentials scoped to producer/consumer roles (least privilege).
- CloudEvents `actor` and `tenantid` are validated from the JWT at the publisher; consumers must not trust unvalidated extensions from third parties.
- Events avoid embedding secret values; only hashes/refs.
Observability & Alerting¶
Spans
- `ecs.publish`, `ecs.invalidate`, `ecs.replay`
- Attributes: `tenantid`, `event.type`, `queue`, `delivery.attempt`
Metrics
- Publish success rate, end‑to‑end propagation lag (publish→SDK resolve), DLQ size, replay throughput
- Consumer handler p95/p99, retries, circuit state
Alerts
- `PropagationLagP95` > 5s (5m)
- `DLQSize` > 100 (10m) per queue
- `ReplayFailureRate` > 1%
Acceptance Criteria (engineering hand‑off)¶
- MassTransit bus configuration (ASB + Rabbit) with kebab-case entities, retries, DLQs, OTEL.
- CloudEvents envelope library (shared) with validation and extensions.
- Outbox dispatcher and Replay worker with CLI & Studio hooks.
- Contract tests: idempotency, partition ordering, DLQ/replay, propagation lag budget.
- Runbooks: DLQ triage, targeted replay, tenant backpressure override.
Solution Architect Notes¶
- Start with per‑tenant partitioning; introduce path‑sharded partitions if a few tenants dominate traffic.
- Consider compaction (e.g., Kafka) when adding an analytics/event‑sourcing lane; current MVP favors ASB/Rabbit simplicity.
- Validate mobile clients via a server‑side WS bridge that consumes `ecs.refresh.events` and fans out over WebSocket/SSE with CloudEvents.
Refresh & Invalidation Flows — SDK pull/long‑poll/websocket, server push, cache stampede protection, ETags¶
Objectives¶
Design the end‑to‑end cache refresh and invalidation path that keeps SDKs and services current within seconds, while protecting the platform from thundering herds and ensuring multi‑tenant isolation.
Outcomes
- Three client models: periodic pull, long‑poll, WebSocket/SSE push.
- Server‑side targeted invalidation and coalesced refresh.
- ETag/conditional fetch as the primary coherency mechanism (304/`not_modified`).
- Stampede control at the SDK and the edge with singleflight + distributed locks and stale‑while‑revalidate.
Control Plane vs Data Plane¶
| Plane | Responsibility | Tech |
|---|---|---|
| Control | Propagate change intent | CloudEvents (ConfigPublished, CacheInvalidated) |
| Data | Deliver effect (new config) | REST GET /resolve (ETag) and gRPC Resolve/ResolveBatch |
Principle: Control plane never carries config values; clients always refetch using ETag‑aware reads.
Client Models¶
1) Periodic Pull (baseline)¶
- SDK timer fetches with `If-None-Match: <etag>` every T seconds (edition‑aware default: Starter = 30s, Pro = 10s, Enterprise = 5s).
- Cons: higher background traffic; longer staleness.
2) Long‑Poll (recommended default)¶
- SDK calls `/resolve?waitForChange=true&timeout=30s` (or gRPC `Resolve` with a deadline).
- Server blocks until a new ETag or the timeout → `200` with value or `304 Not Modified`.
- Pros: low idle traffic, near‑real‑time without persistent sockets.
3) WebSocket / SSE Push (premium)¶
- SDK subscribes to Refresh Channel (WebSocket or gRPC streaming).
- Server pushes `RefreshEvent(etag, path, cursor)`; SDK revalidates via `Resolve`.
- Pros: sub‑second fanout, very low latency.
- Cons: long‑lived connections; need heartbeats and resumption.
Flow Diagrams¶
A) Publish → Fanout → Client Revalidate¶
sequenceDiagram
participant Studio as Studio UI
participant Registry as Config Registry
participant Orchestrator as Refresh Orchestrator
participant Redis as Redis Cache
participant SDK as App SDK
Studio->>Registry: Publish snapshot (idempotent)
Registry-->>Redis: DEL tenant:path:* (scoped invalidation)
Registry->>Orchestrator: emit ConfigPublished(tenant, path, etag)
Orchestrator-->>SDK: push RefreshEvent(tenant, path, etag) [WS/L.Poll wake]
SDK->>Registry: Resolve(path, If-None-Match: oldEtag)
alt New ETag
Registry-->>SDK: 200 value + ETag(new)
SDK-->>SDK: Update L1 cache
else No change
Registry-->>SDK: 304 Not Modified
end
B) Long‑Poll Resolve (HTTP)¶
sequenceDiagram
participant SDK
participant GW as API Gateway
participant API as Resolve API
SDK->>GW: GET /resolve?path=...&waitForChange=true&timeout=30 (If-None-Match: etag)
GW->>API: forward (deadline=30s)
API-->>API: await new etag OR timeout (register waiter keyed by tenant+path)
alt etag changed
API-->>SDK: 200 value + ETag
else timeout
API-->>SDK: 304 Not Modified
end
C) WebSocket Refresh (server push)¶
sequenceDiagram
participant SDK
participant Push as Refresh WS Bridge
SDK->>Push: WS CONNECT /ws/refresh (x-tenant-id, JWT)
Push-->>SDK: HEARTBEAT 30s
Push-->>SDK: RefreshEvent(path, etag, cursor)
SDK->>Push: ACK cursor (bidi) OR implicit via next pull
SDK->>API: Resolve(If-None-Match: etag)
Server‑Side Invalidation & Coalescing¶
Targets
- Primary: Redis keys `ecs:{tenant}:{env}:{set}:{path}`
- Secondary: in‑process `L2` cache (optional) invalidated by a local event bus.
Coalescing
- The Resolve API maintains a waiter map per `(tenant, path)`:
  - On publish, it wakes all waiters and debounces new waiters for 200–500 ms to batch misses.
- The Orchestrator emits a single refresh signal per `(tenant, path, etag)` and coalesces bursts within 250 ms windows.
SDK Caching & ETag Strategy¶
Cache layers¶
- L1 (in‑process) with soft TTL and ETag index.
- L2 (Redis) on server‑side when SDK sits behind a service (optional; SDK still resolves via API).
ETag rules¶
- Treat the ETag as a strong validator of the materialized config (secrets remain references).
- Always send `If-None-Match` on Resolve.
- Respect `not_modified`/304 to preserve the L1 value and extend the soft TTL.
Suggested SDK algorithm (pseudo)¶
function getConfig(path, ctx):
key = (tenant, ctx, path)
entry = L1.get(key)
if entry && entry.fresh(): return entry.value
// singleflight per key to prevent stampede
v = singleflight(key, () => {
etag = entry?.etag
resp = Resolve(path, ctx, if_none_match=etag, deadline=250ms)
if resp.not_modified:
entry.touch()
return entry.value
else:
L1.put(key, resp.resolved, resp.etag, soft_ttl(ctx))
return resp.resolved
})
return v
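A rough C# rendering of the pseudocode above, for orientation only; `ResolveResult`, the resolve delegate, and the 30 s soft TTL are placeholders — the singleflight and ETag handling are the parts that matter.
// Singleflight per key + conditional Resolve; illustrative types, not the shipped SDK.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public record ResolveResult(bool NotModified, string? Json, string? Etag);
public record CacheEntry(string Json, string Etag, DateTimeOffset SoftExpiry)
{
    public bool Fresh => DateTimeOffset.UtcNow < SoftExpiry;
}

public sealed class ConfigReader
{
    private readonly ConcurrentDictionary<string, CacheEntry> _l1 = new();
    private readonly ConcurrentDictionary<string, Lazy<Task<string>>> _inflight = new();
    private readonly Func<string, string?, Task<ResolveResult>> _resolve; // (path, ifNoneMatchEtag)

    public ConfigReader(Func<string, string?, Task<ResolveResult>> resolve) => _resolve = resolve;

    public async Task<string> GetConfigAsync(string key)
    {
        if (_l1.TryGetValue(key, out var entry) && entry.Fresh)
            return entry.Json;

        // Singleflight: concurrent callers for the same key share one in-flight resolve.
        var lazy = _inflight.GetOrAdd(key, k => new Lazy<Task<string>>(() => RefreshAsync(k)));
        try { return await lazy.Value; }
        finally { _inflight.TryRemove(key, out _); }
    }

    private async Task<string> RefreshAsync(string key)
    {
        _l1.TryGetValue(key, out var entry);
        var resp = await _resolve(key, entry?.Etag);
        if (resp.NotModified && entry is not null)
        {
            // 304: keep the cached value and extend its soft TTL.
            _l1[key] = entry with { SoftExpiry = DateTimeOffset.UtcNow.AddSeconds(30) };
            return entry.Json;
        }
        _l1[key] = new CacheEntry(resp.Json!, resp.Etag!, DateTimeOffset.UtcNow.AddSeconds(30));
        return resp.Json!;
    }
}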
Stampede Protection¶
| Layer | Technique | Notes |
|---|---|---|
| SDK | singleflight per `(tenant, ctx, path)` | Collapse concurrent calls inside a process. |
| API | request collapsing + memoization for active resolves | Reuse upstream result for waiters. |
| Redis | distributed lock on cache fill (SET NX PX) | Timeout ≤ 250 ms; losers get stale‑while‑revalidate. |
| TTL | stale‑while‑revalidate (SWR) | Serve stale for ≤ 2s while a single refresher fetches. |
| Jitter | TTL jitter ±10–20% | Spread expirations to avoid sync spikes. |
| Backoff | jittered exponential after 429/503 | Max backoff 2s for reads. |
Negative caching: For consistent 404 resolves, cache NEGATIVE(etag) for short TTL (≤ 5s) to avoid hammering.
Long‑Poll API Contract (REST)¶
GET /api/v1/resolve?path={p}&env={e}&version=latest&waitForChange=true&timeout=30
Headers:
If-None-Match: "<etag>"
Responses:
200 OK body: value, headers: ETag: "<new>"
304 Not Modified (on timeout or same etag)
Server behavior
- Max `timeout` = 30s (configurable).
- Registers a waiter; returns `304` on timeout to allow the client to re‑issue without breaking caches.
WebSocket/SSE Contract (HTTP)¶
Handshake
- `GET /ws/refresh` (WS) or `GET /sse/refresh` (SSE) with `Authorization` and `x-tenant-id`.
Messages (JSON)
// server -> client
{ "type":"refresh", "cursor":"c-12345", "path":"apps/billing/**", "etag":"9oM1hQ..." }
{ "type":"heartbeat", "ts":"2025-08-24T11:00:00Z" }
{ "type":"nack", "code":"ECS.SUBSCRIPTION.INVALID", "retryAfterMs":2000 }
// client -> server (WS only)
{ "type":"ack", "cursor":"c-12345" }
Resumption
- Client reconnects with `?resumeAfter=c-<cursor>` to avoid gaps; the server backfills from the replay window.
gRPC Alignment (from Contracts cycle)¶
- `ResolveRequest.if_none_match_etag` → `ResolveResponse.not_modified`/`etag`.
- `RefreshChannel.Subscribe`/`SubscribeWithAck` provide the push lane; SDKs always revalidate with `Resolve`.
Health, Heartbeats & Timeouts¶
- WS/SSE heartbeat every 15–30s; close on 3 missed heartbeats.
- Long‑poll deadline recommended = timeout + 2s.
- SDK should rotate tokens before expiry; reconnect on `UNAUTHENTICATED`/401.
Rate Governance & Flood Safety¶
- Gateway applies per‑tenant read QPS limits (edition‑aware).
- Orchestrator coalesces and drops duplicates within a 250 ms window; emits a single refresh per ETag.
- Clients must obey `Retry-After` on 429 and reduce concurrent resolves (cap to 2 in‑flight per process).
Failure Modes & Degradation Paths¶
| Failure | Client Behavior | Server Behavior |
|---|---|---|
| WS bridge down | Fallback to long‑poll; increase T by ×2 | Auto‑heal; drain buffers; send replayOf on resume |
| Event bus lag | Continue periodic pull; widen interval by +50% | Alert; protect Redis with SWR |
| Redis unavailable | Resolve from DB (higher latency) | Bypass Redis; apply per‑tenant circuit breaker |
| ETag mismatch loops | Full resolve without If-None-Match once; then resume ETag path | Log and re‑materialize canonical cache line |
| Rate‑limit 429 | Jittered backoff (100–500 ms) | Emit RateLimitExceeded audit; expose limits in headers |
Observability¶
SDK emits `ecs.sdk.resolve.latency` (p50/p95/p99), `cache.hit/miss`, `lp.wakeups`, `ws.reconnects`, `etag.rotate.count`.
Server dashboards
- Propagation lag (publish → first 200 Resolve)
- Waiter pool size & coalescing ratio
- Redis lock wait time & lock contention
- Long‑poll saturation (% requests returning 304 on timeout)
Acceptance Criteria (engineering hand‑off)¶
- Implement waiter registry with safe cancellation; unit tests for coalescing and timeout behavior.
- Add singleflight utility in SDKs (.NET/JS/Mobile) with per‑key granularity.
- Redis invalidation with scoped keys and distributed lock on fill.
- WS/SSE bridge with heartbeats, resume tokens, per‑tenant connection caps.
- End‑to‑end tests: publish → SDK updates within <5s (Pro/Ent), <15s (Starter) under load, no stampede.
Solution Architect Notes¶
- Prefer long‑poll as default mode across SDKs; enable WS behind a feature flag per tenant/edition.
- Keep SWR≤2s to bound staleness while avoiding spikes.
- Consider a per‑tenant, per‑path moving window to limit redundant resolves (e.g., ignore duplicate refresh events for 200 ms).
- For mobile, use SSE when WS is blocked; keep battery impact minimal with adaptive backoff.
Provider Adapter Hub – Plug‑in Model, Contracts & Lifecycles¶
This section specifies the Provider Adapter Hub that bridges ECS with external configuration backends (Azure App Configuration, AWS AppConfig, Consul, Redis, SQL/CockroachDB). It defines the plug‑in model, capability contracts, adapter lifecycles, multi‑tenant isolation, and operational guarantees so Engineering Agents can implement adapters consistently and safely.
Architectural Context¶
flowchart LR
subgraph ECS Core
Hub[Adapter Hub]
Registry[Config Registry]
Policy[Policy Engine]
Refresh[Refresh Orchestrator]
EventBus[[Event Bus (CloudEvents)]]
end
subgraph Adapters (Out-of-Process Plugins)
AAZ[Azure AppConfig Adapter]
AAW[AWS AppConfig Adapter]
ACS[Consul KV Adapter]
ARD[Redis Adapter]
ASQL[SQL/CockroachDB Adapter]
end
Hub <-- gRPC SPI --> AAZ
Hub <-- gRPC SPI --> AAW
Hub <-- gRPC SPI --> ACS
Hub <-- gRPC SPI --> ARD
Hub <-- gRPC SPI --> ASQL
Registry <--> Hub
Policy <--> Hub
Refresh <--> Hub
EventBus <--> Hub
Design stance
- Out‑of‑process adapters run as isolated containers (per provider) and expose a gRPC Service Provider Interface (SPI) to the Hub.
- Capability-driven: Each adapter declares supported feature flags (e.g., hierarchical keys, ETag, watch/stream, transactions).
- Multi‑tenant aware: One adapter process may serve multiple tenants via namespaced bindings with per-tenant credentials.
- Event-first: All changes propagate as CloudEvents, with at‑least‑once delivery and idempotency keys.
Plug‑in Packaging & Discovery¶
| Aspect | Requirement |
|---|---|
| Image Layout | OCI container, label io.connectsoft.ecs.adapter=true with adapter.id, adapter.version, capabilities annotations |
| SPI Transport | gRPC over mTLS (cluster DNS). Optional Unix domain socket for sidecar deployments. |
| Registration | Adapter posts a Manifest to Hub on startup (Register()), Hub persists in Registry. |
| Upgrades | Rolling upgrade supported; Hub revalidates capabilities and drains in‑flight calls. |
| Tenancy | Dynamic Bindings created per tenant/environment via Hub API; Hub passes scoped credentials to adapter using short‑lived secrets. |
Adapter Manifest (example)
apiVersion: ecs.connectsoft.io/v1alpha1
kind: AdapterManifest
metadata:
adapterId: azure-appconfig
version: 1.4.2
spec:
provider: AzureAppConfig
capabilities:
hierarchicalKeys: true
etagSupport: true
watchStreaming: true
transactions: false
bulkList: true
tags: ["labels","contentType"]
inputs:
- name: connection
type: secretRef # provided by Hub at Bind time
- name: storeName
type: string
limits:
maxKeySize: 1024
maxValueSize: 1048576
maxBatch: 500
events:
emits: ["ExternalChangeObserved","SyncJobCompleted"]
consumes: ["ConfigPublished","BindingRotated"]
security:
requiredScopes: ["kv.read","kv.write"]
secretTypes: ["azure:clientCredentials"]
gRPC SPI (Service Provider Interface)¶
Proto (excerpt)
syntax = "proto3";
package ecs.adapters.v1;
message BindingRef {
string tenant_id = 1;
string environment = 2; // dev|staging|prod
string namespace = 3; // app/service scope
string binding_id = 4; // unique per tenant+namespace
}
message CredentialEnvelope {
string provider = 1; // e.g., AzureAppConfig
bytes payload = 2; // KMS-encrypted secret blob
string kms_key_ref = 3;
string version = 4;
int64 not_after_unix = 5; // lease expiry
}
message BindRequest {
BindingRef binding = 1;
CredentialEnvelope credential = 2;
map<string,string> options = 3; // e.g., storeName, region, prefix
}
message Key {
string path = 1; // normalized ECS path: /apps/{app}/env/{env}/[...]
string label = 2; // provider-specific label/namespace
}
message Item {
Key key = 1;
bytes value = 2; // arbitrary bytes (JSON, text, binary)
string content_type = 3;
string etag = 4; // provider etag/version token
map<string,string> meta = 5;
}
message GetRequest { BindingRef binding = 1; Key key = 2; string if_none_match = 3; }
message GetResponse { Item item = 1; bool not_modified = 2; }
message PutRequest {
BindingRef binding = 1;
Item item = 2;
string if_match = 3; // CAS on etag
bool upsert = 4;
string idempotency_key = 5;
}
message PutResponse { Item item = 1; }
message ListRequest { BindingRef binding = 1; string prefix = 2; int32 page_size = 3; string page_token = 4; }
message ListResponse { repeated Item items = 1; string next_page_token = 2; }
message DeleteRequest { BindingRef binding = 1; Key key = 2; string if_match = 3; string idempotency_key = 4; }
message DeleteResponse {}
message WatchRequest { BindingRef binding = 1; string prefix = 2; string resume_token = 3; }
message ChangeEvent {
string change_id = 1; // for idempotency
string type = 2; // upsert|delete
Item item = 3;
string resume_token = 4; // bookmark for stream resumption
}
service ProviderAdapter {
rpc RegisterManifest(google.protobuf.Empty) returns (AdapterInfo);
rpc Bind(BindRequest) returns (BindingAck);
rpc Validate(BindRequest) returns (ValidationResult);
rpc Get(GetRequest) returns (GetResponse);
rpc Put(PutRequest) returns (PutResponse);
rpc Delete(DeleteRequest) returns (DeleteResponse);
rpc List(ListRequest) returns (ListResponse);
rpc Watch(WatchRequest) returns (stream ChangeEvent); // server-streamed
rpc Health(google.protobuf.Empty) returns (HealthStatus);
rpc Unbind(BindingRef) returns (google.protobuf.Empty);
}
Contract notes
- Idempotency: `idempotency_key` required for `Put`/`Delete`. Adapters must persist an idempotency window (e.g., 24h) to dedupe retries.
- ETag/CAS: use `if_match`/`if_none_match` for concurrency control where the provider supports it; otherwise emulate with version vectors.
- Pagination: mandatory stable ordering by key path; opaque `page_token`.
Binding & Namespacing¶
Each tenant/environment maps to a provider namespace:
| Provider | Namespace Strategy |
|---|---|
| Azure AppConfig | key = {prefix}:{app}:{env}:{path}, label = {namespace} (labels for environment or slot) |
| AWS AppConfig | Application/Profile/Environment map to ECS namespace/env; versions map to deployments |
| Consul KV | /{tenant}/{env}/{namespace}/{path} with ACL tokens per binding |
| Redis | key = ecs:{tenant}:{env}:{namespace}:{path}, optional hash for grouping; TTL not used for config |
| CockroachDB (SQL) | Table per tenant partitioned by {env, namespace} with composite PK (key, version) |
Key normalization
- ECS canonical path: `/apps/{app}/env/{env}/services/{svc}/paths/...`
- Adapters implement a deterministic mapping and store the reverse mapping in `meta` to support round‑trip and diffs (see the mapping sketch below).
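An illustrative sketch of one deterministic mapping for the Azure App Configuration row in the table above (key = `{prefix}:{app}:{env}:{path}`, label = `{namespace}`); the types and split logic are assumptions, not the adapter’s actual code.
// Forward and reverse key mapping so external changes can round-trip and be diffed.
public record EcsPath(string App, string Env, string Service, string Path, string Namespace);
public record AzureKey(string Key, string Label);

public static class AzureKeyMapper
{
    public static AzureKey ToProvider(EcsPath p, string prefix = "ecs") =>
        new($"{prefix}:{p.App}:{p.Env}:services/{p.Service}/paths/{p.Path}", p.Namespace);

    public static EcsPath FromProvider(AzureKey k, string prefix = "ecs")
    {
        var parts = k.Key.Split(':', 4);     // [prefix, app, env, rest]
        var rest = parts[3];                 // "services/{svc}/paths/{path}"
        var restParts = rest.Split('/', 4);  // ["services", svc, "paths", path]
        return new EcsPath(App: parts[1], Env: parts[2], Service: restParts[1], Path: restParts[3], Namespace: k.Label);
    }
}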
Adapter Lifecycle (State Machine)¶
stateDiagram-v2
[*] --> Discovered
Discovered --> Registered: RegisterManifest()
Registered --> Bound: Bind()+Validate()
Bound --> Active: Health(ok)
Active --> RotatingCreds: BindingRotated
RotatingCreds --> Active: re-Validate()
Active --> Draining: Unbind() requested / Upgrade
Draining --> Unbound: no inflight
Unbound --> [*]
Active --> Degraded: Health(warn/fail)
Degraded --> Active: Auto-recover/Retry
Degraded --> Draining: Manual intervention
Lifecycle hooks
- `Validate()` must perform least‑privilege checks and a dry‑run (list one key, check write if permitted).
- `Unbind()` drains Watch streams and completes in-flight mutations; the Hub retries if deadlines expire.
Change Propagation & Sync¶
Flows
- ECS → Provider (Publish): Registry emits `ConfigPublished` → Hub resolves bindings → `Put()` batch to adapter → on success, Hub emits `SyncJobCompleted`.
- Provider → ECS (Observe): Adapter `Watch(prefix)` streams `ChangeEvent` for external drifts → Hub translates to `ExternalChangeObserved` → Policy Engine validates → Registry applies or flags drift.
sequenceDiagram
participant Registry
participant Hub
participant Adapter as AzureAdapter
participant Provider as AzureAppConfig
Registry->>Hub: ConfigPublished(tenant/app/env)
Hub->>Adapter: Put(batch, idempotency_key)
Adapter->>Provider: PUT keys (CAS/etag)
Provider-->>Adapter: 200 + etags
Adapter-->>Hub: PutResponse(items)
Hub-->>Registry: SyncJobCompleted
Delivery semantics
- At‑least‑once end‑to‑end with idempotency on adapter side.
- Backpressure: Hub enforces max in‑flight per binding and exponential backoff per provider rate limits.
- DLQ: failed mutations or change translations publish to `ecs.adapters.dlq` with replay tokens.
Diffing, Snapshot, Rollback¶
- List(prefix) yields provider snapshot; Hub computes 3‑way diff (desired, provider, last‑applied).
- Rollback uses ECS Versioning to restore prior desired state; Hub re‑publishes with CAS on previous etags where possible.
- Partial capability: If provider lacks ETag, Hub marks binding non‑concurrent, serializes writes.
Security & Secrets¶
| Concern | Pattern |
|---|---|
| Credentials | Short‑lived CredentialEnvelope issued by Hub via cloud KMS/KeyVault/Secrets Manager; rotate via BindingRotated event. |
| Network | mTLS between Hub↔Adapter; egress to provider via provider’s SDK with TLS1.2+. |
| Least Privilege | Scoped roles per binding (read-only vs read/write). |
| Auditing | All SPI calls include traceId, tenantId, actor; adapters must log structured spans. |
| Data Privacy | No plaintext secrets in logs; PII scrubbing middleware in adapters. |
Observability & Health¶
Metrics (per binding)
- `adapter_requests_total{op}`
- `adapter_request_duration_ms{op}`
- `adapter_throttle_events_total`
- `adapter_watch_gaps_total` (stream gaps/restarts)
- `adapter_idempotent_dedup_hits_total`
- `adapter_errors_total{type=auth|rate|transient|fatal}`
Health
`HealthStatus` returns `status`, `since`, and `checks[]` covering auth, reachability, and quota.
Provider‑Specific Nuances¶
Azure App Configuration¶
- Use `label` for env/namespace, `contentType` for metadata.
- Leverage `If-None-Match` for conditional GET; `ETag` on PUT/CAS.
- Watch via push notifications (Event Grid) is optional; otherwise the adapter polls with ETag bookmarks.
AWS AppConfig¶
- Publishing often uses Hosted Config with Deployments; adapter wraps version promotion rather than per‑key PUT.
- Changes are profile+environment scoped; adapter maps ECS keys to a JSON document (schema‑validated).
Consul KV¶
- Native blocking query support; the adapter implements `Watch` using the index token (long‑poll).
- ACL tokens per binding; transactions (if enabled) can apply atomic batches.
Redis¶
- Use keyspace notifications where available; otherwise use SCAN + version fields.
- Prefer HASH records for grouped configuration with a version field to emulate ETag.
SQL/CockroachDB¶
- Schema (simplified):
create table config_items (
tenant_id uuid not null,
environment text not null,
namespace text not null,
path text not null,
version int8 not null default 1,
content_type text,
value bytea not null,
etag text not null,
updated_at timestamptz not null default now(),
primary key (tenant_id, environment, namespace, path)
);
create index on config_items (tenant_id, environment, namespace);
- ETag computed as `sha256(value || version)`; changes streamed via CDC if enabled.
Error Model¶
| Code | Meaning | Hub Behavior |
|---|---|---|
| `ALREADY_EXISTS` | CAS conflict (etag mismatch) | Retry with latest `Get()` and policy merge |
| `FAILED_PRECONDITION` | Invalid binding/permissions | Mark binding degraded, alert Security |
| `RESOURCE_EXHAUSTED` | Throttled/quota | Backoff (jitter), adjust batch size |
| `UNAVAILABLE` | Provider outage | Circuit break per binding; queue for replay |
| `DEADLINE_EXCEEDED` | Timeout | Retry with exponential backoff; consider reducing page/batch size |
All responses must include retry_after_ms hint where applicable.
Rate Limits & Concurrency¶
- Hub defaults: 8 concurrent ops per binding, 256 per adapter process.
- Adapter declares provider limits in Manifest; Hub tunes concurrency dynamically (token bucket).
- Large publications use chunking with `maxBatch` from the Manifest.
Compliance & Governance¶
- Adapters must pass conformance tests:
- CRUD contract, idempotency, ETag/CAS behavior
- Watch stream continuity (resume token)
- Backpressure under synthetic throttling
- Multi‑tenant isolation (no cross‑leak with wrong credentials)
- Supply SBOM and sign container images (Sigstore).
Example: Binding Creation (REST)¶
POST /v1/adapters/azure-appconfig/bindings
Content-Type: application/json
{
"tenantId": "t-123",
"environment": "prod",
"namespace": "billing",
"options": { "storeName": "appcfg-prod", "prefix": "ecs" },
"credentialRef": "kv://secrets/azure/appcfg/billing-prod",
"permissions": ["read","write"]
}
Hub resolves credentialRef → creates CredentialEnvelope → calls Bind() then Validate() on adapter.
Testing Matrix (Adapter Conformance)¶
| Area | Scenario | Expected |
|---|---|---|
| Idempotency | Retry `Put()` with the same `idempotency_key` | Exactly one write at the provider |
| CAS | `Put(if_match=stale_etag)` | `ALREADY_EXISTS`, no mutation |
| Watch | Kill adapter, restart with `resume_token` | Stream resumes without loss |
| Pagination | `List()` with a small `page_size` | Complete coverage, stable ordering |
| Drift | Provider side key modified | ExternalChangeObserved fired |
| Throttle | Provider returns 429/limit | Backoff, no hot‑loop |
Solution Architect Notes¶
- Sidecar vs central adapter: For high‑churn namespaces, deploy adapter sidecar to reduce latency; otherwise run shared adapters per provider with strong isolation at binding layer.
- Schema guarantees: Where providers are document‑style (AWS AppConfig), define a document aggregator in Hub that flattens ECS keyspace into a single JSON with stable ordering and checksum for CAS.
- Rollout safety: Pair publication with Refresh Orchestrator to gradually push SDK refresh (canary → region → all).
- Cost & limits: Respect provider rate/quota; tune batch sizes from Manifest during runtime using feedback metrics.
This specification enables Engineering Agents to implement provider adapters with consistent contracts, safe lifecycles, and operational reliability across heterogeneous backends, while preserving ECS’s multi‑tenant isolation and event‑driven change propagation.
SDK Design (.NET / JS / Mobile) – APIs, Caching, Offline, Diagnostics, Feature Flags, Circuit Breakers¶
This section defines the client SDKs for ECS across .NET, JavaScript/TypeScript, and Mobile (MAUI / React Native). It translates platform-neutral behaviors (auth, tenant routing, caching, refresh, and resiliency) into implementable APIs with consistent cross-language semantics. SDKs are multi-tenant aware, edition aware, and aligned with Clean Architecture and observability-first principles.
Goals & Non-Functional Targets¶
| Area | Target |
|---|---|
| Latency | p95 GetConfig() ≤ 50ms (warm cache), ≤ 250ms (cache miss, intra-region) |
| Availability | 99.95% SDK internal availability (local cache + backoff) |
| Cache Correctness | Strong read-your-writes for same client session; eventual consistency ≤ 2s with server push/long-poll |
| Footprint | < 400KB minified JS; < 1.5MB .NET DLLs; < 800KB mobile binding (excl. deps) |
| Telemetry | OTEL spans for API calls; metrics for hit/miss, refresh latency, circuit state |
| Security | OIDC access token; least-privilege scopes; mTLS optional for enterprise |
| Backwards-Compat | SemVer with non-breaking minor releases; feature flags guarded by capability negotiation |
High-Level Architecture¶
classDiagram
class EcsClient {
+GetConfig(key, options) : ConfigValue
+GetSection(path, options) : ConfigSection
+Subscribe(keys|paths, handler) : SubscriptionId
+SetContext(Tenant, Environment, App, Edition)
+FlushCache(scope?)
+Diagnostics() : EcsDiagnostics
}
class Transport {
<<interface>>
+RestCall()
+GrpcCall()
+WebsocketStream()
+LongPoll()
}
class CacheLayer {
+TryGet(key) : CacheResult
+Put(key, value, etag, ttl)
+Invalidate(key|prefix)
+PromoteToPersistent()
}
class OfflineStore {
+Read(key) : OfflineResult
+Write(snapshot)
+Enumerate()
+GC(policy)
}
class Resiliency {
+WithRetry()
+WithCircuitBreaker()
+WithTimeouts()
+WithJitterBackoff()
}
class FeatureFlags {
+IsEnabled(flag, context) : bool
+Variant(flag, context) : string|number
}
EcsClient --> Transport
EcsClient --> CacheLayer
EcsClient --> OfflineStore
EcsClient --> Resiliency
EcsClient --> FeatureFlags
Public API Surface (Cross-Language Semantics)¶
Core Concepts¶
- Context: `{ tenantId, environment, applicationId, edition }`
- Selector: `key` (exact) or `path` (hierarchical section).
- Consistency Options: `{ preferCache, mustBeFreshWithin, staleOk, etag }`
- Subscription: push updates for keys/paths via WS/long-poll.
.NET (C#)¶
public sealed class EcsClient : IEcsClient, IAsyncDisposable
{
Task<ConfigValue<T>> GetConfigAsync<T>(string key, GetOptions? options = null, CancellationToken ct = default);
Task<ConfigSection> GetSectionAsync(string path, GetOptions? options = null, CancellationToken ct = default);
IDisposable Subscribe(ConfigSubscription request, Action<ConfigChanged> handler);
void SetContext(EcsContext ctx);
Task FlushCacheAsync(CacheScope scope = CacheScope.App, CancellationToken ct = default);
EcsDiagnostics Diagnostics { get; }
}
public record GetOptions(TimeSpan? MustBeFreshWithin = null, bool PreferCache = true, bool StaleOk = false, string? ETag = null);
public record EcsContext(string TenantId, string Environment, string ApplicationId, string Edition);
public record ConfigValue<T>(T Value, string ETag, DateTimeOffset AsOf, string Source); // source: memory/persistent/network
SDK Registration
services.AddEcsClient(o =>
{
o.BaseUrl = new Uri("https://api.ecs.connectsoft.io");
o.Auth = AuthOptions.FromOidc(clientId: "...", authority: "...", scopes: new[] {"ecs.read"});
o.Transport = TransportMode.GrpcWithWebsocketFallback;
o.Cache = CacheOptions.Default with { MemoryTtl = TimeSpan.FromMinutes(5), Persistent = true };
o.Resiliency = ResiliencyProfile.Standard;
});
JavaScript / TypeScript¶
import { EcsClient, createEcsClient, GetOptions } from "@connectsoft/ecs";
const ecs: EcsClient = createEcsClient({
baseUrl: "https://api.ecs.connectsoft.io",
auth: { oidc: { clientId: "...", authority: "...", scopes: ["ecs.read"] } },
transport: { primary: "grpc-web", fallback: ["rest", "long-poll"] },
cache: { memoryTtlMs: 300000, persistent: true },
resiliency: "standard"
});
const value = await ecs.getConfig<string>("features/paywall/threshold", <GetOptions>{ mustBeFreshWithinMs: 1000 });
const unsub = ecs.subscribe({ keys: ["features/*"] }, change => {
console.log("config changed", change);
});
Mobile (MAUI / React Native)¶
- MAUI: NuGet package reuses .NET SDK; persistent store via SecureStorage + local DB.
- React Native: NPM package uses AsyncStorage/SecureStore for persistent cache; push via WS.
Caching Strategy¶
Two-tier cache with ETag-aware coherency and stampede protection.
| Layer | Backing | Default TTL | Notes |
|---|---|---|---|
| Memory | Concurrent in-proc map with version pins | 5m | Fast path; per-key locks to prevent thundering herd |
| Persistent | Indexed DB (Web) / AsyncStorage (RN) / LiteDB/SQLite (MAUI) | 24h (stale-read allowed) | Used when offline; writes are snapshots (configSet, etag, asOf) |
Get Flow (with ETag/Stampede Guard)¶
sequenceDiagram
participant App
participant SDK as EcsClient
participant Cache as Memory/Persistent
participant Svc as ECS API
App->>SDK: GetConfig(key, opts)
SDK->>Cache: TryGet(key)
alt Hit & Fresh
Cache-->>SDK: value, etag
SDK-->>App: value (source=memory)
else Miss or Stale
SDK->>SDK: Acquire per-key lock
SDK->>Svc: GET /configs/{key} If-None-Match: etag
alt 304 Not Modified
Svc-->>SDK: 304
SDK->>Cache: Touch(key)
SDK-->>App: value (source=cache)
else 200 OK
Svc-->>SDK: value, etag
SDK->>Cache: Put(key, value, etag, ttl)
SDK-->>App: value (source=network)
end
SDK->>SDK: Release lock
end
Options
- `PreferCache`: return the cached value immediately; refresh in the background.
- `MustBeFreshWithin`: ensure data age ≤ threshold; else block for refresh.
- `StaleOk`: allow stale data if the network is unavailable (offline mode).
Offline Mode¶
- Detection: transport health + browser `navigator.onLine`/platform reachability signal.
- Read: return the most recent snapshot from the persistent store if `StaleOk`, or if `MustBeFreshWithin` cannot be satisfied while offline.
- Write (Studio-authored, server-side): SDKs are read-optimized; write APIs are guarded and not available in runtime SDKs by default.
- Reconciliation: on reconnect, diff by ETag and update the persistent store; emit `ConfigChanged` events for subscribers.
flowchart LR
Offline[[Offline]] --> ReadFromPersistent
Reconnect[[Reconnect]] --> SyncETags --> UpdateMemory --> NotifySubscribers
Refresh & Invalidation¶
- Push: WS stream `ConfigChanged` (per tenant/app/keys). SDK updates memory and persistent stores; raises handlers on the UI thread (mobile) / event loop (JS).
- Long-Poll: backoff schedule with jitter; ETag aggregation checkpoint to minimize payload.
- Manual: `FlushCache(scope)` clears memory and persistent entries by `key`, `prefix`, or `app`.
Subscriber Contract
type ConfigSubscription = { keys?: string[], paths?: string[], prefix?: string };
type ConfigChanged = { key: string, newETag: string, reason: "push"|"poll"|"manual" };
Diagnostics & Telemetry¶
OpenTelemetry built-in, disabled only via explicit opt-out.
| Signal | Name | Dimensions |
|---|---|---|
| Trace span | `ecs.sdk.get_config` | tenant, app, key, cache_hit(bool), etag_present(bool), attempt |
| Metric (counter) | `ecs_cache_hits_total` | layer(memory/persistent), tenant, app |
| Metric (histogram) | `ecs_refresh_latency_ms` | transport(rest/grpc/ws), outcome |
| Metric (gauge) | `ecs_circuit_state` | service, state |
| Log | `ecs.sdk.error` | exception, httpStatus, circuitState, retryAttempt |
Redaction: keys logged with hashing; values never logged unless Diagnostics().EnableValueSampling() is set in dev only.
User-Agent: ConnectSoft-ECS-SDK/{lang}/{version} ({os};{runtime})
Feature Flags API¶
Integrated lightweight evaluation with remote rules (from ECS) + local fallbacks.
public interface IEcsFlags
{
bool IsEnabled(string flag, FlagContext? ctx = null);
T Variant<T>(string flag, FlagContext? ctx = null, T defaultValue = default!);
}
public record FlagContext(string? UserId = null, string? Region = null, IDictionary<string,string>? Traits = null);
- Data Source: `features/*` namespace within ECS.
- Evaluation: client-side deterministic hashing (sticky bucketing, see the sketch below), server-side override precedence.
- Safety: if no rules are available, fail closed for `IsEnabled` (configurable).
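A minimal sketch of client-side sticky bucketing; the hash scheme (SHA-256 of "flag:userId" mapped to 0–99) is an assumption, not the shipped algorithm.
// Deterministic bucketing: the same (flag, userId) pair always lands in the same bucket,
// so percentage rollouts are sticky across sessions and devices.
using System;
using System.Security.Cryptography;
using System.Text;

public static class StickyBucketing
{
    // Returns a stable bucket in [0, 100) for a (flag, userId) pair.
    public static int Bucket(string flag, string userId)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{flag}:{userId}"));
        uint value = BitConverter.ToUInt32(hash, 0);
        return (int)(value % 100);
    }

    public static bool IsEnabled(string flag, string userId, int rolloutPercent) =>
        Bucket(flag, userId) < rolloutPercent;
}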
Resiliency Patterns¶
Defaults provided by ResiliencyProfile.Standard (overridable per method).
| Concern | Default |
|---|---|
| Timeouts | connect: 2s, read: 1.5s, overall: 4s |
| Retries | 3 attempts, exponential backoff (100–800ms) + jitter |
| Circuit Breaker | consecutive failures ≥ 5 or failure rate ≥ 50% over 20 calls → Open 30s; Half-Open probe: 10% |
| Bulkhead | Max 64 concurrent inflight network calls per process |
| Fallback | Serve stale if available; else throw EcsUnavailableException |
Circuit State Events
Handlers may subscribe to OnCircuitChange(service, state, reason) for operational visibility.
Tenant & Edition Routing¶
- Every request carries Context headers: `x-ecs-tenant`, `x-ecs-environment`, `x-ecs-application`, `x-ecs-edition`.
- Edition shaping: SDK hides calls to non-entitled endpoints (pre-check via capability discovery at init).
- Partition-Aware: Affinity to nearest region discovered via bootstrap (DNS SRV / discovery endpoint); persisted for session.
Configuration & Bootstrapping¶
| Setting | .NET Key | JS Key | Default |
|---|---|---|---|
| Base URL | Ecs:BaseUrl | baseUrl | required |
| Transport | Ecs:Transport | transport.primary | grpc (.NET), grpc-web (web) |
| Auth | Ecs:Auth:* | auth | OIDC |
| Memory TTL | Ecs:Cache:MemoryTtlSeconds | cache.memoryTtlMs | 300 |
| Persistent | Ecs:Cache:Persistent | cache.persistent | true |
| Push | Ecs:Refresh:PushEnabled | refresh.push | true |
| Long-Poll | Ecs:Refresh:PollIntervalMs | refresh.pollIntervalMs | 30000 |
| Resiliency | Ecs:Resiliency:Profile | resiliency | standard |
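A small C# sketch showing the .NET keys above wired through Microsoft.Extensions.Configuration; only the keys come from the table, and the registration helper at the end is hypothetical.
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

var config = new ConfigurationBuilder()
    .AddInMemoryCollection(new Dictionary<string, string?>
    {
        ["Ecs:BaseUrl"] = "https://ecs.example.internal",   // required
        ["Ecs:Transport"] = "grpc",                         // default for .NET
        ["Ecs:Cache:MemoryTtlSeconds"] = "300",
        ["Ecs:Cache:Persistent"] = "true",
        ["Ecs:Refresh:PushEnabled"] = "true",
        ["Ecs:Refresh:PollIntervalMs"] = "30000",
        ["Ecs:Resiliency:Profile"] = "standard",
    })
    .Build();

// Hypothetical registration helper from the ConnectSoft.Ecs package:
// services.AddEcsClient(config.GetSection("Ecs"));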
Error Model (Surface to Callers)¶
- EcsAuthException (401/403)
- EcsNotFoundException (404 key/path)
- EcsConcurrencyException (ETag precondition failed)
- EcsUnavailableException (circuit open / transport down; Inner includes the last cause)
- EcsTimeoutException
- EcsValidationException (bad selector/context)
Each includes: traceId, requestId, correlationId, retryAfter (if applicable).
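A hedged C# sketch of handling these surfaced exceptions on a read path; the option construction mirrors the earlier usage example, while the TraceId property name and the local fallback value are assumptions.
try
{
    return await ecs.GetConfigAsync<int>("limits/maxSessions",
        new GetOptions(MustBeFreshWithin: TimeSpan.FromSeconds(1), PreferCache: true, StaleOk: true), ct);
}
catch (EcsUnavailableException ex)       // circuit open / transport down; Inner carries the last cause
{
    logger.LogWarning(ex, "ECS unavailable (traceId={TraceId}); using local default", ex.TraceId);
    return defaultMaxSessions;           // last-resort, baked-in default (illustrative)
}
catch (EcsTimeoutException)
{
    return defaultMaxSessions;           // stay within the caller's latency budget
}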
Packaging, Versioning, Governance¶
| Lang | Package | Min Runtime | SemVer |
|---|---|---|---|
| .NET | ConnectSoft.Ecs | .NET 8 | MAJOR.MINOR.PATCH |
| JS/TS | @connectsoft/ecs | ES2019+ | Same |
| RN | @connectsoft/ecs-react-native | RN 0.74+ | Same |
| MAUI | Uses .NET package | .NET 8 MAUI | Same |
- Capability Negotiation at init: server advertises supported transports and features → SDK enables/guards accordingly.
- Deprecations: compile-time [Obsolete] (.NET) / JSDoc tags (JS), plus a runtime warning with a link to migration docs.
Security Considerations¶
- Tokens stored in OS-backed secure stores (Keychain/DPAPI/Keystore).
- PII-safe telemetry; opt-in only for value sampling.
- mTLS support via client cert injection (.NET/MAUI only, enterprise edition).
- Scoped Claims enforced in gateway; SDK only forwards tokens.
Testability & Observability Hooks¶
- Deterministic clocks and pluggable time providers for cache TTL tests.
- Transport shim interface for mocking HTTP/gRPC.
- In-memory store to simulate offline for E2E.
- DiagnosticListener (.NET) / ecs.on(event, handler) (JS) to observe lifecycle events.
Example Usage Patterns¶
Config with Freshness Guard & Fallback
var config = await ecs.GetConfigAsync<int>("limits/maxSessions",
new GetOptions(MustBeFreshWithin: TimeSpan.FromSeconds(1), PreferCache: true, StaleOk: true), ct);
Feature Flag Toggle in UI
if (ecs.flags.isEnabled("ui:newOnboarding", { userId })) {
renderNewFlow();
} else {
renderClassic();
}
Subscription to Section Changes
using var sub = ecs.Subscribe(
new ConfigSubscription { Prefix = "features/" },
change => logger.LogInformation("Changed: {Key} {ETag}", change.Key, change.NewETag));
Solution Architect Notes¶
- Docs to add: per-language quickstarts, migration guide for transport changes, capability negotiation matrix by edition.
- Optional extension: pluggable policy evaluators in SDK to locally enforce tenant policy hints (read-only apps, canary % caps).
- Performance harness: include standard bench suite for cache hit/miss paths using recorded timelines from staging.
Config Studio UI — IA, Screens, Workflows, Policy Editor, Guardrails, Approvals, Audit Trails¶
Objectives¶
Deliver an admin-first Studio to author, validate, approve and publish tenant configuration safely. The Studio is a single-page app (SPA) backed by the ECS Gateway/YARP (BFF), enforcing tenant/edition/env scopes and segregation of duties with full auditability.
Information Architecture (IA)¶
flowchart LR
root[Studio Home]
Ten[Tenant Switcher]
Env[Environment Switcher]
Dash[Dashboard]
CS[Config Sets]
Items[Items]
Snaps[Snapshots & Tags]
Dep[Deployments]
Pol[Policy Editor]
App[Approvals]
Aud[Audit & Activity]
Ops[Operations: Backups/Imports]
Acc[Access & Roles]
root-->Dash
root-->CS-->Items
CS-->Snaps
Snaps-->Dep
root-->Pol
root-->App
root-->Aud
root-->Ops
root-->Acc
root-->Ten
root-->Env
Global chrome
- Tenant switcher (only tenants user is entitled to).
- Environment switcher (dev/test/staging/prod).
- Edition badge (Free/Pro/Enterprise) with effective policy overlays.
- Context pills: tenant / application / namespace / env.
Role-Based UX & Guarded Actions¶
| Role | Primary Views | Write Capabilities | Guardrails |
|---|---|---|---|
| Viewer | Dashboard, Config Sets, Snapshots, Audit | None | Cannot reveal secret values; masked view only |
| Developer | + Items, Diff, Local Preview | Draft edits, save, request approval | Schema validation must pass; breaking changes require approval |
| Approver | Approvals, Policy summaries | Approve/Reject, schedule | SoD: cannot approve own changes; change window enforcement |
| Tenant Admin | + Policies, Access, Ops | Tagging, alias pins, rollback, import/export | Two-person rule for prod; high-risk actions gated |
| Platform Admin | Cross-tenant operations | Break-glass overrides | Time-boxed, audited with ticket reference |
Key Workflows¶
Draft → Validate → Approve → Publish → Deploy → Refresh¶
sequenceDiagram
participant Dev as Developer
participant Studio as Studio UI
participant API as ECS API (BFF)
participant Policy as Policy Engine
participant Approver as Approver
participant Registry as Config Registry
participant Orchestrator as Refresh Orchestrator
Dev->>Studio: Edit draft (items)
Studio->>API: PUT /config-sets/{id}/items (draft)
Dev->>Studio: Validate
Studio->>Policy: POST /policies/validate (draft+context)
Policy-->>Studio: Result (violations, riskScore)
Dev->>Studio: Request approval
Studio->>API: POST /approvals (changeSet, riskScore)
Approver->>Studio: Review diff, policy summary
Approver->>API: POST /approvals/{req}:approve (window, notes)
API->>Registry: POST /config-sets/{id}/snapshots (Idempotency-Key)
Registry-->>Studio: Snapshot created (etag, version)
API->>Registry: POST /deployments (env, canary?)
Registry->>Orchestrator: ConfigPublished
Orchestrator-->>Studio: Deployment status & refresh signal
Decision points
- Risk scoring (see Guardrails) drives required approvals count.
- Change windows block immediate publish to prod if outside permitted window.
Screens & Routes¶
| Screen | Purpose | Route | Key Backend Calls |
|---|---|---|---|
| Dashboard | Tenant health, recent changes, pending approvals | /dashboard | GET /config-sets?limit=5, GET /approvals?status=pending |
| Config Sets | List/search sets; create | /config-sets | GET /config-sets, POST /config-sets |
| Items Editor | Edit key/values with schema hints | /config-sets/:id/items | GET /config-sets/{id}/items, PUT /items/{key}, POST /items:batch |
| Diff & Validation | Compare working vs snapshot; policy violations | /config-sets/:id/diff | POST /diff, POST /policies/validate |
| Snapshots & Tags | View versions, tag SemVer, manage aliases | /config-sets/:id/snapshots | GET /snapshots, POST /snapshots, PUT /tags, PUT /aliases/* |
| Deployments | Create/monitor deployments; canary waves | /deployments | POST /deployments, GET /deployments |
| Approvals Inbox | Review requests; approve/reject with notes | /approvals | GET /approvals, POST /approvals/{id}:approve or :reject |
| Policy Editor | Edit rules, schemas, edition overlays | /policies | GET/PUT /policies, GET/PUT /schemas |
| Audit & Activity | Tamper-evident timeline, filters, export | /audit | GET /audit?filter=..., export |
| Operations | Import/export, backups, snapshots | /ops | POST /snapshots/export, POST /import |
| Access & Roles | Assign roles; view effective permissions | /access | GET/PUT /roles, GET /whoami |
Items Editor — UX Details¶
- Tree + Table hybrid: hierarchical left nav (paths), key table with inline type detection.
- Content types: JSON (editor with schema-aware IntelliSense), YAML, Text.
- Validation overlay: inline markers (red=blocking, amber=warning).
- Secrets: write-only fields using secret references (kvref://…), never show plaintext.
- ETag awareness: the editor surfaces the server ETag; PATCH requires If-Match (see the sketch below).
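A brief C# sketch of the conditional save the editor performs; the route shape follows the Screens table (PUT shown, PATCH follows the same If-Match rule) and the payload format is an assumption.
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

static async Task<bool> TrySaveItemAsync(HttpClient http, string setId, string key,
    string updatedJson, string currentETag, CancellationToken ct)
{
    var request = new HttpRequestMessage(HttpMethod.Put, $"/config-sets/{setId}/items/{key}")
    {
        Content = new StringContent(updatedJson, Encoding.UTF8, "application/json")
    };
    request.Headers.IfMatch.Add(new EntityTagHeaderValue($"\"{currentETag}\""));   // optimistic concurrency

    var response = await http.SendAsync(request, ct);
    if (response.StatusCode == HttpStatusCode.PreconditionFailed)                  // 412: someone else saved first
        return false;                                                              // surface a conflict and re-diff in the UI

    response.EnsureSuccessStatusCode();
    return true;
}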
Policy Editor¶
Modes
- Visual Builder: conditions, targets, effects (allow/deny/transform).
- Code View: DSL/JSON with schema, auto-format & lint.
- Test Bench: input context (tenant, env, labels) → evaluate → see result and trace.
Artifacts
- JSON Schema per namespace/path; used for editor IntelliSense & validation.
- Edition Overlays: Pro/Enterprise feature gates, default limits, quotas.
- Change Windows: cron-like rules per environment.
Sample Policy (DSL excerpt)
rule: deny-breaking-changes-in-prod
when:
env: "prod"
change.severity: "breaking"
then:
deny: true
message: "Breaking changes require CAB approval"
Guardrails & Risk Scoring¶
| Guardrail | Description | Severity | Enforcement |
|---|---|---|---|
| Schema violations | JSON Schema errors | Blocking | Must fix before approval |
| Breaking change | Remove/rename keys | High | Requires 2 approvals + change window |
| Secrets handling | Plaintext secret detected | High | Block; enforce kvref usage |
| Blast radius | Affects >N services | Medium/High | Extra approver or canary required |
| Change budget | Too many changes per window | Medium | Auto-suggest canary |
| Out-of-window publish | Not in allowed window | Medium | Schedule or request override |
Risk Score = Σ(weight × finding) → Approval policy (a minimal mapping sketch follows this list):
- Score < 20 → 1 approver (owner/lead)
- 20–49 → 2 approvers (different role/team)
- ≥ 50 → CAB (admin + approver; SoD enforced)
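The score-to-gate mapping above can be expressed directly; a minimal C# sketch with thresholds from this list (type and function names are illustrative).
public enum ApprovalGate { SingleApprover, TwoApprovers, ChangeAdvisoryBoard }

public static ApprovalGate GateFor(int riskScore) => riskScore switch
{
    < 20 => ApprovalGate.SingleApprover,        // owner/lead
    < 50 => ApprovalGate.TwoApprovers,          // different role/team
    _    => ApprovalGate.ChangeAdvisoryBoard    // admin + approver; SoD enforced
};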
Approvals Model¶
- SoD: approver cannot be the author; cannot share same group if risk ≥ 20.
- Delegations: time-boxed with reason & ticket link.
- Scheduling: approval can schedule publish/deploy in a future change window.
- Evidence: approval captures diff, policy results, tests, and risk score snapshot.
Approval View
- Left: Diff summary grouped by severity.
- Right: Policy findings, checks passed/failed, required approvers checklist.
- Footer: Approve/Reject with notes; Require Canary toggle.
Audit Trails¶
Tamper-evident timeline
- Events hash-chained; show a chain integrity indicator (a minimal sketch follows this list).
- Filters: actor, resource, action, risk, status, env, time window.
- Drill-down to raw CloudEvent and outbox record.
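A compact C# sketch of the hash-chaining idea; the canonical serialization and field set are assumptions, and only the chain-and-verify pattern is shown.
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public record AuditEvent(string Type, string Actor, string PayloadJson, string PrevHash, string Hash);

public static class AuditChain
{
    public static AuditEvent Append(AuditEvent? previous, string type, string actor, string payloadJson)
    {
        var prevHash = previous?.Hash ?? new string('0', 64);                    // genesis marker
        var material = $"{prevHash}|{type}|{actor}|{payloadJson}";
        var hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(material)));
        return new AuditEvent(type, actor, payloadJson, prevHash, hash);
    }

    // Chain integrity: every event must hash to its stored value and link to its predecessor.
    public static bool Verify(IReadOnlyList<AuditEvent> chain)
    {
        for (var i = 0; i < chain.Count; i++)
        {
            var expectedPrev = i == 0 ? new string('0', 64) : chain[i - 1].Hash;
            var recomputed = Append(i == 0 ? null : chain[i - 1], chain[i].Type, chain[i].Actor, chain[i].PayloadJson);
            if (chain[i].PrevHash != expectedPrev || chain[i].Hash != recomputed.Hash) return false;
        }
        return true;
    }
}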
Exports
- CSV/Parquet; signed export manifest with hash of contents.
- Redaction presets (no tenant PII, no secret refs).
Typical events
DraftEdited, ValidationRun, ApprovalRequested/Granted/Rejected, SnapshotCreated, DeploymentStarted/Completed, RollbackInitiated, PolicyUpdated, AccessChanged.
Accessibility, Usability & Internationalization¶
- WCAG 2.1 AA: keyboard-first navigation, visible focus, ARIA roles.
- i18n: ICU message format; RTL support; date/number locale aware.
- Dark mode with contrast ≥ 7:1 for critical toasts/errors.
UI Architecture & Runtime¶
flowchart LR
SPA[Config Studio SPA]
BFF[YARP / Studio BFF]
API[Registry/Policy APIs]
BUS((Event Bus))
WS[WS Bridge]
SPA--REST/gRPC-->BFF
BFF--mTLS-->API
SPA--WS/SSE-->WS
API--CloudEvents-->BUS
WS--subscribes-->BUS
- State management: RTK Query (web) / Recoil; optimistic updates for drafts.
- Transport: gRPC-web for reads; REST for writes with Idempotency-Key.
- Security: OIDC PKCE; refresh tokens stored in secure storage; CSP locked down; SRI for static assets; iframe embedding denied.
Error Handling & Toaster Patterns¶
- Blocking errors bubble up to a Problem Details panel with traceId.
- Non-blocking validation issues remain inline with links to policy docs.
- Retry affordances for transient errors (429/503) with a backoff hint.
Telemetry & Diagnostics¶
- User actions → ecs.studio.action (create-draft, request-approval, approve, publish).
- Latency → editor validate, diff compute, publish end-to-end.
- UX health → WS reconnects, long-poll timeouts, save conflicts (412s).
- Session correlation → x-correlation-id propagated to API and CloudEvents.
Performance Targets¶
- Diff compute ≤ 300 ms for 5k-key sets (WebAssembly json-patch optional).
- Validation roundtrip ≤ 500 ms p95.
- Page initial load < 2.5 s on 3G fast, TTI < 4 s.
Acceptance Criteria (engineering hand-off)¶
- Routes and components scaffolded with guards per role.
- Items editor with schema-aware linting, secret reference input, ETag display.
- Policy editor (visual + code) + test bench with recorded contexts.
- Approvals workflow with SoD, risk scoring, change windows, and scheduling.
- Audit timeline with hash-chain indicator and export.
- End-to-end tests for: validate→approve→publish→deploy; rollback with approvals; out-of-window scheduling.
Solution Architect Notes¶
- Prefer server-driven UI metadata (schemas, overlays, limits) to keep Studio thin and edition-aware without redeploys.
- Introduce “Safe Preview” mode: materialize resolved config for a sandbox service to validate integration before publish.
- Consider conflict-free replicated drafts (CRDT) if real-time multi-editor collaboration becomes a requirement.
Policy & Governance — schema validation, rules engine (edition/tenant/env), approvals, SoD, change windows¶
Objectives¶
Define the authoritative policy system that governs how configuration is authored, validated, approved, and published across tenants, editions, and environments. Provide implementable contracts for schema validation, rules evaluation, approvals/SOD, and change windows, with deterministic enforcement points and full auditability.
Architecture Overview¶
flowchart LR
Studio[Config Studio UI]
GW[API Gateway / Envoy]
Policy[Policy Engine (PDP)]
Registry[Config Registry]
Approvals[Approvals Service]
Bus[(Event Bus)]
Studio -- validate/diff --> Policy
GW -- ext_authz --> Policy
Registry -- pre-publish validate --> Policy
Policy -- obligations --> GW
Policy -- CloudEvents --> Bus
Studio --> Approvals
Approvals --> Registry
Principles
- Policy-as-code with signed, versioned bundles.
- Deterministic evaluation order and idempotent outcomes.
- Zero-trust: every sensitive operation checked at the edge or server via PDP.
- Observability-first: decisions are explainable (inputs, rules, obligations).
Policy Artifacts¶
| Artifact | Purpose | Format | Storage |
|---|---|---|---|
| JSON Schema | Structural validation of config items/sets | JSON Schema 2020-12 | Registry (per namespace), cached |
| Rule bundles | Declarative allow/deny/obligate decisions | YAML/JSON (ECS DSL) or Rego (OPA mode) | Policy repo → Policy Engine |
| Edition Overlays | Feature gates, limits per plan | YAML/JSON | Registry/Policy |
| Approval Policies | Required approvers, SoD, thresholds | YAML | Policy Engine |
| Change Windows | Allowed publish/deploy windows | cron + TZ (IANA) | Policy Engine |
| Risk Scoring | Weighted findings → required gates | YAML | Policy Engine |
The engine supports native ECS DSL (recommended) and OPA/Rego as a compatibility mode. Only one mode enabled per deployment.
Evaluation Model¶
Inputs (decision context)
- tenantId, editionId, environment, appId, namespace
- actor { sub, roles[], groups[], mfa, isBreakGlass }
- operation (e.g., config.publish, config.modify, policy.update)
- change (diff summary: breaking/additive/neutral; counts)
- metadata (labels, tags, size, affected services)
- time (UTC instant + tenant-local TZ)
Order
- Schema Validation (blocking errors)
- Policy Rules (allow/deny + obligations)
- Edition Overlays (transform/limits)
- Approvals & SoD (may issue obligation: requires_approvals=n)
- Change Windows (may convert allow → scheduled)
- Rate & Quota Checks (edition quotas; ties to Gateway)
Outcomes
- Effect: ALLOW | DENY | ALLOW_WITH_OBLIGATIONS
- Obligations: { approvals: {required: n, roles: [...]}, scheduleAfter: <ISO8601>, requireCanary: true, maxBlastRadius: N }
- Explanation: list of matched rules & reasons.
Contracts — Policy Engine (PDP)¶
REST (gRPC analogs provided)
- POST /v1/validate → { violations[] } (JSON Schema + custom keywords)
- POST /v1/decide → { effect, obligations, explanation[] }
- POST /v1/risk → { score, findings[] }
- GET /v1/bundles/:id → policy artifact materialization (signed)
Decision request (excerpt)
{
"tenantId": "t-123",
"editionId": "pro",
"environment": "prod",
"operation": "config.publish",
"actor": { "sub": "u-9", "roles": ["developer"], "groups": ["team-billing"], "mfa": true },
"change": { "breaking": 1, "additive": 12, "neutral": 5, "affectedServices": 4 },
"time": "2025-08-25T10:00:00Z",
"tz": "Europe/Berlin",
"metadata": { "labels": ["payments","blue"] }
}
Decision response (excerpt)
{
"effect": "ALLOW_WITH_OBLIGATIONS",
"obligations": {
"approvals": { "required": 2, "roles": ["approver","tenant-admin"], "sod": true },
"requireCanary": true,
"scheduleAfter": null
},
"explanation": [
"rule:deny-breaking-without-2-approvals matched",
"rule:prod-requires-change-window matched (window active)"
]
}
Schema Validation¶
- JSON Schema 2020-12 per namespace/path with $ref composition.
- Custom keywords:
  - x-ecs-secretRef: true → value must be kvref://...
  - x-ecs-maxBlastRadius: <int> → used in risk scoring
  - x-ecs-deprecated: "message" → warning surface only
- Where enforced: Studio save, CI import, Registry publish, Adapter sync.
- Performance target: ≤ 200 ms p95 for 5k-key set validation.
Example (snippet)
{
"$id": "https://schemas.connectsoft.io/ecs/db.json",
"type": "object",
"properties": {
"connectionString": { "type": "string", "x-ecs-secretRef": true },
"poolSize": { "type": "integer", "minimum": 1, "maximum": 200 }
},
"required": ["connectionString"]
}
Rules Engine (ECS DSL)¶
Structure
apiVersion: ecs.policy/v1
kind: RuleSet
metadata: { name: tenant-prod-guardrails, tenantId: t-123 }
spec:
rules:
- name: deny-plaintext-secrets
when: op == "config.publish" && env == "prod" && any(change.paths, . endsWith "connectionString" && !isSecretRef(.value))
effect: DENY
message: "Secrets must be kvref:// references"
- name: breaking-needs-2-approvals
when: op == "config.publish" && env == "prod" && change.breaking > 0
effect: ALLOW_WITH_OBLIGATIONS
obligations:
approvals.required: 2
approvals.roles: ["approver","tenant-admin"]
approvals.sod: true
requireCanary: true
- name: pro-plan-limits
when: edition == "pro"
effect: ALLOW_WITH_OBLIGATIONS
obligations:
quotas.maxConfigKeys: 10000
quotas.maxRpsRead: 600
- name: prod-change-window
when: op in ["config.publish","deploy.start"] && env == "prod" && !withinWindow("sat-22:00..sun-04:00", tz)
effect: ALLOW_WITH_OBLIGATIONS
obligations:
scheduleAfter: nextWindow("sat-22:00..sun-04:00", tz)
Operators
- Logical: && || !
- Comparators: == != > >= < <= in
- Helpers: any()/all(), withinWindow(), nextWindow(), isSecretRef()
- Context vars: op, env, edition, tenantId, labels[], change.*
Edition/Tenant/Environment Overlays¶
Precedence
global < edition < tenant < environment < namespace
- Edition gates: feature toggles, quotas (RPS/keys/TTL), security posture (mTLS required).
- Tenant customizations: additional controls (SoD strictness, approver roles).
- Environment overrides: windows, rollout strategies.
- Namespace: schema variants and local guardrails.
Bundles compiled into a single effective ruleset per (tenant, environment) at evaluation time; cached by PDP.
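A minimal C# sketch of folding overlays in the precedence order above; the dictionary shapes are illustrative, since the real compiler works on rule bundles rather than raw key/value maps.
using System.Collections.Generic;

// Later layers win: global < edition < tenant < environment < namespace.
public static IReadOnlyDictionary<string, string> CompileEffective(
    params IReadOnlyDictionary<string, string>[] layersInPrecedenceOrder)
{
    var effective = new Dictionary<string, string>();
    foreach (var layer in layersInPrecedenceOrder)      // pass layers from global → namespace
        foreach (var (key, value) in layer)
            effective[key] = value;                     // higher-precedence layer overrides
    return effective;
}

// Usage: CompileEffective(globalDefaults, editionOverlay, tenantOverlay, envOverlay, namespaceOverlay);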
Approvals & Segregation of Duties (SoD)¶
Model
- Approval objects created when obligations require it.
- Approver pools resolved from roles/groups; SoD enforces author ≠ approver; if riskScore ≥ threshold, approvers must come from distinct groups.
- Escalation: time-boxed delegation possible (delegatedTo, expiresAt, ticket).
- Evidence: risk report, diff patch, and validation output embedded.
Approval Matrix (default)
| Condition | Required Approvals | SoD | Notes |
|---|---|---|---|
| Non-prod, non-breaking | 0 | — | Immediate publish |
| Prod, additive only | 1 | ✓ | Approver role |
| Prod, breaking | 2 | ✓✓ | Approver + Tenant Admin; canary required |
| High blast radius (> N services) | 2 | ✓✓ | Or schedule inside window |
| Emergency (break-glass) | 1 (Platform Admin) | — | Auto-create post-incident review task |
Change Windows¶
- Defined per tenant/environment using CRON+TZ or range expressions.
- PDP helper withinWindow(window, tz) respects IANA time zones and DST.
- Enforcement: outside the window, decisions add a scheduleAfter obligation; a Platform Admin can override with isBreakGlass=true (audited).
Examples
windows:
prod:
- name: weekend-window
cron: "0 22 * * SAT" # start
duration: "6h"
tz: "Europe/Berlin"
- name: freeze
range: "2025-12-20T00:00..2026-01-05T23:59"
allow: false
Risk Scoring¶
Findings → score (weights configurable)
- Schema violations: blocking (no score)
- Breaking changes: +30
- Secrets change count: +5 each (cap 30)
- Blast radius (services affected): +1 each (cap 20)
- Out-of-window: +10
- Author role (developer vs admin): +0 / −5 (experience credit)
- Test coverage signal (from CI): −10 if integration tests pass
Thresholds
- < 20 → no approval
- 20–49 → 1 approval
- ≥ 50 → 2 approvals + canary
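A small C# sketch of the weighted scoring above; the weights and caps mirror the listed defaults, the names are illustrative, and ambiguous items (flat +30 for any breaking change, −5 credit for admin authors) are interpretations.
using System;

public record RiskFindings(int BreakingChanges, int SecretsChanged, int ServicesAffected,
                           bool OutOfWindow, bool AuthorIsAdmin, bool IntegrationTestsPassed);

public static int RiskScore(RiskFindings f)
{
    var score = 0;
    if (f.BreakingChanges > 0) score += 30;                       // interpreted as a flat +30
    score += Math.Min(f.SecretsChanged * 5, 30);                  // +5 each, cap 30
    score += Math.Min(f.ServicesAffected, 20);                    // +1 each, cap 20
    if (f.OutOfWindow) score += 10;
    if (f.AuthorIsAdmin) score -= 5;                              // experience credit (interpretation)
    if (f.IntegrationTestsPassed) score -= 10;                    // CI signal
    return Math.Max(score, 0);
}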
Enforcement Points¶
| Point | What is enforced | How |
|---|---|---|
| Gateway (ext_authz) | AuthZ decision for admin APIs; edition quotas | POST /v1/decide with operation mapping (config.modify, config.delete) |
| Registry (pre-publish) | Schema + rules for publish/rollback | validate + decide with operation=config.publish |
| Studio (UX) | Early validation, approver hints | validate + risk for live feedback |
| Approvals Service | Satisfy obligations | Enforce SoD, windows, delegations |
| Adapter Hub | Provider sync guardrails | decide(operation=adapter.sync) for tenants with extra controls |
Storage & Versioning¶
- PolicyBundle(id, version, hash, signedBy, createdAt, scope={tenant|global})
- EffectivePolicy(tenantId, env, bundleId, compiledHash, etag) — hot cache in PDP
- Audit every decision, including inputs, effect, obligations, explanation, duration (ms)
- Signing with Sigstore or KMS-backed JWS; PDP rejects unsigned or stale bundles.
Observability & Audit¶
- Metrics: pdp.decisions_total{effect}, pdp.duration_ms (hist), pdp.cache_hit_ratio, approvals.pending, windows.scheduled_ops
- Logs: structured with decisionId, tenantId, operation, rulesMatched[]
- CloudEvents:
  - ecs.policy.v1.PolicyUpdated
  - ecs.policy.v1.DecisionRendered (optional sampling)
  - ecs.policy.v1.ApprovalRequired/Granted/Rejected
  - ecs.policy.v1.WindowScheduleCreated
Performance & Availability Targets¶
- PDP decision p95 < 5 ms (cached), p99 < 20 ms (cold compile)
- Validate+decide in publish flow < 150 ms p95 for 5k-key diff
- PDP HA: 3 replicas/region, sticky by (tenantId, env); warm compiled cache on deploy
- Cache TTL: bundle cache 60 s; revalidate on PolicyUpdated event
Test Matrix (conformance)¶
| Area | Scenario | Expectation |
|---|---|---|
| Schema | invalid secret ref | validate fails, code SCHEMA.SECRET_REF |
| Rules | breaking change prod | ALLOW_WITH_OBLIGATIONS, approvals=2, canary=true |
| Windows | publish outside window | obligation scheduleAfter != null |
| SoD | author approves own change | rejection SOD.VIOLATION |
| Edition | pro tier quotas exceeded | DENY, code QUOTA.EXCEEDED |
| Risk | high blast radius | score ≥ 50 → 2 approvals |
Acceptance Criteria (engineering hand-off)¶
- PDP service with /validate, /decide, /risk implemented (REST + gRPC), OTEL enabled.
- DSL parser & evaluator with helpers (withinWindow, isSecretRef, nextWindow).
- Bundle signer/loader + hot-reload on PolicyUpdated.
- Gateway ext_authz integration mapping route → operation.
- Approvals service honoring obligations with SoD and scheduling; Studio UI surfaces requirements.
- End-to-end tests for publish with approvals, out-of-window scheduling, and rollback governance.
Solution Architect Notes¶
- Choose ECS DSL as primary for readability; keep OPA compatibility flag for enterprises with existing Rego policies.
- Define a policy pack per edition and allow tenant overrides only by additive (stricter) rules, never weakening base controls.
- Add a simulation mode in Studio: run decide against a proposed change to preview approvals/windows before requesting them.
Security Architecture — data protection, secrets, KMS, key rotation, multi-tenant isolation, threat model¶
Objectives¶
Establish a defense-in-depth security design for ECS aligned to ConnectSoft’s Security-First, Compliance-by-Design approach. This section operationalizes data protection, secrets/KMS, key rotation, multi-tenant isolation, and a threat model that maps to concrete controls, runbooks, and acceptance criteria.
Trust Boundaries & Security Plan¶
flowchart LR
subgraph Internet
User[Studio Users]
SDKs[SDKs/Services]
end
WAF[WAF/DoS Shield]
Envoy[API Gateway (JWT/JWKS, RLS, ext_authz)]
YARP[YARP (BFF/Internal Gateway)]
subgraph Core[ECS Services]
REG[Config Registry]
POL[Policy Engine (PDP)]
ORC[Refresh Orchestrator]
HUB[Provider Adapter Hub]
end
CRDB[(CockroachDB)]
REDIS[(Redis Cluster)]
BUS[(Event Bus)]
KMS[(Cloud KMS/HSM)]
VAULT[(Secret Stores: KeyVault/SecretsManager)]
LOGS[(Audit/Telemetry Store)]
User-->WAF-->Envoy-->YARP-->REG & POL & ORC
SDKs-->WAF-->Envoy
REG--mTLS-->CRDB
REG--mTLS-->REDIS
ORC--mTLS-->BUS
HUB--mTLS-->Providers[(Azure/AWS/Consul/Redis/SQL)]
REG--sign/enc-->LOGS
Services--KEK/DEK ops-->KMS
Services--secret refs-->VAULT
Controls by boundary
- Edge: WAF + IP reputation, geo & residency gate, TLS 1.2+, JWT validation (aud/iss/exp/nbf), rate limits.
- Service mesh: mTLS (SPIFFE/SPIRE or workload identity), least-privilege network policies.
- Data: encryption at rest, envelope encryption for sensitive blobs, append-only audits, tamper-evident chains.
Data Classification & Protection¶
| Class | Examples | Storage | At Rest | In Transit | Additional |
|---|---|---|---|---|---|
| Public | API docs, OpenAPI | Object store/CDN | Provider default | TLS | CSP/SRI |
| Internal | Metrics, non-PII logs | LOGS | Disk enc. + log signing (optional) | TLS | PII redaction |
| Confidential | Config values (non-secret), diffs | CRDB JSONB | TDE/cluster enc | mTLS | ETag/ETag-hash only |
| Restricted | Secret refs, tokens, credentials | Vault/KMS | KMS-backed | TLS/mTLS | No plaintext in DB |
| Audit-critical | Approvals, policy decisions | CRDB (append-only) | Enc + hash-chain | TLS/mTLS | WORM export optional |
Policy: No plaintext secrets are persisted in Snapshots; only kvref URIs (see below).
Secrets Architecture¶
Secret References (kvref://)¶
- Format: kvref://{provider}/{path}#version?opts
  - Example: kvref://vault/billing/db-conn#v3
- Resolved at read time by Resolver/SDK with caller’s auth context; never materialized into snapshots.
- Secret providers: Azure Key Vault, AWS Secrets Manager, optional HashiCorp Vault.
Resolution flow
- Registry materializes canonical config with secret refs.
- SDK/Resolver sees kvref://… keys and fetches values from the provider using short-lived credentials (workload identity or token exchange).
- Values are cached in SDK memory only (no persistent secret caching); TTL comes from the provider.
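A C# sketch of parsing a kvref reference before handing it to a provider client; the ISecretProvider abstraction is hypothetical, the URI format follows the definition above, and opts handling is omitted.
using System;
using System.Threading;
using System.Threading.Tasks;

public record SecretRef(string Provider, string Path, string? Version);

public static class KvRef
{
    // kvref://{provider}/{path}#version  e.g. kvref://vault/billing/db-conn#v3  (opts not handled here)
    public static SecretRef Parse(string reference)
    {
        var uri = new Uri(reference);
        if (uri.Scheme != "kvref") throw new FormatException("Not a kvref reference");

        var version = string.IsNullOrEmpty(uri.Fragment) ? null : uri.Fragment.TrimStart('#');
        return new SecretRef(uri.Host, uri.AbsolutePath.TrimStart('/'), version);
    }
}

public interface ISecretProvider        // hypothetical facade over Key Vault / Secrets Manager / Vault
{
    Task<string> GetSecretAsync(string path, string? version, CancellationToken ct);
}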
Guardrails
- Schema keyword x-ecs-secretRef: true is required for fields like connectionString.
- Studio blocks plaintext entry and provides a secret picker UI.
Key Management (KMS/HSM) & Crypto Posture¶
Key Hierarchy (Envelope Encryption)¶
graph TD
CMK[Root Customer Managed Key (per region)] -->|wrap/unwrap| TMK[Tenant Master Key (per tenant)]
TMK -->|wrap| DEK[Data Encryption Keys (resource-scoped)]
DEK -->|encrypt| SENS[Sensitive blobs (e.g., idempotency records with PII hints, optional)]
- CMK: Regional KMS/HSM (e.g., Azure Key Vault Managed HSM / AWS KMS CMK).
- TMK: Derived per tenant; rotatable without re-encrypting data (envelope rewrap).
- DEK: Per table/feature or per artifact class; rotated automatically on cadence.
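A C# sketch of the DEK layer using AES-GCM; the KMS wrap/unwrap call is abstracted behind a hypothetical interface since the real operation goes to Key Vault / KMS.
using System.Security.Cryptography;

public interface IKeyWrapper                      // hypothetical facade over KMS/HSM wrap-unwrap
{
    byte[] WrapDek(byte[] dek);                   // encrypt the DEK under the tenant master key (TMK)
    byte[] UnwrapDek(byte[] wrappedDek);
}

public static class EnvelopeCrypto
{
    public static (byte[] WrappedDek, byte[] Nonce, byte[] Ciphertext, byte[] Tag) Encrypt(
        IKeyWrapper kms, byte[] plaintext)
    {
        var dek = RandomNumberGenerator.GetBytes(32);             // fresh 256-bit data key per blob
        var nonce = RandomNumberGenerator.GetBytes(12);
        var ciphertext = new byte[plaintext.Length];
        var tag = new byte[16];

        using var aes = new AesGcm(dek, tagSizeInBytes: 16);
        aes.Encrypt(nonce, plaintext, ciphertext, tag);

        return (kms.WrapDek(dek), nonce, ciphertext, tag);        // only the wrapped DEK is persisted
    }

    public static byte[] Decrypt(IKeyWrapper kms, byte[] wrappedDek, byte[] nonce, byte[] ciphertext, byte[] tag)
    {
        var dek = kms.UnwrapDek(wrappedDek);
        var plaintext = new byte[ciphertext.Length];
        using var aes = new AesGcm(dek, tagSizeInBytes: 16);
        aes.Decrypt(nonce, ciphertext, tag, plaintext);
        return plaintext;
    }
}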
Cryptographic Controls¶
- TLS 1.2/1.3, strong ciphers; OCSP stapling at edge.
- Hashing: SHA-256 for ETags and lineage hashes; HMAC-SHA256 for value integrity in Redis.
- Optional: JWS signing of Snapshots/Audit exports (future toggle).
Rotation & Credential Hygiene¶
| Asset | Rotation | Method | Blast Radius Control |
|---|---|---|---|
| JWKS (IdP keys) | 90 days | Add new kid, overlap 24h, retire old | Edge monitors jwt_authn_failed |
| API Keys (webhooks) | 90 days | Dual-key period; HMAC header sha256= | Per-tenant scope |
| KMS CMK | Yearly | Automatic rotation | Rewrap TMKs opportunistically |
| TMK (per tenant) | 180–365 days | Rewrap DEKs in background | Tenant-scoped jobs |
| Service creds (bus/redis/sql) | 90 days | Workload identity preferred; otherwise secret rotation + pod bounce | Per namespace |
| Redis ACLs | 90 days | Dual ACL users; rotate, then drop | Per region |
| Adapter bindings | 90 days | Lease-based CredentialEnvelope.not_after | Auto-rebind |
Runbook excerpts
- Key rollover: publish new JWKS; canary Envoy; monitor; retire old.
- TMK rewrap: schedule low-traffic window; checkpoint progress; pause tenant writes if needed (rare).
Multi-Tenant Isolation¶
Data Plane¶
- CRDB: REGIONAL BY ROW with tenant_id in the PK; repository guards enforce tenant scoping.
- Redis: key prefixes ecs:{tenant}:{env}:…; ACLs restrict prefixes per tenant for dedicated caches.
- Events: CloudEvents carry a tenantid extension; per-tenant topics/partitions avoid cross-talk.
Control Plane¶
- JWT claims: tenant_id, scopes; ext_authz checks at Envoy; PDP obligations enforce edition limits.
- Studio: SoD & role checks; no cross-tenant UI elements.
Compute & Network¶
- K8s namespaces per environment; NetworkPolicies deny east-west by default; adapters run out-of-process with mTLS.
- SPIFFE identities: spiffe://ecs/{service}/{env}; RBAC by SPIFFE ID for inter-service calls.
Isolation Tests (every release)¶
- Negative tests: attempt cross-tenant read/write via forged headers → expect 403.
- Timing analysis: ensure response timing doesn’t leak tenant existence.
- Event leakage: subscribe to foreign tenant topic → no messages.
Threat Model (STRIDE→Controls)¶
| Threat | Example | Control (Design) | Detective |
|---|---|---|---|
| **S**poofing | Forged JWT | Envoy JWT verify (aud/iss/kid); mTLS internal; token exchange with act chain | AuthN failure metrics, anomaly IP alerts |
| **T**ampering | Modify snapshot content | Append-only tables; ETag content hash; optional JWS signatures; change requires new version | Hash verification on read; audit hash-chain check |
| **R**epudiation | “I didn’t approve” | Signed decisions; SoD; immutable audit with actor sub@iss; IP/device fingerprint | Audit dashboards, integrity checks |
| **I**nformation disclosure | Secret leakage in logs | Secret refs only; structured logging with redaction; no secrets persisted | PII/secret scanners in CI and runtime |
| **D**oS | Resolve flood or publish storm | RLS per tenant; SDK backoff; long-poll waiter caps; cache SWR; circuit breakers | 429/latency SLO alerts, waiter pool saturation alarms |
| **E**levation of privilege | Dev acts as admin | Role->scope mapping; PDP policies; break-glass with time-box + extra audit | Privileged action audit, approval anomalies |
| Supply chain | Image tamper | SBOM + cosign signing; admission policy; CVE scanning; base images pinned | Image signature verification, scanner alerts |
| SSRF via Adapters | Malicious provider URL | Allowlist provider endpoints; egress policies; input validation & timeouts | Egress deny logs |
| Replay | Replayed webhook/API calls | HMAC with timestamp; Idempotency-Key for POST; short clock skew | Replay detector metrics |
| Data residency | EU data in US | Regional routing; crdb_region pin; residency policy at PDP | Residency audit reports |
Application-Level Security Controls¶
- Input validation: JSON Schema enforcement server-side; size caps; content-type allowlist.
- Output encoding/CORS: locked origins for Studio; headers X-Content-Type-Options, X-Frame-Options, Referrer-Policy.
- Least-privilege IAM: service principals scoped to exact resources; adapter credentials per binding.
- Secure defaults: TLS required; downgraded HTTP traffic rejected; SameSite cookies (Lax or Strict) for Studio.
Observability & Evidence¶
- OTEL everywhere; security spans tag tenantId, actor, decisionId, effect.
- Security metrics: authz_denied_total, jwt_authn_failed_total, rate_limited_total, policy_decision_ms, secret_ref_resolution_ms.
- Audit exports: signed/hashed; scheduled to WORM storage if enabled.
Incident Response (IR) & Playbooks¶
Key compromise (tenant TMK)
- Quarantine tenant (PDP deny writes; read-only).
- Rotate TMK; rewrap DEKs; re-issue creds.
- Force SDK token renewal; invalidate Redis keys; emit tenant-wide RefreshRequested.
- Forensics: export audit; notify tenant.
Web token abuse
- Revoke client app; rotate JWKS if issuer compromised.
- Invalidate refresh tokens; raise auth challenge requirements (MFA).
- Search logs for sub anomalies; notify the tenant.
Data exfil suspicion
- Activate access freeze policy; export and verify audit hash chain.
- Run residency check; compare CRDB region distribution.
Compliance Hooks¶
- Access reviews: quarterly role attestations; report who holds tenant.admin / platform-admin.
- Data subject requests (DSAR): the audit index supports by-tenant export.
- Pen-testing cadence: semi-annual + post-major release; findings tracked in risk register.
- Third-party adapters: vendor security questionnaire + sandbox tenancy.
Acceptance Criteria (engineering hand-off)¶
- mTLS enabled service-to-service (SPIFFE) with policy-enforced identities.
- Secret refs (kvref://) enforced by schema; resolver libraries for .NET/JS/Mobile implemented.
- Rotation jobs & runbooks codified (JWKS, TMK, service creds); canary + rollback procedures.
- Tenant isolation tests automated in CI; event leakage tests in staging.
- WAF, RLS, ext_authz configured; security dashboards and alerts live.
- SBOM generation and image signing required in CI; admission policy verifies signatures.
Solution Architect Notes¶
- Evaluate confidential computing (TEE) for server-side resolution of secrets for high-assurance tenants.
- Consider per-tenant Redis in Enterprise for stronger isolation at cost increase.
- Add JWS snapshot signing behind a feature flag to strengthen non-repudiation.
- Plan red team exercise focused on adapter SSRF and multi-tenant boundary bypass.
Observability — OTEL spans/metrics/logs, trace taxonomy, dashboards, SLOs/SLIs/alerts¶
Objectives¶
Deliver an observability-first design that provides end-to-end traces, actionable metrics, and structured logs across ECS services (Gateway, YARP, Registry, Policy Engine, Refresh Orchestrator, Adapter Hub, SDKs). Define trace taxonomy, metrics catalogs, dashboards, and SLOs/SLIs/alerts with multi-tenant visibility and low-cardinality guardrails.
Architecture & Dataflow¶
flowchart LR
subgraph Client
SDKs[SDKs / Studio]
end
WAF[WAF/DoS]
Envoy[Envoy (Edge)]
YARP[YARP (BFF)]
REG[Config Registry]
PDP[Policy Engine (PDP)]
ORC[Refresh Orchestrator]
HUB[Provider Adapter Hub]
CRDB[(CockroachDB)]
REDIS[(Redis)]
BUS[(Event Bus)]
COL[OTEL Collector]
TS[(Time-series DB)]
LS[(Log Store/Index)]
TMS[(Tracing DB)]
APM[Dashboards/Alerting]
SDKs-->WAF-->Envoy-->YARP-->REG
REG-->PDP
REG-->REDIS
REG-->CRDB
REG-->BUS
ORC-->BUS
HUB-->Providers[(External Providers)]
HUB-->BUS
Envoy--OTLP-->COL
YARP--OTLP-->COL
REG--OTLP-->COL
PDP--OTLP-->COL
ORC--OTLP-->COL
HUB--OTLP-->COL
COL-->TS
COL-->LS
COL-->TMS
APM---TS
APM---TMS
APM---LS
Collector topology
- DaemonSet + Gateway collectors; tail-based sampling at gateway.
- Pipelines: traces (OTLP → sampler → TMS), metrics (OTLP → TS), logs (OTLP/OTLP-logs → LS).
- Exemplars: latency histograms carry trace IDs to deep-link from metrics to traces.
Trace Taxonomy & Propagation¶
Naming (verb-noun, dot-scoped)¶
| Layer | Span name | Kind |
|---|---|---|
| Edge | ecs.http.server (Envoy) | SERVER |
| BFF | ecs.http.proxy (YARP routeId) | SERVER/CLIENT |
| Registry | ecs.config.resolve, ecs.config.publish, ecs.config.diff | SERVER |
| PDP | ecs.policy.decide, ecs.policy.validate | SERVER |
| Orchestrator | ecs.refresh.fanout, ecs.refresh.invalidate | INTERNAL |
| Adapter Hub | ecs.adapter.put, ecs.adapter.watch | CLIENT/SERVER |
| DB | db.sql.query (CRDB) | CLIENT |
| Redis | cache.get, cache.set, cache.lock | CLIENT |
| Bus | messaging.publish, messaging.process | PRODUCER/CONSUMER |
| SDK | ecs.sdk.resolve, ecs.sdk.subscribe | CLIENT |
Required attributes (consistent keys)¶
- tenant.id, env.name, app.id, edition.id, plan.tier
- route, http.method, http.status_code, grpc.code
- user.sub (hashed), actor.type (user|machine)
- config.path or config.set_id, etag, version
- policy.effect (allow|deny|obligate), policy.rules_matched
- cache.hit (true|false|swr), cache.layer (memory|redis)
- db.statement_sanitized (true)
- rls.result (ok|throttled)
- error.type, error.code
Context propagation¶
- W3C Trace Context: traceparent + tracestate
- Baggage (limited): tenant.id, env.name, app.id (≤ 3 keys)
- CloudEvents: include the traceparent extension; correlationid mirrors the trace ID for non-OTLP consumers.
- Headers surfaced to services: x-correlation-id (stable), traceparent (for logs).
Sampling¶
- Head sampling default 10% for healthy traffic (dynamic).
- Tail-based rules (max 100%):
  - http.status >= 500 or grpc.code != OK
  - Latency above p95 per route
  - policy.effect != allow
  - route in [/deployments, /snapshots, /refresh]
  - Tenant allowlist (troubleshooting sessions)
Metrics Catalog (SLI-ready, low-cardinality)¶
Dimensions used widely: tenant.id (hashed), env.name, region, route|rpc, result; avoid key/path labels.
Edge & Gateway¶
- http_requests_total{route,code,tenant} (counter)
- http_request_duration_ms{route} (histogram with exemplars)
- ratelimit_decisions_total{tenant,result} (counter)
Registry / Resolve path¶
- resolve_requests_total{result} (counter: hit|miss|304|200)
- resolve_duration_ms{result} (histogram)
- cache_hits_total{layer} (counter)
- waiter_pool_size (gauge), coalesce_ratio (gauge)
- etag_mismatch_total (counter)
Publish / Versioning¶
- publish_total{result} (counter)
- publish_duration_ms (histogram)
- diff_duration_ms (histogram), snapshot_size_bytes (histogram)
- policy_block_total{reason} (counter)
PDP¶
- pdp_decisions_total{effect} (counter)
- pdp_duration_ms (histogram)
- pdp_cache_hit_ratio (gauge)
Refresh Orchestrator & Eventing¶
- refresh_events_total{type} (counter)
- propagation_lag_ms (histogram: publish→first 200 resolve)
- ws_connections{tenant} (gauge), ws_reconnects_total (counter)
- dlq_messages{queue} (gauge), replay_total{result} (counter)
Adapters¶
- adapter_ops_total{provider,op,result} (counter)
- adapter_op_duration_ms{provider,op} (histogram)
- adapter_throttle_total{provider} (counter)
- adapter_watch_gaps_total{provider} (counter)
Data stores¶
- CRDB: sql_qps, txn_restarts_total, replica_leaseholders_ratio, kv_raft_commit_latency_ms
- Redis: hits, misses, latency_ms, evictions, blocked_clients
Logs (structured, privacy-safe)¶
Envelope (JSON)¶
{
"ts":"2025-08-25T10:21:33.410Z",
"sev":"INFO",
"svc":"registry",
"msg":"Resolve completed",
"trace_id":"d1f0...",
"span_id":"a21e...",
"corr_id":"c-77f2...",
"tenant.id":"t-***5e",
"env.name":"prod",
"route":"/api/v1/resolve",
"duration_ms":42,
"cache.hit":true,
"etag":"9oM1hQ...",
"user.sub_hash":"u-***9a",
"code":"OK",
"extra":{"coalesced":3}
}
Redaction & hygiene
- Never log config values or secrets; keys hashed when necessary.
- PII hashed/salted; toggle dev sampling only in non-prod.
- Enforce log schema via collector processors; drop fields not in schema.
Dashboards (role-focused)¶
1) SRE — “ECS Golden Signals”¶
- Edge: request rate, error %, p95 latency by route, 429 rate.
- Resolve: hit ratio (memory/redis), resolve p95, waiter size/coalesce ratio.
- Propagation: publish→first resolve lag p95/p99, WS connections, DLQ size.
- DB/Cache: CRDB restarts & raft latency, Redis latency & evictions.
- Burn-rate tiles for SLOs (see below) with auto-links to exemplar traces.
2) Product/Platform — “Tenant Health”¶
- Per-tenant overview: usage vs quota, top endpoints, rate-limit hits, policy denials.
- Drill-down: deployment success rate, change lead time (draft→publish).
3) Security — “Policy & Access”¶
- PDP decision mix, denied reasons, break-glass actions, secret-ref validation failures.
- Geo & residency conformance (data access by region).
4) Adapters & Eventing¶
- Provider op success/latency by adapter, throttle events, watch gaps.
- Event bus throughput, consumer lag, DLQ trends, replay outcomes.
SLOs, SLIs & Alerts¶
Service SLOs (initial targets)¶
| Service/Path | SLI | SLO (rolling 30d) | Error budget |
|---|---|---|---|
| Resolve (SDK) | Availability = 1 − (5xx + network errors) / all | 99.95% | 21 m/mo |
| Resolve (SDK) | Latency p95 (in-region) | ≤ 250 ms | N/A |
| Publish (Studio/API) | Success rate for publish operations | 99.5% | 3.6 h/mo |
| Publish (Studio/API) | Publish → first client refresh p95 | ≤ 5 s (Pro/Ent), ≤ 15 s (Starter) | N/A |
| PDP | Decision p95 | ≤ 5 ms | N/A |
| WS Bridge | Connection stability (drop rate per hour) | ≤ 1% | N/A |
Definitions
- Propagation lag = time from ConfigPublished being accepted to the first successful client Resolve (200/304) with the new ETag, per tenant/env.
Multi-window burn-rate alerts (error-budget based)¶
- Page: BR ≥ 14 over 5m and ≥ 7 over 1h
- Warn: BR ≥ 2 over 6h and ≥ 1 over 24h
Symptom alerts (key signals)
- Resolve p95 > 300 ms (10m)
- Rate-limit 429% > 2% (5m) for any tenant
- Propagation lag p95 > 8 s (15m) or DLQ size > 100 for 10m
- PDP decision p99 > 20 ms (10m)
- Redis evictions > 0 (5m) sustained
All alerts annotate current incidents with links to exemplar traces and last deployments.
Collector & Policy (reference snippets)¶
Tail-based sampling (OTEL Collector)¶
receivers:
otlp: { protocols: { http: {}, grpc: {} } }
processors:
attributes:
actions:
- key: tenant.id
action: hash
tail_sampling:
decision_wait: 5s
policies:
- name: errors
type: status_code
status_codes: {status_codes: [ERROR]}
- name: long-latency
type: latency
latency: {threshold_ms: 250}
- name: important-routes
type: string_attribute
string_attribute: {key: http.target, values: ["/deployments", "/snapshots", "/refresh"]}
exporters:
otlphttp/traces: { endpoint: "https://trace.example/otlp" }
prometheus: { endpoint: "0.0.0.0:9464" }
loki: { endpoint: "https://logs.example/loki/api/v1/push" }
service:
pipelines:
traces: { receivers: [otlp], processors: [attributes, tail_sampling], exporters: [otlphttp/traces] }
metrics: { receivers: [otlp], processors: [], exporters: [prometheus] }
logs: { receivers: [otlp], processors: [attributes], exporters: [loki] }
Correlation & Troubleshooting Playbooks¶
- Start from SLO tiles → click exemplar to open failing trace.
- From trace: check policy.decide span (effect & rules), cache layer tags, db/sql timings.
- If 429s spike: inspect ratelimit_decisions_total by tenant; apply temporary override or identify hot route.
- Propagation issues: check DLQ and refresh.fanout spans; verify Redis latency; perform targeted replay.
- Regional anomalies: compare leaseholder distribution and raft commit latency.
Cardinality & Cost Guardrails¶
- Disallow high-cardinality labels: raw key/path, user IDs, stack traces in metrics.
- Hash tenant IDs and user subjects; limit unique operation names.
- Log retention: 14d (non-audit) default; 30–365d for audit in dedicated store.
- Metrics retention: 13 months downsampled; traces 7–14d with error traces kept longer.
Acceptance Criteria (engineering hand-off)¶
- OTEL SDKs enabled in all services and client SDKs (.NET/JS/Mobile) with required attributes.
- Collector pipelines deployed (daemonset + gateway) with tail-based sampling and attribute hashing.
- Metrics emitted per catalog; histograms with exemplars enabled.
- Structured JSON log schema enforced; PII/secret redaction verified.
- Dashboards delivered for SRE, Tenant Health, Security, Adapters/Eventing.
- SLOs encoded in monitoring system with burn-rate alert policies and on-call rotations.
- Runbooks published for 429 spikes, propagation lag, DLQ growth, PDP degradation.
Solution Architect Notes¶
- Consider exemplars+RUM in Studio to tie user actions to backend traces.
- Add feature-flagged debug sampling per tenant (temporary, auto-expires) to diagnose issues without global sampling changes.
- Revisit metrics cardinality quarterly; adopt RED/USE dashboards for each microservice as standard templates.
Performance & Capacity — read-heavy scaling plan, Redis sizing, p99 targets, load profiles, KEDA/HPA¶
Objectives¶
Design a read-heavy, latency-bounded capacity plan that meets ECS SLOs under bursty, multi-tenant loads. This section specifies p99 targets, traffic models, Redis sizing, and autoscaling policies (KEDA/HPA) for Registry/Resolve APIs, Refresh Orchestrator, WS/Long-Poll bridges, and Adapter Hub.
Targets & Guardrails¶
| Path | Region | p95 | p99 | Notes |
|---|---|---|---|---|
| Resolve (cache hit) | in-region | ≤ 50 ms | ≤ 150 ms | CPU-bound; Redis round-trip avoided |
| Resolve (cache miss) | in-region | ≤ 200 ms | ≤ 400 ms | Redis hit + CRDB read-through |
| Long-Poll wake→200/304 | in-region | ≤ 1.5 s | ≤ 3 s | From publish to client response |
| WS push→client revalidate | in-region | ≤ 2.5 s | ≤ 5 s | Aligns with propagation SLOs |
| Publish (snapshot→accepted) | in-region | ≤ 1.0 s | ≤ 2.0 s | Excludes canaries/approvals |
Error budgets inherit from Observability cycle: Resolve availability ≥ 99.95%.
Traffic Modeling (read-heavy)¶
Let:
- N_t = tenants, A = apps/tenant, E = envs/app, S = sets/env, K = keys/set
- H = cache hit ratio at the SDK (L1) → target ≥ 0.85
- R_sdk = average client read rate per instance (req/s) when polling/long-poll refreshes
- C = active client instances
Resolve QPS to Gateway/Registry¶
- Periodic pull: QPS = C × R_sdk × (1 − H)
- Long-poll (recommended): steady-state QPS ≈ C × (timeout⁻¹) × (1 − H), with held connections ≈ C. Example: C = 20k, timeout = 30 s, H = 0.9 → QPS ≈ 20k × (1/30) × 0.1 ≈ 66.7 RPS.
- WS push + conditional fetch: QPS ≈ events × fanout_selectivity × (1 − H_etag); typically << long-poll.
Memory for Long-Poll Waiters¶
- Per waiter ≈ 1–3 KB (request ctx + selector).
- With C = 20k waiters → 20–60 MB of heap headroom per region (well within a small pool of pods).
Redis Sizing & Topology¶
Key formulas¶
- Hot value size B (compressed on wire): target 8–32 KB (take 16 KB as the average).
- Cache entry ≈ B × overhead_factor, where overhead (key + object + allocator) ≈ 1.3–1.5; use 1.4.
- Working set per region: WS ≈ active_keys × B × 1.4.
Shard plan¶
- Aim per shard ≤ 25 GB resident (for 64 GB nodes → memory headroom & AOF/RDB).
- Shards per region = ceil(WS / 25 GB), then ×2 for replicas.
Example (Pro/Ent region)¶
- Tenants actively reading: 1,000
- Active keys/tenant hot: 800
- active_keys = 1,000 × 800 = 800,000
- WS ≈ 800,000 × 16 KB × 1.4 ≈ 17.9 GB
- Shards needed: 1 (primary) + 1 (replica) → deploy 3 shards to allow headroom and growth, then rebalance.
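The same arithmetic packaged as a small C# what-if helper (constants mirror the formulas above; decimal GB is used to match the worked example, and this is a sizing sketch, not a provisioning tool).
using System;

public static class RedisSizing
{
    public static (double WorkingSetGb, int PrimaryShards, int TotalShards) Estimate(
        long activeKeys, double avgValueKb = 16, double overheadFactor = 1.4, double shardCapGb = 25)
    {
        var workingSetGb = activeKeys * avgValueKb * overheadFactor / 1_000_000;   // KB → decimal GB
        var primaries = (int)Math.Ceiling(workingSetGb / shardCapGb);
        return (workingSetGb, primaries, primaries * 2);                           // ×2 for replicas
    }
}

// Example from above: RedisSizing.Estimate(800_000) → ≈ 17.9 GB working set, 1 primary, 2 with replica.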
Throughput
- Single shard (modern VM, network-optimized) sustains 80–120 k ops/s p99 < 2 ms for GET/SET under pipeline.
- With hit ratio ≥ 0.9, Registry avoids CRDB on ≥ 90% of reads.
Policies
- Eviction: volatile-lru (Enterprise) / allkeys-lru (Starter/Pro).
- SWR ≤ 2 s to bound staleness and prevent stampedes.
- Hash-tag keys by {tenant} to isolate hotspots and ease resharding.
Pod Right-Sizing & Concurrency¶
| Component | CPU req/lim | Mem req/lim | Concurrency (target) | Notes |
|---|---|---|---|---|
| Registry/Resolve (.NET) | 250m / 1.5c | 512 Mi / 1.5 Gi | 400–600 in-flight | Kestrel + async IO, thread-pool min tuned |
| WS Bridge | 200m / 1c | 256 Mi / 1 Gi | 5k conns/pod | uWSGI/Kestrel; heartbeat @ 30 s |
| Refresh Orchestrator | 200m / 1c | 256 Mi / 1 Gi | 200 msgs/s | CPU-light, IO-heavy |
| Adapter Hub | 300m / 1.5c | 512 Mi / 1.5 Gi | 1k ops/s | Batch writes; backpressure aware |
Autoscaling (HPA v2 + KEDA)¶
Registry/Resolve — HPA with custom metrics¶
- Primary: in-flight request gauge
http_server_active_requests(target ≤ 400/pod) - Secondary: CPU target 60%
- Tertiary: p95 latency guardrail (scale-out if > 200 ms for 5 m)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: ecs-registry }
spec:
minReplicas: 4
maxReplicas: 48
metrics:
- type: Pods
pods:
metric:
name: http_server_active_requests
target:
type: AverageValue
averageValue: "400"
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60
behavior:
scaleUp:
stabilizationWindowSeconds: 30
policies:
- { type: Percent, value: 100, periodSeconds: 30 }
scaleDown:
stabilizationWindowSeconds: 300
policies:
- { type: Percent, value: 25, periodSeconds: 60 }
WS Bridge — HPA on connection count¶
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: ecs-ws-bridge }
spec:
minReplicas: 2
maxReplicas: 50
metrics:
- type: Pods
pods:
metric: { name: ws_active_connections }
target: { type: AverageValue, averageValue: "4000" } # keep <4k/pod
Refresh Orchestrator — KEDA on bus depth/throttle¶
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: ecs-refresh-orchestrator }
spec:
scaleTargetRef: { name: ecs-refresh-orchestrator }
pollingInterval: 5
cooldownPeriod: 60
minReplicaCount: 2
maxReplicaCount: 40
triggers:
- type: azure-servicebus
metadata:
queueName: ecs.refresh.events
namespace: sb-ecs
messageCount: "500" # 1 replica per 500 pending
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: propagation_lag_ms
threshold: "3000" # scale if lag > 3s
Adapter Hub — KEDA on ASB/Rabbit and provider throttles¶
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: ecs-adapter-hub }
spec:
scaleTargetRef: { name: ecs-adapter-hub }
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: adapter_throttle_total
threshold: "0" # scale up if throttling detected
- type: prometheus
metadata:
metricName: adapter_ops_backlog
query: sum(rate(adapter_ops_queued_total[1m]))
threshold: "200"
Load Profiles & Playbooks¶
LP-1 Baseline Steady State¶
- Long-poll connections equal to active clients; low QPS due to conditional 304s.
- SLO focus: latency p99 and WS stability.
- Action: monitor waiter pool size, Redis hit ratio ≥ 0.9.
LP-2 Publish Wave (canary→regional)¶
- Burst of RefreshEvents, clients revalidate.
- Action: Orchestrator coalesces events; Registry scales on in-flight requests; Redis lock singleflight enabled.
LP-3 Tenant Hotspot¶
- One tenant updates many keys; selectivity high.
- Action: Per-tenant rate limit clamps; Redis hash-tag isolates shard; temporary override if approved.
LP-4 Cold Start / Region Failover¶
- Rapid connection re-establish; cache cold.
- Action: Pre-warm Redis with last heads; disable aggressive scale-down; surge HPA caps increased for 30 min.
LP-5 Adapter Backpressure¶
- Provider throttles (Azure/AWS).
- Action: KEDA scales Hub, adjusts batch sizes; DLQ monitored; replay after throttle clears.
Capacity Examples (per region)¶
Scenario A — 20k clients, long-poll 30 s, H=0.9¶
- Waiters: 20k; Gateway pods with 4k conns/pod → 5 pods minimum.
- Resolve QPS: ~67 RPS; 4 Registry pods (400 in-flight each) → ample headroom.
- Redis: WS ≈ 10–20 GB → 3-shard cluster (1 primary, 1 replica each) for headroom.
Scenario B — 100k clients, mixed WS/poll¶
- 70% WS, 30% long-poll; H=0.85.
- Long-poll QPS ≈ 30k × (1/30) × 0.15 ≈ 150 RPS.
- Registry: 8–12 pods depending on p95.
- Redis WS ~ 40–60 GB → 3–4 primaries + replicas.
Scale numbers must be validated in perf tests; start with 2× headroom over 30-day peak.
CRDB Considerations (read-through misses)¶
- Keep leaseholders local to region (REGIONAL BY ROW).
- Target miss rate ≤ 10%; CRDB nodes sized for 2–5k QPS aggregated reads with p99 < 20 ms KV.
- Monitor txn_restarts_total; if > 2%, elevate LIMITS and slow publishers.
Performance Testing Plan¶
| Tool | Purpose | Profile |
|---|---|---|
| k6 / Locust | Resolve & long-poll | VUs ramp to 100k conns; p95/p99 latency |
| Vegeta | Publish bursts | 10–50 RPS publish; measure propagation lag |
| Custom WS harness | Push & reconnect storms | 100k connections, 1% churn/min |
| Redis benchmark | Cache ops | pipelined GET/SET @ 128B–64KB values |
Exit criteria
- Sustained p99 Resolve within targets at 1.5× projected peak.
- No DLQ growth during publish wave; propagation p95 within ≤ 5 s.
- HPA/KEDA converges within < 90 s of surge onset.
Cost & Efficiency Levers¶
- Prefer long-poll default; enable WS for high-value tenants only.
- SWR and TTL jitter reduce cache churn by 20–40% under waves.
- Coalescing ratio target ≥ 3× (one backend fetch serves ≥ 3 waiters).
- Scale-to-zero for low-traffic Adapters via KEDA cooldown.
Runbooks (snippets)¶
Latency p99 regression
- Check coalesce_ratio and waiter_pool_size; if low, enable singleflight tuning.
- Inspect Redis latency_ms and evictions; add a shard if > 3 ms p95 or any evictions occur.
- Verify the HPA is not pinned; temporarily raise maxReplicas.
Propagation lag > 8 s
- Validate bus consumer lag; KEDA scale orchestrator.
- Check WS connections; if drops > 1%/h, investigate bridge GC/ephemeral port exhaustion.
- Redis lock contention > 10 ms → increase lock TTL and reduce batch size.
Acceptance Criteria (engineering hand-off)¶
- HPA manifests for Registry and WS Bridge; KEDA ScaledObjects for Orchestrator and Adapter Hub committed.
- Redis cluster charts with shard math and hash-tag strategy documented; alerts for evictions > 0.
- Capacity workbook with what-if calculators (tenants, clients, TTL, H) shared with SRE.
- Runbooks for failover, surge, hot tenant, adapter throttle scenarios.
Solution Architect Notes¶
- Maintain two autoscaling lanes: fast (in-flight, latency) and slow (CPU). Tune stabilization to avoid oscillation.
- Revisit connection per pod limits after GC tuning; consider SO_REUSEPORT listeners for WS scale.
- For very large tenants, consider enterprise per-tenant Redis or keyspace quotas to cap blast radius.
- Evaluate async Resolve (server push of materialized blobs to edge caches) if hit ratio < 0.8 at scale.
Resiliency & Chaos — timeouts/retries/backoff, bulkheads, circuit breakers, chaos experiments & runbooks¶
Objectives¶
Engineer ECS to degrade gracefully under dependency faults, traffic spikes, and regional impairments. This section codifies timeouts, retries, and backoff, bulkheads, circuit breakers, and a chaos program with experiments and runbooks. Defaults are opinionated yet overridable via per-service policy.
Resiliency Profiles (standardized)¶
# Resiliency profiles are shipped as config + code defaults; services load on boot.
profiles:
standard:
timeouts:
http_connect_ms: 2000
http_overall_ms: 4000
grpc_unary_deadline_ms: 250 # Resolve path (in-region)
grpc_batch_deadline_ms: 2000
grpc_stream_idle_min: 60
redis_ms: 200
sql_read_ms: 500
sql_write_ms: 1500
bus_send_ms: 2000
retries:
strategy: "decorrelated_jitter" # FullJitter/EqualJitter allowed
max_attempts: 3
base_delay_ms: 100
max_delay_ms: 800
retry_on: ["UNAVAILABLE","DEADLINE_EXCEEDED","RESOURCE_EXHAUSTED","5xx"]
idempotency_required: true # enforced by client for POST/PUT/DELETE
circuit_breaker:
sliding_window: "rolling"
window_size_sec: 30
failure_rate_threshold_pct: 50
min_throughput: 20
consecutive_failures_to_open: 5
open_state_duration_sec: 30
half_open_max_concurrent: 5
bulkheads:
max_inflight_per_pod: 600 # Registry
per_tenant_max_inflight: 80
redis_pool_size: 256
sql_pool_size: 64
http_pool_per_host: 128
fallback:
serve_stale_while_revalidate_ms: 2000
long_poll_timeout_s: 30
downgrade_to_poll_on_ws_fail: true
conservative: # for cross-region or under incident
timeouts:
grpc_unary_deadline_ms: 400
redis_ms: 300
retries:
max_attempts: 2
base_delay_ms: 200
max_delay_ms: 1200
circuit_breaker:
open_state_duration_sec: 60
Decorrelated jitter backoff (pseudo):
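A minimal C# sketch of the strategy; the delay bounds match the standard profile's base_delay_ms/max_delay_ms, and this is illustrative rather than the shipped resilience library.
using System;
using System.Threading;
using System.Threading.Tasks;

public static class DecorrelatedJitter
{
    // sleep = min(maxDelay, random(baseDelay, previousSleep * 3)); start from baseDelay.
    public static async Task<bool> ExecuteAsync(Func<Task<bool>> attempt, int maxAttempts = 3,
        int baseDelayMs = 100, int maxDelayMs = 800, CancellationToken ct = default)
    {
        var delay = baseDelayMs;
        for (var i = 1; i <= maxAttempts; i++)
        {
            if (await attempt()) return true;                       // caller decides what counts as transient
            if (i == maxAttempts) break;

            delay = Math.Min(maxDelayMs, Random.Shared.Next(baseDelayMs, delay * 3));
            await Task.Delay(delay, ct);                            // decorrelated, jittered sleep
        }
        return false;
    }
}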
Timeouts, Retries & Backoff — per dependency¶
| Caller → Callee | Timeout | Retries (strategy) | Preconditions |
|---|---|---|---|
| SDK → Registry (Resolve unary) | 250–400 ms deadline | 2–3 (jitter) | Always send If-None-Match ETag; idempotent |
| SDK ↔ WS Bridge (push) | Heartbeat 15–30 s, close after 3 misses | Reconnect with backoff 0.5–5 s | Resume token supported |
| Registry → Redis | 200 ms | 2 (jitter 50–250 ms) | Use singleflight lock on fill |
| Registry → CRDB (read) | 500 ms | 1 (if RETRY_SERIALIZABLE) | Read-only; bounded scans |
| Registry → CRDB (write) | 1.5 s | 2 (txn restart only) | Idempotency key on publish |
| Orchestrator → Bus | 2 s | 5 (exp jitter up to 60 s) | Outbox pattern ensures atomicity |
| Hub → Adapters | 1–3 s per op | 3 (provider-aware backoff) | Idempotency key mandatory |
| Adapter → Provider (watch) | Stream idle 60 min | Auto-resume with bookmark | Backpressure aware |
| Gateway → PDP | 5–20 ms | 0 (timeout only) | Fallback deny on timeout (safe default) |
Rules
- Retries only on transient errors. Never retry on 4xx (except 429 with Retry-After).
- Mutations require an Idempotency-Key; otherwise retries are disabled.
- Budget: total attempt duration ≤ caller timeout; avoid retry storms.
Bulkheads & Isolation¶
Concurrency bulkheads
- Per-pod: cap in-flight requests (Registry ≤ 600); queue excess briefly (≤ 50 ms), then 429.
- Per-tenant: token bucket at Gateway; defaults by edition (Starter 50 RPS, Pro 200, Ent 1000; burst ×2).
- Per-dependency pools: separated HttpClient pools per host; Redis/SQL dedicated pools with backpressure.
Resource bulkheads
- Thread pools: pre-warm worker count; limit sync over async.
- Queue isolation: separate DLQ/parking lots per consumer to avoid global blockage.
- Waiter map (long-poll): bounded dictionary; spill to 304 on exhaustion.
Blast-radius controls
- Hash-tag keys `{tenant}` in Redis; circuit break per tenant first, then globally only if necessary.
- Rate-limit publish waves (canary → region) via Orchestrator.
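A minimal sketch of the per-pod concurrency bulkhead described above (admit up to the cap, queue briefly, then reject so the caller can return 429); the `Bulkhead` class and its limits are illustrative:

```typescript
// Bulkhead: admit up to `maxInflight` requests; queue excess briefly, then reject (maps to HTTP 429).
class Bulkhead {
  private inflight = 0;
  private queue: Array<{ admit: () => void; reject: () => void; done: boolean }> = [];

  constructor(private maxInflight = 600, private maxQueueWaitMs = 50) {}

  async run<T>(work: () => Promise<T>): Promise<T> {
    await this.acquire(); // throws BULKHEAD_REJECTED → caller returns 429
    try {
      return await work();
    } finally {
      this.release();
    }
  }

  private acquire(): Promise<void> {
    if (this.inflight < this.maxInflight) {
      this.inflight++;
      return Promise.resolve();
    }
    return new Promise<void>((resolve, reject) => {
      const waiter = { done: false, admit: () => {}, reject: () => {} };
      waiter.admit = () => { if (!waiter.done) { waiter.done = true; resolve(); } };
      waiter.reject = () => { if (!waiter.done) { waiter.done = true; reject(new Error("BULKHEAD_REJECTED")); } };
      this.queue.push(waiter);
      setTimeout(waiter.reject, this.maxQueueWaitMs); // queue excess briefly, then 429
    });
  }

  private release(): void {
    // Hand the slot to the first still-waiting request; otherwise free it.
    while (this.queue.length > 0) {
      const waiter = this.queue.shift()!;
      if (!waiter.done) { waiter.admit(); return; } // slot transfers to the admitted waiter
    }
    this.inflight--;
  }
}
```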
Circuit Breakers¶
State machine (per dependency, per tenant when applicable)
stateDiagram-v2
[*] --> Closed
Closed --> Open: failure rate > threshold OR 5 consecutive failures
Open --> HalfOpen: after open_duration
HalfOpen --> Closed: probe successes (>=3/5)
HalfOpen --> Open: any failure
Telemetry & policy
- Emit `ecs.circuit.state` gauge `{service, dependency, tenant?, state}`.
- Closed: normal.
- Open: SDKs and services fail fast with fallback (serve stale, downgrade to poll, deny writes).
- Half-open: limit concurrent probes to ≤5.
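A simplified sketch of the breaker state machine above with the `standard` profile thresholds (consecutive-failure trip, timed open state, capped half-open probes). In .NET services this is what Polly/resilience pipelines provide; the class below is illustrative only:

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private halfOpenInflight = 0;
  private halfOpenSuccesses = 0;

  constructor(
    private failuresToOpen = 5,        // consecutive_failures_to_open
    private openDurationMs = 30_000,   // open_state_duration_sec
    private halfOpenMaxConcurrent = 5, // half_open_max_concurrent
    private probesToClose = 3,         // probe successes needed to close
  ) {}

  async execute<T>(op: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.openDurationMs) return fallback(); // fail fast
      this.state = "half-open";
      this.halfOpenSuccesses = 0;
      this.halfOpenInflight = 0;
    }
    const isProbe = this.state === "half-open";
    if (isProbe && this.halfOpenInflight >= this.halfOpenMaxConcurrent) return fallback();
    if (isProbe) this.halfOpenInflight++;
    try {
      const result = await op();
      this.onSuccess();
      return result;
    } catch {
      this.onFailure();
      return fallback();
    } finally {
      if (isProbe) this.halfOpenInflight--;
    }
  }

  private onSuccess(): void {
    this.consecutiveFailures = 0;
    if (this.state === "half-open" && ++this.halfOpenSuccesses >= this.probesToClose) {
      this.state = "closed";
    }
  }

  private onFailure(): void {
    if (this.state === "half-open" || ++this.consecutiveFailures >= this.failuresToOpen) {
      this.state = "open";
      this.openedAt = Date.now();
      this.consecutiveFailures = 0;
    }
  }
}
```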
Degrade Modes & Fallbacks¶
| Fault | Degrade Mode | Fallback |
|---|---|---|
| WS Bridge down | Switch clients to long-poll; disable optional features (live diff) | Increase poll timeout to 45 s, jitter |
| Redis unavailable / high latency | Bypass Redis; SWR serve stale ≤2 s; coalesce requests | Tighten per-tenant concurrency; warm cache after recovery |
| CRDB read slowness | Raise Resolve deadlines to 400 ms; prefer cache; shed low-priority traffic | 429 on hot tenants; backpressure publishes |
| Bus throttling | Slow fan-out, batch invalidations | Replay after throttle; keep sdk long-poll working |
| Adapter throttling | Reduce batch size; backoff; serialize writes | Queue to outbox; DRY-RUN validation only |
| PDP degraded | Gateway ext_authz TTL uses last good (≤ 60 s) | Fail-closed on admin ops; show banner in Studio |
Chaos Engineering Program¶
Steady-state hypothesis: Under defined faults, SLOs hold, or the system degrades predictably (bounded latency, clear errors), and recovers automatically without manual intervention.
Experiments Matrix¶
| ID | Fault Injection | Scope | Hypothesis | Success Criteria |
|---|---|---|---|---|
| C-1 | Add 200 ms latency to Redis | Single region, 15 min | p99 Resolve ≤ 400 ms; cache hit ratio dips < 10% | No SLO breach; circuits stay closed; SWR < 2 s |
| C-2 | Redis node kill (forced failover) | 1 shard primary | Operations continue; brief p99 blip | No data loss; recovery < 30 s; 0 evictions |
| C-3 | CRDB leaseholder move | Hot range | Miss path p99 ≤ 500 ms | Txn restarts < 3%; no timeouts |
| C-4 | Bus 429/ServerBusy | Orchestrator consumers | Propagation p95 ≤ 8 s | DLQ stable; replay after clear |
| C-5 | WS Bridge crash loop | Region | Clients auto-downgrade to long-poll | Connections restore; no error budget burn |
| C-6 | DNS blackhole to Adapter provider | Single binding | Hub circuits open per binding only | Other tenants unaffected; queued ops replay |
| C-7 | Partial partition (drop 10% packets) | Gateway↔Registry | Retries/backoff prevent storms | No cascading failure; HPA scales sanely |
| C-8 | Token/JWKS rollover mid-traffic | Edge | Zero auth errors beyond overlap | Authn failure rate < 0.1% spike |
| C-9 | Time skew 2 min on 10% pods | Mixed services | Token validation & TTL tolerate | No systemic 401/304 anomalies |
| C-10 | Region failover drill | Full region | RTO within playbook; SLOs in surviving regions | No cross-tenant leakage; clear comms |
Tooling
- Layer 4/7 faults: Envoy fault filters, ToxiProxy.
- Platform: Chaos Mesh or k6 chaos; K8s PDBs validated.
- Data: CRDB workload generator; leaseholder move via `ALTER TABLE ... EXPERIMENTAL_RELOCATE`.
Schedule
- GameDays quarterly, rotating owners; pre-approved windows.
- Results recorded with hypotheses, evidence, fixes.
Runbooks (actionable)¶
RB-1 Resolve p99 > target (sustained 10 min)¶
- Dashboards: check `cache_hits_total`, `redis.latency_ms`, `coalesce_ratio`, `waiter_pool_size`.
- If Redis latency > 3 ms p95 → add a shard, enable client-side SWR (2 s), increase the Redis pool.
- If coalesce_ratio < 2 → raise waiter debounce to 300–500 ms.
- HPA: verify Registry replicas are not capped; raise `maxReplicas` temporarily.
RB-2 429 spikes for a tenant¶
- Confirm token bucket at Gateway; inspect tenant QPS.
- If legitimate spike, increase burst temporarily; otherwise enable cooldown and inform tenant.
- Enable feature flag to reduce poll frequency for that tenant.
RB-3 DLQ growth in refresh pipeline¶
- Inspect latest DLQ messages (reason); if transient → bulk replay.
- If schema/contract mismatch → parking lot and open incident; restrict publish to canary.
- Scale Orchestrator via KEDA; check bus quotas.
RB-4 Adapter throttling¶
- Reduce batch size 50%, increase backoff to max 60 s.
- Mark binding degraded; notify tenant (Studio banner).
- After clear, replay pending ops; compare provider vs desired.
RB-5 WS instability (drop rate > 1%/h)¶
- Check ephemeral ports, GC pauses on pods; rotate nodes if needed.
- Switch affected tenants to long-poll via feature flag.
- Audit heartbeats & missed count; tune keep-alive timeouts.
RB-6 CRDB restart spikes / txn restarts > 3%¶
- Inspect hot ranges; consider index tweak or split.
- Increase SQL pool temporarily; ensure queries are parameterized and short.
- Scale Registry reads; verify leaseholder locality.
Implementation Guidance (services & SDKs)¶
.NET services
- Use Polly (or built-in resilience pipeline in .NET 8) for retry/circuit/bulkhead.
- One HttpClient per dependency/host; enable HTTP/2 for gRPC.
- Redis: use pipelining and cancellation tokens; set `SyncTimeout` to `redis_ms`.
JS/TS SDK
- Backoff via a `fetch` wrapper; `AbortController` for deadlines (sketched below).
- WS reconnect with decorrelated jitter; `resumeAfter` cursor.
- Persistent cache guarded by ETag; SWR toggled via server hint.
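A sketch of that SDK resolve path: hard deadline via `AbortController`, ETag revalidation, and a serve-stale fallback. The route and cache shape are assumptions, not the published SDK surface:

```typescript
interface CachedConfig { etag: string; body: unknown; fetchedAt: number; }
const cache = new Map<string, CachedConfig>();

// Resolve with a hard deadline and ETag revalidation; a 304 serves the cached body,
// which keeps egress low and makes retries cheap.
async function resolveConfig(url: string, deadlineMs = 400): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), deadlineMs);
  const cached = cache.get(url);
  try {
    const res = await fetch(url, {
      signal: controller.signal,
      headers: cached ? { "If-None-Match": cached.etag } : {},
    });
    if (res.status === 304 && cached) return cached.body; // unchanged: serve the local copy
    if (!res.ok) throw new Error(`RESOLVE_FAILED_${res.status}`);
    const body = await res.json();
    cache.set(url, { etag: res.headers.get("ETag") ?? "", body, fetchedAt: Date.now() });
    return body;
  } catch (err) {
    // Stale-while-revalidate style fallback: serve the last good value on timeout/transient error.
    if (cached) return cached.body;
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```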
Mobile
- Network reachability gating; background fetch limit; throttle under battery saver.
Config knobs (server)
- All limits under `resiliency.*` with tenant/edition overrides; hot-reload on config change.
- Emit the current effective policy into metrics (`resiliency_profile{name="standard"}` gauge = 1).
Testing & Verification¶
- Unit: retry/circuit behavior with virtual time; idempotency invariants.
- Contract: simulate `429`/`503`/`UNAVAILABLE`/`DEADLINE_EXCEEDED` across clients.
- Load: 24 h soak tests with chaos toggles (C-1..C-7).
- Failover drills: at least semi-annual region evacuation exercises.
Acceptance Criteria (engineering hand-off)¶
- Resiliency profile library packaged; services load standard by default with overrides.
- Polly/Resilience pipelines configured for Registry, Orchestrator, Hub, WS Bridge with metrics and logs.
- Circuit breaker telemetry and alerts active (`ecs.circuit.state`, `failure_rate`).
- Chaos runners and manifests for experiments C-1..C-7 in staging; results documented.
- Runbooks RB-1..RB-6 linked in on-call docs; PagerDuty alerts mapped to owners.
Solution Architect Notes¶
- Keep retry budgets tight; retries amplify load—prefer fail fast + fallback.
- Favor per-tenant breakers to preserve global health during hot-tenant incidents.
- Extend chaos to client side (SDKs) in sandbox apps to validate downgrade paths (WS→poll, stale serves).
- Consider adaptive concurrency (AIMD) at Gateway if 429s recur across many tenants.
Migration & Import — bootstrap pathways, bulk import, diff reconcile, blue/green config cutover¶
Objectives¶
Provide safe, repeatable pathways to bring existing configurations into ECS, reconcile differences, and execute a zero-downtime cutover to ECS-managed configuration. This section defines bootstrap options, bulk import contracts, three-way diff & reconcile, and blue/green cutover patterns with runbooks and acceptance criteria.
Migration Personas & Starting Points¶
| Persona | Starting System | Typical Shape | Primary Path |
|---|---|---|---|
| Platform Admin | Fresh tenant | No prior config | Bootstrap Empty (starter templates) |
| SRE/DevOps | Azure AppConfig / AWS AppConfig | Hierarchical keys + labels | Provider-Sourced Import via Adapter Hub |
| Backend Dev | Files in Git (JSON/YAML) | Namespaced files per env | File-Based Import (CLI/Studio) |
| Operations | Consul/Redis/SQL | Flat/prefix keys | Provider-Sourced Import + Mapping Rules |
Bootstrap Pathways¶
1) Bootstrap Empty (Templates)¶
- Create tenant, apps, envs, namespaces.
- Seed starter Config Sets and policy packs (edition overlays).
- Protect with schema validation out of the gate.
sequenceDiagram
participant Admin
participant Studio
participant Registry
Admin->>Studio: Create Tenant/Apps/Envs (wizard)
Studio->>Registry: POST /tenants/... (idempotent)
Registry-->>Studio: Seed templates + policies
2) Provider-Sourced Import (Adapters)¶
- Use Adapter Hub to read from source (Azure/AWS/Consul/Redis/SQL).
- Produce intermediate snapshot in ECS format.
- Run validate + diff against ECS baseline; reconcile, then publish.
3) File-Based Import (CLI/Studio)¶
- Upload a ZIP containing `manifest.yaml` + `configs/*.json|yaml`.
- CLI validates schema locally, calculates a hash, and performs an idempotent batch import.
Import Data Model & Contracts¶
Canonical Import Manifest (YAML)¶
apiVersion: ecs.migration/v1
kind: ImportBundle
metadata:
tenantId: t-123
source: azure-appconfig://appcfg-prod?label=prod
changeId: mig-2025-08-25T10:00Z # idempotency key
spec:
defaultEnvironment: prod
mappings:
- sourcePrefix: "apps/billing"
targetNamespace: "billing"
environment: "prod"
keyTransform: "stripPrefix('apps/billing/')" # helpers: stripPrefix, toKebab, replace
contentTypeRules:
- match: "**/*.json"
contentType: "application/json"
items:
- key: "apps/billing/db/connectionString"
valueRef: "kvref://vault/billing/db-conn#v3" # secrets as refs
meta: { labels: ["prod","blue"] }
- key: "apps/billing/featureToggles/enableFoo"
value: true
meta: { contentType: "application/json" }
Rules
- Secrets: only `valueRef` (`kvref://…`) is allowed for sensitive fields; plaintext is rejected by policy.
- Idempotency: `metadata.changeId` is required; the server dedupes the full bundle and each batch chunk.
- Size limits: default 5k items/bundle (configurable); chunks of ≤500 items.
REST & CLI¶
POST /v1/imports (multipart/zip or application/json)
Headers:
Idempotency-Key: mig-2025-08-25T10:00Z
Response: { importId, statusUrl }
GET /v1/imports/{importId}/status
CLI
ecsctl import apply ./bundle.zip --tenant t-123 --dry-run
ecsctl import plan ./bundle.zip # prints diff summary
ecsctl import approve <importId> # kicks reconcile+publish with policy gates
Three-Way Diff & Reconcile¶
States
- Desired: Import bundle (or provider snapshot post-mapping)
- Current: ECS head (latest published Snapshot/Version)
- Last-Applied: Previous import’s applied hash (if any) to avoid flip-flop
flowchart LR
D[Desired] --- R{Reconcile Engine}
C[Current] --- R
L[Last-Applied] --- R
R --> Patch[JsonPatch + Semantic diff]
Patch --> Plan[Change Plan: upserts/deletes/moves]
Algorithm
- Normalize keys & canonicalize JSON.
- Compute structural diff (RFC 6902) and semantic annotations (breaking/additive).
- Build Change Plan:
- Group by namespace/env.
- Respect ignore rules (e.g., provider metadata keys).
- For conflicts (key renamed vs deleted), prefer rename if mapping rule indicates.
- Validate with JSON Schema + Policy PDP → may yield obligations (approvals, canary).
Outcomes
- Dry-run report: counts (add/remove/replace), risk score, policy obligations.
- Apply mode: create Draft, then Snapshot on success (idempotent by content hash).
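A simplified, key-level TypeScript sketch of the three-way reconcile (the real engine works on canonical JSON and RFC 6902 patches); `buildChangePlan` and its types are illustrative:

```typescript
type KV = Record<string, unknown>;
type Change =
  | { op: "upsert"; key: string; value: unknown }
  | { op: "delete"; key: string };

// Three-way reconcile at key granularity: only touch keys the bundle owns (Desired or
// Last-Applied), so out-of-band edits in Current are preserved instead of flip-flopped.
function buildChangePlan(desired: KV, current: KV, lastApplied: KV): Change[] {
  const plan: Change[] = [];
  const canon = (v: unknown) => JSON.stringify(v); // stand-in for canonical JSON

  for (const [key, value] of Object.entries(desired)) {
    if (!(key in current) || canon(current[key]) !== canon(value)) {
      plan.push({ op: "upsert", key, value });
    }
  }
  for (const key of Object.keys(lastApplied)) {
    if (!(key in desired) && key in current) {
      plan.push({ op: "delete", key }); // previously managed by import, now removed
    }
  }
  return plan;
}

// Example: a key edited out-of-band ("ui/theme") is left alone; a removed managed key is deleted.
const plan = buildChangePlan(
  { "billing/db/poolSize": 64 },
  { "billing/db/poolSize": 32, "ui/theme": "dark", "billing/featureFoo": true },
  { "billing/db/poolSize": 32, "billing/featureFoo": true },
);
// plan: [ { op: "upsert", key: "billing/db/poolSize", value: 64 }, { op: "delete", key: "billing/featureFoo" } ]
```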
Bulk Import from Providers (Adapters)¶
sequenceDiagram
participant Admin
participant Hub as Adapter Hub
participant A as Provider Adapter
participant Reg as Registry
Admin->>Hub: StartImport(tenant/env/ns, source)
Hub->>A: List(prefix, pageSize=500)
A-->>Hub: Items(page) + etags
loop until done
Hub->>Reg: POST /imports:chunk (batch=≤500)
Reg-->>Hub: ChunkAccepted (hash)
Hub->>A: nextPage()
end
Hub->>Reg: FinalizeImport(changeId)
Reg-->>Admin: Plan ready (diff + obligations)
Resiliency
- Each chunk has an idempotency key: `<changeId>-<pageNo>`.
- Backpressure: throttle to respect provider quotas; exponential backoff on 429/ServerBusy.
- DLQ for malformed items with replay token.
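A sketch of the paged, idempotent chunk loop above; `listPage` and `postChunk` stand in for the adapter SPI list call and the Registry `POST /imports:chunk` endpoint and are assumptions:

```typescript
interface Page { items: unknown[]; nextCursor?: string; }
type ListPage = (cursor: string | undefined, pageSize: number) => Promise<Page>;
type PostChunk = (idempotencyKey: string, items: unknown[]) => Promise<void>;

async function runImport(changeId: string, listPage: ListPage, postChunk: PostChunk): Promise<void> {
  let cursor: string | undefined;
  let pageNo = 0;
  do {
    const page = await listPage(cursor, 500);   // chunks of ≤500 items
    const key = `${changeId}-${pageNo}`;        // idempotency key per chunk
    await withBackoff(() => postChunk(key, page.items)); // safe to re-POST on timeout
    cursor = page.nextCursor;
    pageNo++;
  } while (cursor);
}

// Exponential backoff with jitter on 429/ServerBusy, capped at ~60 s.
async function withBackoff(op: () => Promise<void>, maxAttempts = 5): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try { return await op(); }
    catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      const delay = Math.min(60_000, 500 * 2 ** attempt) * (0.5 + Math.random() / 2);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```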
Blue/Green Config Cutover¶
Goal: Switch consumers from Blue (current alias) to Green (new snapshot) without downtime and with instant rollback.
Mechanics¶
- Publish the new snapshot → tag semver (e.g., `v1.9.0`).
- Pin environment alias: `prod-next` → `v1.9.0` (Green).
- Canary rollout (optional): restrict RefreshEvents to a slice of services.
- Promote: flip the `prod-current` alias from Blue → Green.
- Rollback: re-point `prod-current` to the prior Blue snapshot (previous alias pointer).
sequenceDiagram
participant Studio
participant Registry
participant Orchestrator
Studio->>Registry: Tag v1.9.0; Alias prod-next -> v1.9.0
Orchestrator-->>Services: Refresh(selectors=canary)
Note right of Services: validate metrics & errors
Studio->>Registry: Alias prod-current -> v1.9.0 (cutover)
Orchestrator-->>All Services: Refresh(all)
Studio->>Registry: Rollback (if needed) -> prod-current -> v1.8.3
Guarantees
- Cutover changes only alias pointers (constant-time).
- SDKs always revalidate via ETag; if unchanged, 304 path is cheap.
- Rollback emits `ConfigPublished` for the restored alias, preserving at-least-once semantics.
Workflows¶
A) File-Based Import → Plan → Approve → Cutover¶
- Prepare the bundle (`ecsctl import plan`).
- Dry-run in Studio: policy results + risk score; attach a change window if prod.
- Approve (SoD, required approvers).
- Apply: create Draft → Snapshot; tag `vX.Y.Z`.
- Canary (optional): `prod-next` → `vX.Y.Z`; verify metrics.
- Cutover: `prod-current` → `vX.Y.Z`.
- Audit: export plan, approvals, result.
B) Provider-Sourced Live Sync (One-time Migration)¶
- Use Adapter to List source keys; apply mappings.
- Run Reconcile; fix schema violations (add secret refs).
- Freeze writes at source (change window).
- Apply plan; cutover aliases.
- Lock ECS as source of truth (Optional: disable reverse sync).
Mapping & Transformation Rules¶
| Capability | Options |
|---|---|
| Key transforms | stripPrefix, replace(pattern,repl), toKebab, toCamel, lowercase |
| Label/env mapping | Provider labels → ECS environment or meta.labels[] |
| Content types | Infer from extension or explicit contentTypeRules |
| Secret detection | Regex + schema hints → require kvref:// |
| Ignore sets | Drop provider control keys (e.g., _meta, __system/*) |
Validation
- Mapping rules tested in Preview panel with sampled keys.
- Rules stored with ImportBundle to ensure reproducibility.
Safety & Idempotency¶
- Idempotency-Key on bundle + chunks; server returns 200/AlreadyApplied if identical.
- Snapshot creation idempotent by content hash; repeat imports do not create duplicates.
- Write guards: In prod, publish requires approvals & change window policy pass.
- Quotas: import throughput rate-limited per tenant; defaults 200 items/s.
Observability¶
- Spans: `ecs.migration.import.plan`, `ecs.migration.import.apply`, `ecs.migration.reconcile`, `ecs.migration.cutover`.
- Metrics: `import_items_total{result=applied|skipped|invalid}`, `reconcile_conflicts_total{type=rename|delete|typeMismatch}`, `cutover_duration_ms`, `rollback_invocations_total`.
- Logs: structured records with `changeId`, `bundleHash`, `planHash`; no values logged.
Failure Modes & Recovery¶
| Failure | Symptom | Action |
|---|---|---|
| Schema violation | Plan shows blocking errors | Fix mapping or schema; re-plan |
| Secret plaintext detected | Policy DENY | Convert to kvref://; re-plan |
| Provider throttling | Slow import | Hub backs off; resumes; no data loss |
| Partial apply due to timeout | Some chunks pending | Re-POST with same changeId; idempotent |
| Bad cutover | Error spikes | Flip alias back to Blue; open incident; analyze diff |
Runbooks¶
RB-M1 Plan & Dry-Run¶
- `ecsctl import plan` → review counts & policy summary.
- If risk ≥ threshold, escalate to the Approver/Tenant Admin.
RB-M2 Approval & Apply¶
- Ensure change window active for prod.
- Approve in Studio; monitor pdp_decisions_total for obligations.
- Apply; verify `publish_total{result="success"}`.
RB-M3 Canary & Cutover¶
- Point `prod-next` to the new version; watch `propagation_lag_ms` and service error %.
- If stable for N minutes, flip `prod-current`.
- If regression, roll back immediately (alias revert).
RB-M4 Provider Freeze & Final Sync¶
- Freeze writes on source; take final snapshot via adapter.
- Re-plan; apply minimal delta.
- Mark ECS authoritative; decommission source path.
Acceptance Criteria (engineering hand-off)¶
- Import API + CLI support ZIP & JSON; enforces Idempotency-Key and size limits.
- Reconcile engine implements three-way diff with semantic annotations and ignore sets.
- Adapter Hub path supports paged list, chunked import, backoff, DLQ & replay.
- Studio provides Plan view (diff + policy), Approval wiring, Cutover and Rollback buttons with audit.
- Aliasing supports prod-next/prod-current conventions; cutover & rollback are O(1) pointer flips with events emitted.
- End-to-end tests: file import, provider import, conflict resolution, blue/green cutover, rollback.
Solution Architect Notes¶
- Prefer one-time import then lock ECS as source of truth to avoid dual-write drift; if bi-directional is unavoidable, enforce adapter watch + reconcile with clear owner.
- Keep mapping rules versioned with import artifacts; they are part of compliance evidence.
- For very large tenants, stage import by namespace and use canary cutovers per namespace to reduce risk.
- Consider a read-only preview environment wired to prod-next for smoke testing with synthetic traffic before global cutover.
Compliance & Auditability — audit schema, retention, export APIs, SOC2/ISO hooks, PII posture¶
Objectives¶
Establish a tamper-evident, privacy-aware audit layer with clear retention policies, export/eDiscovery APIs, and baked-in hooks for SOC 2 and ISO 27001 evidence. Guarantee multi-tenant isolation, cryptographic integrity, and least-PII practices across all audit data.
Audit Model & Tamper Evidence¶
Event domains¶
- Config Lifecycle: draft edits, validations, snapshots, tags/aliases, deployments, rollbacks.
- Policy & Governance: PDP decisions, risk scores, obligations, approval requests/grants/rejects, change windows.
- Access & Security: logins (Studio), token failures, role/permission changes, break-glass usage.
- Adapters & Refresh: provider sync start/finish, drift detected, cache invalidations, replay actions.
- Administrative: retention changes, exports, legal holds, backup/restore actions.
Canonical event (JSON Schema 2020-12 excerpt)¶
{
"$id": "https://schemas.connectsoft.io/ecs/audit-event.json",
"type": "object",
"required": ["eventId","tenantId","time","actor","action","resource","result","prevHash","hash"],
"properties": {
"eventId": { "type": "string", "description": "UUIDv7" },
"tenantId": { "type": "string" },
"time": { "type": "string", "format": "date-time" },
"actor": {
"type": "object",
"properties": {
"subHash": { "type": "string" }, // salted hash, not raw subject
"iss": { "type": "string" },
"type": { "enum": ["user","service","admin","platform-admin"] },
"ip": { "type": "string" } // truncated / anonymized per policy
}
},
"action": { "type": "string", "enum": [
"Config.DraftEdited","Config.SnapshotCreated","Config.TagUpdated","Config.AliasUpdated",
"Config.Published","Config.RolledBack","Policy.Decision","Policy.Updated",
"Approval.Requested","Approval.Granted","Approval.Rejected",
"Access.RoleChanged","Access.BreakGlass","Adapter.SyncCompleted","Refresh.Invalidate",
"Export.Started","Export.Completed","Retention.Updated","Backup.Restore"
]},
"resource": {
"type": "object",
"properties": {
"type": { "enum": ["ConfigSet","Snapshot","Policy","Approval","Role","Adapter","Export","Tenant"] },
"id": { "type": "string" },
"path": { "type": "string" } // never secret values; path hashed if sensitive
}
},
"result": { "type": "string", "enum": ["success","denied","error"] },
"diffSummary": { "type": "object", "properties": { "breaking": {"type":"integer"}, "additive":{"type":"integer"}, "neutral":{"type":"integer"} } },
"policy": { "type":"object", "properties": { "effect":{"enum":["allow","deny","obligate"]}, "rulesMatched":{"type":"array","items":{"type":"string"}} } },
"approvals": { "type":"array", "items": { "type":"object", "properties": { "bySubHash":{"type":"string"}, "at":{"type":"string","format":"date-time"}, "result":{"enum":["granted","rejected"]} } } },
"etag": { "type": "string" },
"version": { "type": "string" },
"prevHash": { "type": "string" },
"hash": { "type": "string" }, // SHA-256(prevHash || canonicalBody)
"signature": { "type": "string" } // optional JWS for daily manifests
}
}
Hash chain & manifests¶
- Per-tenant, per-day hash chain: every event stores `prevHash` and `hash = SHA-256(prevHash || canonicalBody)`.
- Daily manifest per tenant: `{ day, firstEventId, lastEventId, rootHash, count }`, signed (JWS) with a KMS key (Enterprise).
- Verification: `ecsctl audit verify --tenant t --from 2025-08-01 --to 2025-08-31` reconstructs chains and validates signatures.
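A sketch of the per-event hash computation and chain verification using Node's `crypto`; `canonicalize` here is a simple stand-in for the real canonical-JSON encoding:

```typescript
import { createHash } from "node:crypto";

// Stand-in canonicalization: sort keys so the same body always hashes identically.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// hash = SHA-256(prevHash || canonicalBody); verification replays the chain and
// compares the final value against the signed daily manifest's rootHash.
function chainHash(prevHash: string, body: Record<string, unknown>): string {
  return createHash("sha256").update(prevHash).update(canonicalize(body)).digest("hex");
}

function verifyChain(events: Array<{ prevHash: string; hash: string; body: Record<string, unknown> }>): boolean {
  return events.every((e) => chainHash(e.prevHash, e.body) === e.hash);
}
```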
Storage Tiers & Retention¶
| Tier | Store | Purpose | Default Retention (Starter / Pro / Enterprise) |
|---|---|---|---|
| Hot | CockroachDB (`event_audit`) | recent queries, Studio timelines | 90d / 180d / 365d |
| Warm | Object store (Parquet, partitioned by `tenantId/day`) | eDiscovery, analytics | — / 12m / 36m |
| Cold/Archive | Object archive (WORM optional) | long-term compliance | — / — / 7y |
Mechanics
- Hot tier uses row-level TTL (`ttl_expires_at`) with hourly jobs.
- Nightly compaction/export → Parquet (snappy), plus a signed manifest & `rootHash`.
- A legal-hold flag on the tenant prevents TTL purge and export deletion; it records the linked ticket ID & actor.
Export & eDiscovery APIs¶
Filters & Query DSL¶
- Filterable fields: time range, `tenantId`, `environment`, `action`, `resource.type/id`, `actor.type`, `result`, `correlationId`, `policy.effect`.
- Simple DSL (AND by default, OR with `|`): `time >= 2025-08-01 AND action:Config.Published AND result:success AND env:prod`
Endpoints¶
POST /v1/audit/exports
Body:
{
"tenantId": "t-123",
"query": "time >= 2025-08-01 AND action:Config.Published",
"format": "parquet|csv|ndjson",
"redaction": { "hashSubjects": true, "truncateIp": true, "dropFields": ["resource.path"] },
"sign": true, // Enterprise: attach JWS over manifest
"encryption": "kms://key-ref", // optional server-side encryption for export files
"notify": ["mailto:secops@tenant.com"]
}
GET /v1/audit/exports/{exportId}/status
GET /v1/audit/exports/{exportId}/download // presigned URL, time-limited
DELETE /v1/audit/exports/{exportId} // marks for deletion if not on legal hold
Streaming/listing
Export manifest (signed)
{
"exportId":"e-7c..",
"tenantId":"t-123",
"range":{"from":"2025-08-01T00:00:00Z","to":"2025-08-15T23:59:59Z"},
"count": 12876,
"files":[{"path":"s3://.../part-0001.snappy.parquet","sha256":"..."}],
"rootHash":"...", "signature":"eyJhbGciOiJQUzI..." // optional
}
Audit Access Model¶
| Role | Capabilities |
|---|---|
| Viewer | Read hot timeline for own tenant; no export |
| Security Auditor | Create/Download exports with redaction presets; verify chains |
| Tenant Admin | Manage retention (within policy bounds), legal holds, export keys |
| Platform Admin | Cross-tenant export (break-glass only, ticket required) |
Segregation of Duties: authors of changes cannot approve their own changes and cannot delete/alter audit data. All audit access is itself audited.
SOC 2 / ISO 27001 Hooks¶
Control coverage map (examples)¶
| Domain | Control Objective | Evidence Source |
|---|---|---|
| Change Management | Approvals required; rollback procedures | Audit events Approval.*, Config.Published, manifests |
| Logical Access | Least privilege enforced | Role/permission exports, PDP Decision samples |
| Logging & Monitoring | Audit logging with integrity | Hash chains, signed manifests, collector configs |
| Encryption | Data at rest/in transit | KMS key inventory, TLS config exports |
| Backup & Recovery | Regular backups, restores tested | Backup job logs, restore runbooks & attestations |
| Vendor Management | Adapter access controls | Adapter binding audits, credential rotations |
Evidence automation¶
- Weekly Evidence Pack job (per tenant, Enterprise): ZIP with
- Signed audit manifest for the week
- Access/role assignment CSV
- Policy bundle digest + PDP cache etag
- SLO dashboards (PDF exports)
- Backup status & last restore drill summary
- Webhooks:
ecs.compliance.v1.EvidencePackReadyfor GRC integration.
PII Posture & Privacy¶
Principles¶
- Minimize: audit stores metadata, never raw config values or secrets.
- Pseudonymize: user identifiers stored as salted hashes (`actor.subHash`).
- Masking: IPs truncated or anonymized per tenant policy; paths containing known PII patterns are hashed.
- Tagging: the audit schema includes `dataClass` labels; exporters drop high-risk fields by default.
Data subject requests (DSAR)¶
- DSAR search runs against audit metadata only; personal data is pseudonymized—responses include existence proofs without revealing sensitive content.
- Erase: where law requires, erase user identifiers by rotating salt/mapper, preserving event integrity. Chain integrity is retained by erasing only derived PII fields, not event core.
Residency¶
- Audit data is written to the tenant's home region; exports enforce regional buckets.
Operational Schema & DDL (illustrative)¶
CREATE TABLE ecs.event_audit (
tenant_id UUID NOT NULL,
event_id UUID NOT NULL DEFAULT gen_random_uuid(),
time TIMESTAMPTZ NOT NULL DEFAULT now(),
actor_sub_hash STRING NOT NULL,
actor_iss STRING NOT NULL,
actor_type STRING NOT NULL,
action STRING NOT NULL,
resource_type STRING NOT NULL,
resource_id STRING NULL,
resource_path_hash STRING NULL,
result STRING NOT NULL,
diff_summary JSONB NULL,
policy JSONB NULL,
etag STRING NULL,
version STRING NULL,
prev_hash STRING NOT NULL,
hash STRING NOT NULL,
ttl_expires_at TIMESTAMPTZ NULL,
correlation_id STRING NULL,
crdb_region crdb_internal_region NOT NULL DEFAULT default_to_database_primary_region(),
CONSTRAINT pk_event_audit PRIMARY KEY (tenant_id, time, event_id)
) LOCALITY REGIONAL BY ROW
WITH (ttl = 'on',
ttl_expiration_expression = 'ttl_expires_at',
ttl_job_cron = '@hourly');
CREATE INDEX ix_audit_action ON ecs.event_audit (tenant_id, action, time DESC);
CREATE INDEX ix_audit_resource ON ecs.event_audit (tenant_id, resource_type, resource_id, time DESC);
Observability of the Audit Pipeline¶
- Metrics: `audit_events_total{result}`, `audit_chain_verify_failures_total`, `audit_export_jobs_running`, `audit_export_duration_ms`, `legal_holds_total`.
- Traces: `ecs.audit.record`, `ecs.audit.export`, `ecs.audit.verify` (linked to `x-correlation-id`).
- Alerts:
- Chain verification failures > 0 in 5m
- Export job failures > 0 in 15m
- Hot-tier backlog (TTL job lag) > 30m
Runbooks¶
RB-C1: Verify chain integrity for a tenant/day
- `ecsctl audit verify --tenant t-123 --date 2025-08-20`
- If it fails: fetch the daily manifest; recompute locally; compare `rootHash`.
- If the mismatch persists: open a SEV-2, freeze retention, snapshot the hot partition, start forensics.
RB-C2: Respond to auditor request (SOC 2)
- Create an Evidence Pack for the period → `POST /v1/audit/exports` with `sign=true`.
- Attach the role/permission export and change-management approvals.
- Provide verification steps & public cert/JWKS link.
RB-C3: Legal hold
- Set tenant `legalHold=true` with a ticket ID.
- Confirm the TTL job skips held partitions; replicate warm data to the hold bucket.
- Audit all actions with a `LegalHoldSet` event.
Acceptance Criteria (engineering hand-off)¶
- Audit events emitted for all enumerated domains with hash chain fields populated.
- Row-level TTL and nightly Parquet exports with signed manifests (Ent).
- Export/eDiscovery API implemented with query DSL, redaction presets, KMS encryption, and presigned downloads.
- Role-gated access with SoD; all audit access audited.
- Evidence Pack automation and webhook delivered.
- `ecsctl` commands for verify, export, and manifest show.
- DSAR procedures documented; salt rotation mechanism implemented for pseudonymized identifiers.
Solution Architect Notes¶
- Keep manifest signing feature-flagged for Pro; enforce for Enterprise tenants handling regulated workloads.
- Consider Merkle tree roots per hour to enable partial range verification at scale.
- Align export schemas with common GRC tools to avoid ETL (field names stable, enums documented).
- Schedule quarterly integrity drills that verify a random sample of tenants and produce a signed attestation.
Cost Model & FinOps — storage/cache costs per tenant, egress, adapter costs, throttling strategies¶
Objectives¶
Define a transparent, meter-driven cost model and FinOps practices so ECS can:
- Attribute platform costs per tenant (showback/chargeback).
- Forecast and steer spend for storage, cache, egress, adapters, and observability.
- Enforce edition-aware throttles and autoscaling that protect SLOs and budgets.
Cost Architecture (what we meter & how we allocate)¶
flowchart LR
subgraph Meters
RQ[Resolve Calls]
WS[WS/LP Connection Hours]
RBH[Redis Bytes-Hours]
SBH[Storage Bytes-Hours (CRDB)]
EGR[Egress Bytes]
EVT[Refresh Events]
ADP[Adapter Ops (read/write/watch)]
KMS[KMS/Secrets Ops]
OBS[Telemetry (traces/metrics/logs)]
end
Meters --> UL[Usage Ledger (per-tenant, per-day)]
UL --> CE[Cost Engine (unit price table by region/provider)]
CE --> SB[Showback/Chargeback]
CE --> BDG[Budgets & Alerts]
CE --> FP[Forecasts (q/q)]
Usage Ledger (authoritative)¶
Per tenant/day we persist:
- `resolve_calls`, `resolve_egress_bytes`
- `ws_connection_hours`, `longpoll_connection_hours`
- `redis_bytes_hours` (avg resident bytes × hours)
- `crdb_storage_bytes_hours` (table + indexes)
- `snapshots_created`, `snapshot_bytes_exported`
- `refresh_events_published`, `dlq_messages`
- `adapter_ops_{provider}.{get|put|list|watch}`, `adapter_egress_bytes`
- `kms_ops`, `secret_resolutions`
- `otel_spans_ingested`, `metrics_series`, `logs_gb`
All meters are derived from production telemetry; aggregation jobs roll up hourly → daily to bound cardinality.
Unit Price Table (configurable per region/provider)¶
| Meter | Unit | Key drivers | Example field names (in config) |
|---|---|---|---|
| CRDB storage | GB-hour | data + indexes | price.crdb.gb_hour[region] |
| Redis cache | GB-hour | memory footprint | price.redis.gb_hour[region] |
| Internet egress | GB | API payloads, WS traffic | price.egress.gb[region] |
| Event bus | 1K ops | publish + consume | price.bus.kops[region] |
| Adapter API | 1K ops | provider calls | price.adapter.{provider}.kops[region] |
| KMS/Secrets | 1K ops | decrypt/get | price.kms.kops[cloud], price.secrets.kops[cloud] |
| Observability | GB / span | logs/metrics/traces | price.obs.logs.gb, price.obs.traces.kspans |
Do not hardcode cloud list prices in code; load them from ops-config, per account/contract.
Per-Tenant Cost Formulas (parametric)¶
Let P_* be unit prices, and U_* the usage meters for tenant T in period D.
Cost_CRDB(T,D) = U_storage_gb_hours * P_crdb_gb_hour
Cost_Redis(T,D) = U_redis_gb_hours * P_redis_gb_hour
Cost_Egress(T,D) = U_egress_gb * P_egress_gb
Cost_Bus(T,D) = (U_bus_publish + U_bus_consume) / 1000 * P_bus_kops
Cost_Adapter(T,D)= Σ_provider (U_adapter_ops_provider / 1000 * P_adapter_provider_kops)
Cost_KMS(T,D) = U_kms_ops / 1000 * P_kms_kops + U_secret_gets / 1000 * P_secrets_kops
Cost_Obs(T,D) = U_logs_gb * P_logs_gb + U_spans_k * P_traces_kspans + U_metrics_series * P_metrics_series
Total(T,D) = Σ all components
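A sketch of the Cost Engine math above; field names mirror the meters and price keys, and the type names are illustrative. Prices must always come from the ops-config pricebook, never be hardcoded:

```typescript
interface Usage {            // one tenant, one day (from the Usage Ledger)
  storageGbHours: number;
  redisGbHours: number;
  egressGb: number;
  busOps: number;            // publish + consume
  adapterOps: Record<string, number>; // per provider
  kmsOps: number;
  secretGets: number;
  logsGb: number;
  spansK: number;
}

interface Prices {           // loaded from ops-config per region/contract
  crdbGbHour: number;
  redisGbHour: number;
  egressGb: number;
  busKops: number;
  adapterKops: Record<string, number>;
  kmsKops: number;
  secretsKops: number;
  logsGb: number;
  tracesKspans: number;
}

function tenantDailyCost(u: Usage, p: Prices): Record<string, number> {
  const adapter = Object.entries(u.adapterOps)
    .reduce((sum, [provider, ops]) => sum + (ops / 1000) * (p.adapterKops[provider] ?? 0), 0);
  const components = {
    crdb: u.storageGbHours * p.crdbGbHour,
    redis: u.redisGbHours * p.redisGbHour,
    egress: u.egressGb * p.egressGb,
    bus: (u.busOps / 1000) * p.busKops,
    adapter,
    kms: (u.kmsOps / 1000) * p.kmsKops + (u.secretGets / 1000) * p.secretsKops,
    obs: u.logsGb * p.logsGb + u.spansK * p.tracesKspans,
  };
  return { ...components, total: Object.values(components).reduce((a, b) => a + b, 0) };
}
```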
Attribution rules
- Redis & CRDB bytes-hours: weighted by `tenant_id` partition sizes; Redis estimated from on-host key scans sampled each minute × hours.
- Adapter ops: counted at Adapter Hub (SPI calls); include retries (we also report retry rate to optimize).
- Observability: charge low cardinality metrics as platform; high-volume logs/spans proportional to tenant action rate.
Unit Economics (starter baselines)¶
Define internal pricebook for showback and external chargeback (if applicable):
| Edition | Included monthly quota (guardrails) | Overage rate examples |
|---|---|---|
| Starter | 5M `resolve_calls`, 5 GB egress, 1 GB-mo storage, 1 GB-mo Redis | per 1M resolves, per GB egress, per GB-mo storage/cache |
| Pro | 50M resolves, 50 GB egress, 10 GB-mo storage, 5 GB-mo Redis, 5M bus ops | tiered overage discounts |
| Enterprise | Custom commit, premium WS | contract-specific |
Quotas power shaping & alerts; they are not hard limits unless policy enforces.
Cost Dashboards & Budgets¶
Tenant Cost Overview
- Cost by component (CRDB, Redis, Egress, Bus, Adapter, KMS, Obs)
- Unit drivers (resolves, events, bytes) + efficiency KPIs: `cache_hit_ratio`, `avg_payload_bytes`, `retry_rate`, `propagation_lag`
- Budget progress bars and anomaly score (see below)
Platform FinOps
- Cost per region, per edition, per service
- Redis GB-hours vs hit ratio; CRDB GB-hours vs version growth
- Egress by route; adapters ops by provider; top 10 hot tenants
Budgets & Alerts
- Soft budget: 80% warn, 100% alert, 120% cap recommendation
- Anomaly detection (day-over-day z-score) on: egress, adapter ops, logs GB
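A sketch of the day-over-day z-score check behind those anomaly alerts; the window length and threshold are illustrative defaults:

```typescript
// Flag a meter (e.g., egress GB/day) as anomalous when today's value deviates from the
// trailing window by more than `threshold` standard deviations.
function isAnomalous(history: number[], today: number, threshold = 3): boolean {
  if (history.length < 7) return false; // not enough signal yet
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance = history.reduce((a, b) => a + (b - mean) ** 2, 0) / history.length;
  const stdDev = Math.sqrt(variance);
  if (stdDev === 0) return today !== mean;
  return Math.abs(today - mean) / stdDev > threshold;
}

// Example: a steady ~5 GB/day tenant suddenly pushing 18 GB trips the alert.
isAnomalous([5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.0], 18); // true
```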
Throttling & Cost-Shaping Strategies (edition-aware)¶
| Cost Driver | Strategy | Control Point |
|---|---|---|
| Egress | ETag everywhere, gzip, gRPC compression, field projection (exclude unchanged) | Gateway/Resolve |
| Resolve QPS | Default long-poll; WS only for eligible tenants; increase poll interval under load | SDK feature flags |
| Redis GB-hours | Cap value size (≤128KB), SWR ≤ 2s, LRU policy, hash-tag by tenant | Registry/Redis |
| CRDB growth | TTLs on audit/diffs, archive exports to object store, semantic diffs to reduce payload | Registry/Jobs |
| Adapter ops | Batch list/put, backoff on 429, delta cursors, import windows | Adapter Hub |
| Observability | Sample traces (tail-based), logs sampling on INFO, keep structured metrics only | OTEL Collector |
| Bus ops | Coalesce refresh events per ETag & scope, suppress duplicates within 250ms | Orchestrator |
| WS costs | Cap connections/tenant, idle timeouts, downgrade to long-poll when idle | WS Bridge |
Policy hooks
- PDP can obligate tenants to reduce cost (e.g., raise poll interval) when over budget.
Example Tenant Scenarios (illustrative, not a quote)¶
Assume (for math only):
P_redis_gb_hour=$0.02, P_crdb_gb_hour=$0.01, P_egress_gb=$0.08, P_bus_kops=$0.15, P_adapter_azapp_kops=$0.50.
Pro Tenant (monthly)¶
- 30M resolves, avg payload 10 KB compressed, 90% cache hit → egress ≈ 3M × 10 KB ≈ 28.6 GB
- Redis avg resident 3 GB → 3 GB × 720 h = 2,160 GB-h
- CRDB storage avg 8 GB → 8 GB × 720 h = 5,760 GB-h
- Bus ops 2M (publish + consume)
- Adapter AzureAppConfig ops 500k
Costs
- Redis: 2,160 GB-h × $0.02 = $43.20
- CRDB: 5,760 GB-h × $0.01 = $57.60
- Egress: 28.6 GB × $0.08 ≈ $2.29
- Bus: 2,000,000 ops / 1,000 × $0.15 = $300
- Adapter: 500,000 ops / 1,000 × $0.50 = $250
- Subtotal ≈ $653 (+ observability overhead if tenant-heavy)
Actionables to reduce: raise hit ratio to 92%, coalesce refreshes, batch adapter writes.
Optimization Playbook (FinOps levers)¶
- Reduce payload size: prune keys, compress, project; target ≤ 12 KB p50.
- Increase cache hits: ensure SDKs default to long-poll/WS + ETag; fix chatty clients.
- Right-size Redis: measure `redis_bytes_hours`; add shards only if latency p95 > 3 ms or evictions > 0.
- Trim storage: enforce TTLs on diffs/audit (per edition); export old versions to Parquet.
- Adapter efficiency: switch to delta sync; widen polling intervals; schedule imports in off-peak windows.
- Observability costs: cap log verbosity, keep histograms with exemplars; drop high-cardinality labels.
- Network: prefer in-region access (avoid x-region egress); push WS only for active namespaces.
Governance: Showback/Chargeback¶
- Showback: monthly PDF/CSV per tenant with:
- Usage meters, effective unit prices, total by component, trend chart, anomaly notes.
- Chargeback (optional): map to SKUs: `CFG-READ` (per 1M resolves), `CFG-STOR` (per GB-mo), `CFG-CACHE` (per GB-mo), `CFG-EGR` (per GB), `CFG-ADP-{provider}` (per 1K ops).
- Contract hooks: Enterprise tenants can pre-commit capacity (discounted) with alerts if sustained > 120% for 3 days.
Automation & Data Flow¶
- Tagging/labels: all resources include `cs:tenant`, `cs:service`, `cs:env`, `cs:edition`.
- Export jobs push the Usage Ledger to the cost platform (per cloud): CUR/BigQuery/Billing export + our enrichment.
- Forecasting: Holt-Winters on resolves/egress to project 30/90-day spend; include seasonality from release cadence.
Cost-Anomaly Runbooks¶
RB-F1 Egress Spike
- Dashboard → per-route egress; find tenant & path.
- Check the `304` rate; if low, investigate missing ETags or payload bloat.
- Temporarily raise the client poll interval; enable field projection.
RB-F2 Adapter Op Surge
- Identify provider & binding; check throttle events.
- Increase batch size if under cap; otherwise backoff and schedule window.
- Notify tenant; lock dual-write if drift source identified.
RB-F3 Redis GB-hours Up
- Inspect top key sizes & TTL; enforce value size cap.
- Increase SWR window to 2s; review pinning policies.
- If still high, move cold namespaces to document aggregation (fewer large values, fewer keys).
RB-F4 Observability Overrun
- Lower log sampling on INFO; ensure tail-based tracing active.
- Drop unused metrics; dedupe labels; compress logs.
- Add budget guard at Collector; alert if exceeded.
HPA/KEDA & FinOps Coupling¶
- Scale on business metrics (in-flight requests, lag) not raw CPU only.
- Scale-down protection when budgets are healthy and latency SLOs are green; avoid oscillation that increases cost via cache churn.
- Scheduled scale for known peaks (releases) to reduce reactive over-scale.
Acceptance Criteria (engineering hand-off)¶
- Usage Ledger schema implemented; daily rollups persisted and queryable per tenant.
- Cost Engine reads regional price tables; produces per-tenant daily cost with component breakdown.
- Dashboards: Tenant Cost, Platform FinOps, Anomalies; budgets & alerts wired.
- Edition quotas and PDP obligations to shape high-cost behaviors (poll interval, WS eligibility).
- Showback export (CSV/PDF) and API endpoints for /billing/usage.
- Runbooks RB-F1..RB-F4 published; anomaly jobs (z-score) live.
- Unit tests validate meters vs traces/metrics; backfills for late-arriving telemetry.
Solution Architect Notes¶
- Keep unit prices externalized; treat FinOps math as configuration, not code.
- Revisit pricebook SKUs quarterly; align with actual cloud invoices.
- Consider per-tenant Redis only for Enterprise with extreme isolation needs; otherwise hash-tag + ACLs suffice.
- Evaluate request-level compression dictionaries for large JSON sections if payloads dominate egress.
- Add cost SLOs (e.g., $/1M resolves target) to drive continuous efficiency without harming latency SLOs.
Deployment Topology — AKS clusters, namespaces, regions, multi-AZ, blue/green & canary patterns¶
Objectives¶
Specify how ECS is deployed on Azure Kubernetes Service (AKS) with multi-region, multi-AZ resilience; enforce clean isolation via namespaces; and standardize progressive delivery (blue/green & canary) for services and the Studio UI, aligned to data residency and tenant tiers.
High-Level Topology¶
flowchart LR
AFD[Azure Front Door + WAF] --> EG[Envoy Gateway (AKS)]
EG --> YARP[YARP BFF]
EG --> REG[Config Registry]
EG --> PDP[Policy Engine]
EG --> ORC[Refresh Orchestrator]
EG --> WS[WS/LP Bridge]
EG --> HUB[Provider Adapter Hub]
REG <--> CRDB[(CockroachDB Multi-Region)]
REG <--> REDIS[(Redis Cluster)]
ORC <--> ASB[(Azure Service Bus)]
HUB <--> Providers[(Azure/AWS/Consul/Redis/SQL)]
subgraph AKS Cluster (per region)
EG
YARP
REG
PDP
ORC
WS
HUB
end
Edge & DNS
- Azure Front Door (AFD) + WAF terminates TLS, performs geo-routing, and health probes.
- Envoy Gateway in AKS handles per-route authn/authz, RLS, and traffic splitting for rollouts.
Environments, Clusters & Namespaces¶
| Environment | Cluster Strategy | Namespaces (examples) | Notes |
|---|---|---|---|
| dev | 1 AKS per region (cost-optimized) | `dev-system`, `dev-ecs`, `dev-ops` | Lower SLOs; spot where safe |
| staging | 1 AKS per region | `stg-ecs`, `stg-ops` | Mirrors prod; synthetic traffic |
| prod | 2–3 AKS clusters per geo (one per region) | `prod-ecs`, `prod-ops`, `prod-ecs-blue`, `prod-ecs-green` | Blue/green via namespace swap |
Namespace policy
- NetworkPolicies deny-all by default; allow only service-to-service with SPIFFE IDs.
- Separate ops namespace for collectors, Argo CD/Rollouts, KEDA, Prometheus.
Regions & Data Residency¶
- Minimum 2 regions per geo (e.g., westeurope + northeurope) in active-active for stateless services; CockroachDB spans both with REGIONAL BY ROW (tenant home region pinned).
- Tenants tagged with homeRegion; AFD routes to nearest allowed region (residency enforced).
Multi-AZ & Scheduling Policy¶
AKS nodepools
- `np-system` (DSv5) for control & gateways (taint: `system=true:NoSchedule`).
- `np-services` (DSv5) for app pods (balanced across zones 1/2/3).
- `np-memq` (memory-optimized) for Redis if self-managed (recommend Azure Cache for Redis).
- `np-bg` for blue/green surges (autoscaled on demand).
Workload policies
- `topologySpreadConstraints`: zones 1/2/3, max skew 1.
- `PodDisruptionBudget`: minAvailable 70% for stateless, 60% for WS Bridge.
- `nodeAffinity`: pin CRDB/Redis clients to low-latency pools.
- Zone-aware readiness: Envoy routes only to pods Ready in the same zone by default.
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
labelSelector: { matchLabels: { app: ecs-registry } }
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector: { matchLabels: { app: ecs-registry } }
topologyKey: kubernetes.io/hostname
Data Plane Deployments¶
CockroachDB (managed/self-hosted)¶
- Multi-region cluster; leaseholders localized to the tenant home region; `REGIONAL BY ROW` tables.
- Zone redundancy: node groups across AZ 1/2/3 per region; `PDB` + `podAntiAffinity`.
- Backups to region-local blob storage; cross-region async copies nightly.
Redis¶
- Prefer Azure Cache for Redis (Premium/Enterprise) with zone redundancy.
- For self-managed: Redis Cluster with 3 primaries × 3 replicas per region; AOF enabled with `appendfsync everysec`.
- Key hash-tagging `{tenant}` to preserve per-tenant isolation.
Azure Service Bus¶
- Premium namespace per geo; consumer groups per service; ForwardTo DLQ and parking-lot queues.
Delivery Patterns — Blue/Green & Canary¶
Service Blue/Green (namespace-based)¶
- Each prod cluster hosts both `prod-ecs-blue` and `prod-ecs-green`.
- AFD → Envoy splits traffic by header/cookie or percentage to Services in the target namespace.
- Promote by flipping the Envoy `HTTPRoute`/`TrafficSplit` to Green, preserving Blue for instant rollback.
# Envoy Gateway (Gateway API) HTTPRoute snippet
kind: HTTPRoute
spec:
rules:
- matches: [{ path: { type: PathPrefix, value: /api/ }}]
filters:
- type: RequestHeaderModifier # inject tenant headers if needed
backendRefs:
- name: ecs-registry-svc-green
weight: 20
- name: ecs-registry-svc-blue
weight: 80
Canary (Argo Rollouts)¶
- Strategy: 5% → 25% → 50% → 100% with Analysis between steps using Prometheus queries:
  - `http_request_duration_ms{route="/resolve",quantile="0.99"} < 400`
  - `http_requests_error_ratio{route="/resolve"} < 0.5%`
  - `pdp_decision_ms_p99 < 20`
- Rollbacks on analysis failure; Blue remains untouched.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
strategy:
canary:
steps:
- setWeight: 5
- pause: { duration: 3m }
- analysis: { templates: [{ templateName: ecs-slo-checks }] }
- setWeight: 25
- pause: { duration: 5m }
- setWeight: 50
- pause: { duration: 5m }
- setWeight: 100
Studio & Static Assets¶
- Blue/green buckets (`cdn-blue`, `cdn-green`) behind AFD; swap origin on promote; immutable asset hashes prevent cache poisoning.
Configuration Cutover (Runtime)¶
- Use alias flip (`prod-current → vX.Y.Z`) independently of the service rollout (see the Migration section).
- Coordinate the service canary with the config canary for risky changes.
Autoscaling & Surge Capacity¶
- HPA for services (in-flight requests, CPU) with zone-balanced scale out.
- KEDA for event consumers (Service Bus queue depth; propagation lag).
- Maintain a surge nodepool (`np-bg`) with cluster autoscaler max surge to absorb blue/green double capacity.
GitOps & Promotions¶
- Argo CD per cluster watches environment repos: `apps/dev`, `apps/stg`, `apps/prod`.
- Promotions are PR-driven:
  - Build → sign image (cosign) → update Helm chart values in `stg`.
  - Run smoke + e2e; Argo Rollouts canary passes gates.
  - Promote to `prod-ecs-green`; automated canary.
  - Flip Envoy weights to 100% Green; archive Blue after bake.
Security & Secrets in Topology¶
- Azure Workload Identity for AKS ↔ AAD; pod-level identities fetch secrets via AKV CSI Driver.
- mTLS mesh (SPIFFE IDs) between services; Envoy `ext_authz` to PDP.
- Per-namespace NetworkPolicies and Azure NSGs block lateral movement.
DR & Failover¶
| Scenario | Action | RTO/RPO |
|---|---|---|
| Single AZ loss | Zone spread + PDB continue service | 0 / 0 |
| Single region impairment | AFD shifts to healthy region; CRDB serves local rows; WS/long-poll reconnect | RTO ≤ 5 min / RPO 0 |
| CRDB region outage | Surviving region serves all read requests; writes for tenants pinned to failed region are throttled unless policy allows re-pin | RTO ≤ 15 min / RPO ≤ 5 min (if re-pin) |
Runbook hooks
- Toggle read-only per tenant if their home region is down; optional temporary re-pin with audit.
Reference Manifests (snippets)¶
PodDisruptionBudget — Registry
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: pdb-ecs-registry, namespace: prod-ecs-green }
spec:
minAvailable: 70%
selector: { matchLabels: { app: ecs-registry } }
HPA — WS Bridge (connections)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: hpa-ws-bridge, namespace: prod-ecs-green }
spec:
minReplicas: 3
maxReplicas: 50
metrics:
- type: Pods
pods:
metric:
name: ws_active_connections
target:
type: AverageValue
averageValue: "4000"
NetworkPolicy — namespace default deny
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: prod-ecs-green }
spec:
podSelector: {}
policyTypes: ["Ingress","Egress"]
Observability for Rollouts¶
- Rollout dashboards: error ratio, p95/p99 latency, WS reconnects, policy denials; burn-rate tiles during canary.
- Exemplars link from latency histograms to canary pod traces.
- AFD health + per-region SLIs visible to on-call; alerts wired to promotion pipeline.
Acceptance Criteria (engineering hand-off)¶
- AKS clusters provisioned in 2+ regions, 3 AZs each; nodepools & taints in place.
- Namespaces created with deny-all NetworkPolicies, SPIFFE/mTLS configured.
- Envoy Gateway and Gateway API configured with traffic split; Argo Rollouts installed and integrated with Prometheus SLO checks.
- Blue/green namespaces (
prod-ecs-blue,prod-ecs-green) workable; promotion playbooks done. - HPA/KEDA policies committed for Registry, WS Bridge, Orchestrator, Adapter Hub.
- AFD/WAF routes implemented with region & residency rules; health probes tied to Envoy readiness.
- DR drills documented: AZ loss, region failover, CRDB re-pin; measured RTO/RPO recorded.
Solution Architect Notes¶
- Prefer managed Redis and managed CockroachDB where available to simplify AZ operations.
- Keep AFD origin groups per region to avoid cross-region hairpin under partial failures.
- For high-value tenants, offer per-tenant canary labels (AFD header) to scope early traffic safely.
- Consider Gateway API + Envoy advanced LB for zone-local routing to reduce cross-AZ latency and cost.
CI/CD & IaC — repo layout, pipelines, artifact signing, Helm/Bicep/Pulumi, env promotion policies¶
Objectives¶
Provide a secure-by-default, GitOps-first delivery system for ECS that:
- Standardizes repo layout across services, SDKs, adapters, and infra.
- Ships reproducible builds with SBOM, signatures, and provenance.
- Uses Helm (K8s), Bicep (Azure), and optional Pulumi (multi-cloud/app infra).
- Enforces environment promotion policies (SoD, approvals, change windows) integrated with the Policy Engine.
Repository Strategy¶
Repos (hybrid)¶
- ecs-platform (monorepo): services (Registry, Policy, Orchestrator, WS Bridge, Adapter Hub), Studio BFF/UI, libraries.
- ecs-charts: Helm charts & shared chart library.
- ecs-env: GitOps environment manifests (Argo CD app-of-apps), per region/env.
- ecs-infra: Azure infra (Bicep modules), optional Pulumi program for cross-cloud.
- ecs-sdks: .NET / JS / Mobile SDKs.
- ecs-adapters: provider adapters (out-of-process plugins), conformance tests.
Rationale: app code evolves rapidly (monorepo aids refactors); environment state and cloud infra are separated with stricter review controls.
Monorepo layout (ecs-platform)¶
/services
/registry
/policy
/orchestrator
/ws-bridge
/adapter-hub
/libs
/common (logging, OTEL, resiliency, auth)
/contracts (gRPC/proto, OpenAPI)
/studio
/bff
/ui
/tools
/build (pipelines templates, scripts)
/dev (local compose, kind)
/.woodpecker | /.github | /azure-pipelines (CI templates)
/VERSION (semantic source of truth)
Build & Release Pipelines (templatized)¶
Pipeline stages (per service)¶
- Prepare: detect changes (path filters), restore caches (.NET/Node).
- Build: compile, unit tests, lint, license scanning.
- Security: SAST, dep scan (Dependabot/Snyk), container scan (Trivy/Grype).
- Package: container build (BuildKit), SBOM (Syft), provenance (SLSA v3+).
- Sign: cosign keyless (OIDC) or KMS-backed; attach attestations.
- Push: ACR (or GHCR) by immutable digest only.
- Deploy dev: update ecs-env/dev via PR (Argo CD sync).
- E2E/Contracts: run k6/gRPC contracts in dev namespace.
- Promote: open PRs to `staging` and then `prod` with gates (below).
flowchart LR
A[Commit/PR]-->B[CI Build+Test]
B-->C[Security Scans]
C-->D[Image+SBOM+Provenance]
D-->E[Cosign Sign]
E-->F[Push to ACR (by digest)]
F-->G[PR to ecs-env/dev]
G-->H[ArgoCD Sync + E2E]
H-->I{Gates Pass?}
I--Yes-->J[PR to ecs-env/staging -> canary]
J-->K[PR to ecs-env/prod -> blue/green]
I--No-->X[Fail + Rollback]
Example: GitHub Actions template (service)¶
name: ci-service
on:
push: { paths: ["services/registry/**", ".github/workflows/ci-service.yml"] }
pull_request: { paths: ["services/registry/**"] }
jobs:
build:
runs-on: ubuntu-22.04
permissions:
id-token: write # OIDC for cosign keyless
contents: read
packages: write
steps:
- uses: actions/checkout@v4
- uses: actions/setup-dotnet@v4
with: { dotnet-version: "8.0.x" }
- uses: actions/cache@v4
with:
path: ~/.nuget/packages
key: nuget-${{ runner.os }}-${{ hashFiles('**/*.csproj') }}
- run: dotnet build services/registry -c Release
- run: dotnet test services/registry -c Release --collect:"XPlat Code Coverage"
- name: Build image
run: |
docker buildx build -t $IMAGE:$(git rev-parse --short HEAD) \
--build-arg VERSION=${{ github.ref_name }} \
-f services/registry/Dockerfile --provenance=true --sbom=true .
- name: SBOM (Syft)
run: syft packages dir:. -o spdx-json > sbom.json
- name: Push & Sign (cosign keyless)
env:
COSIGN_EXPERIMENTAL: "true"
IMAGE: ghcr.io/connectsoft/ecs/registry
run: |
docker push $IMAGE:$(git rev-parse --short HEAD)
DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' $IMAGE:$(git rev-parse --short HEAD))
cosign sign --yes $DIGEST
cosign attest --type slsaprovenance --predicate provenance.json $DIGEST
- name: Open PR to ecs-env/dev
run: ./tools/build/update-image.sh ecs-env dev registry $DIGEST
Tagging scheme
- Source: `vX.Y.Z` (SemVer) in `/VERSION`
- Image: digest is authoritative; tags for traceability: `vX.Y.Z`, `sha-<7>`
- Chart: `appVersion: vX.Y.Z`, `version: vX.Y.Z+build.<sha7>`
Artifact Integrity: SBOM, Signing, Admission¶
- SBOM: SPDX JSON produced at build; attached to image and stored in release assets.
- Signing: `cosign` keyless (Fulcio) by default; Enterprise can use KMS-backed keys.
- Provenance: SLSA level 3/3+ with GitHub OIDC or Azure Pipelines OIDC attestations.
- Cluster admission: Kyverno/Gatekeeper policy requires:
- image pulled by digest,
- valid cosign signature from trusted issuer,
- SBOM label present (`org.opencontainers.image.sbom`),
- non-root, read-only FS, signed Helm chart (optional, `helm provenance`).
Example Kyverno snippet
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: verify-signed-images }
spec:
rules:
- name: check-cosign
match: { resources: { kinds: ["Pod"] } }
verifyImages:
- imageReferences: ["ghcr.io/connectsoft/ecs/*"]
attestors:
- entries:
- keyless:
issuer: "https://token.actions.githubusercontent.com"
subject: "repo:connectsoft/ecs-platform:*"
Helm Delivery Model¶
- ecs-charts contains:
  - `charts/ecs-svc` (library chart with HPA, PDB, ServiceMonitor, NetworkPolicy, PodSecurity)
  - `charts/registry`, `charts/policy`, `charts/orchestrator`, etc. (wrap the library chart)
- Values layering:
  - `values.yaml` (defaults)
  - `values.dev.yaml` / `values.stg.yaml` / `values.prod.yaml` (env)
  - `values.region.yaml` (region overrides)
- Argo CD renders Helm with env/region values from the ecs-env repo; blue/green namespaces are separate `Application` objects.
Chart values (excerpt)
image:
repository: ghcr.io/connectsoft/ecs/registry
digest: "sha256:abcd..." # immutable
resources:
requests: { cpu: "250m", memory: "512Mi" }
limits: { cpu: "1500m", memory: "1.5Gi" }
hpa:
enabled: true
targetInFlight: 400
pdb:
minAvailable: "70%"
otel:
enabled: true
Azure IaC with Bicep (ecs-infra)¶
- Modules per domain: `network`, `aks`, `acr`, `afd`, `asb`, `redis`, `kv`, `monitor`.
- Stacks per environment/region: `prod-weu`, `prod-neu`, `stg-weu`, etc.
Bicep structure
/bicep
/modules
aks.bicep
acr.bicep
afd.bicep
asb.bicep
redis.bicep
kv.bicep
/stacks
prod-weu.bicep
prod-neu.bicep
stg-weu.bicep
Snippet: AKS with Workload Identity + CSI KeyVault
module aks 'modules/aks.bicep' = {
name: 'aks-weu'
params: {
clusterName: 'ecs-aks-weu'
location: 'westeurope'
workloadIdentity: true
nodePools: [
{ name: 'system', vmSize: 'Standard_D4s_v5', mode: 'System', zones: [1,2,3] }
{ name: 'services', vmSize: 'Standard_D8s_v5', count: 6, zones: [1,2,3] }
]
addons: { omsAgent: true, keyVaultCsi: true }
}
}
Policy as Code
- Bicep lint + OPA/Conftest gate: deny public IPs on private services, enforce HTTPS, AKS RBAC, diagnostic settings, encryption.
Drift & Cost
- Nightly `what-if` runs with approval gates.
- Infracost checks on PRs for FinOps visibility.
Pulumi Option (app/edge infra or multi-cloud)¶
When multi-cloud or richer composition is needed, Pulumi TS/Go program can orchestrate:
- AFD routes, DNS, CDN origins,
- Argo CD apps (via Kubernetes provider),
- Cross-cloud Secrets and KMS bindings.
Pulumi TS excerpt
import * as k8s from "@pulumi/kubernetes";
const appNs = new k8s.core.v1.Namespace("prod-ecs-green", { metadata: { name: "prod-ecs-green" }});
new k8s.helm.v3.Chart("registry",
{ chart: "registry", repo: "oci://ghcr.io/connectsoft/ecs-charts",
namespace: appNs.metadata.name,
values: { image: { digest: process.env.IMAGE_DIGEST } }});
Choice: Bicep for Azure account-level infra; Pulumi for higher-level orchestration (optional).
Environment Promotion Policies (gates)¶
Policy Integration
- Stg→Prod promotions require a PDP `decide(operation=deploy.promote)`:
- SoD enforced (author ≠ approver),
- Risk score below threshold or 2 approvals,
- Change window active for prod,
- Evidence: test results, SLO canary checks green.
Automation
- A bot PR updates ecs-env `staging` → triggers an Argo Rollouts canary.
- Upon success, the bot opens a `prod` PR with:
- Helm value digest pin,
- Rollout strategy set (weights, analysis templates),
- Policy check job posts PDP outcome in PR status.
Required checks on prod PR
- E2E suite ✅
- SLO canary analysis ✅
- Policy PDP decision ✅
- Security attestations (cosign verify, SBOM present) ✅
- Manual approval (Approver role) ✅
Change freeze
- The `prod` branch is protected by a freeze label; PDP enforces the schedule; overrides require break-glass with an incident ticket.
Preview Environments¶
- On PR, create an ephemeral namespace `pr-<nr>` with limited quotas:
  - Deploy changed services + Studio UI, seeded with masked sample data.
  - A TTL controller cleans up after merge/close.
  - URL: `https://pr-<nr>.dev.ecs.connectsoft.io`.
Rollback Strategy¶
- Service rollback: revert env manifest to previous digest; Argo CD sync; keep blue namespace warm.
- Config rollback: flip alias to previous snapshot (constant-time).
- Automated: if canary SLOs fail, Rollouts auto-rollback and block prod PR.
Secrets & Credentials in CI/CD¶
- CI uses OIDC workload identity to:
- obtain short-lived push token for ACR,
- request cosign keyless cert,
- fetch non-production secrets from KV via federated identity.
- No long-lived secrets in CI; repo has secret scanning enforced.
Observability of Delivery¶
- Pipelines emit OTEL spans: `ecs.ci.build`, `ecs.ci.scan`, `ecs.cd.sync`, `ecs.cd.promote`.
- Metrics: build duration, success rate, MTTR for rollback, mean lead time.
- Dashboards: Delivery (DORA) + Supply Chain (signature verification %, SBOM coverage).
Acceptance Criteria (engineering hand-off)¶
- Repos created with scaffolds & templates; path-filtered CI in place.
- CI templates produce signed images, SBOM, SLSA provenance, push by digest.
- Kyverno/Gatekeeper admission verifying signatures & digest pinning.
- ecs-charts library and per-service charts published; ecs-env Argo apps wired for dev/stg/prod (blue/green).
- Bicep modules and environment stacks committed; what-if + Conftest on PR; Infracost enabled.
- Promotion gates integrated with PDP decisions, SLO canary checks, and manual approvals.
- Preview environments auto-spawn for PRs; TTL cleanup automated.
- Runbooks for failed canary, admission rejection, rollback, infra drift.
Solution Architect Notes¶
- Keep digest-only deploys non-negotiable; tags for humans, digests for machines.
- Prefer keyless signing to reduce key mgmt overhead; provide KMS fallback for regulated tenants.
- Centralize pipeline templates; services import them with minimal YAML.
- Consider Argo CD Image Updater only for lower environments; prod should remain PR-driven with explicit digests.
- Reassess scan noise quarterly; failing the build on high CVSS after a grace period keeps the supply chain healthy.
Operational Runbooks — on-call, incident playbooks, hotfix flow, config rollback drill, RTO/RPO¶
On-Call Model¶
Coverage & Roles¶
- 24×7 follow-the-sun rotations per geo (AMER / EMEA / APAC) with a global Incident Commander (IC) pool.
- Roles per incident:
- Incident Commander (IC) — owns timeline/decisions, delegates.
- Ops Lead — drives technical mitigation (AKS, Envoy, Redis, CRDB).
- Service SME — Registry/PDP/Orchestrator/Adapters/Studio.
- Comms Lead — customer/internal comms; status page.
- Scribe — live notes, artifacts, evidence pack.
- Security On-Call — joins if policy/access/secrets involved.
Handoff Checklists¶
Start of Shift
- Open NOC dashboard, confirm SLO tiles green for Resolve/Publish/PDP.
- Verify Pager healthy; run “test page” to self (silent).
- Confirm last handoff notes + open actions.
- Review scheduled change windows and release plans.
End of Shift
- Update handoff doc with:
- Open incidents, mitigations, remaining risk.
- Any temporary overrides (rate limits, feature flags).
- Pending hotfix/canary states.
Severity & Response Targets¶
| SEV | Definition | Examples | Target TTA | Target TTR |
|---|---|---|---|---|
| SEV-1 | Major outage / critical SLO breach | Resolve 5xx>5%, region down, data integrity risk | ≤ 5m | ≤ 60m |
| SEV-2 | Partial degradation / single-tenant severe | 429 rate for top tenant >10%, propagation lag p95>15s | ≤ 10m | ≤ 4h |
| SEV-3 | Minor impact / at risk | Canary failing, rising error trend, DLQ growth | ≤ 30m | ≤ 24h |
IC may escalate/downgrade; Security events can be SEV-1 regardless of scope.
Standard Incident Playbook (applies to all)¶
- Declare severity, assign roles, open incident channel & ticket.
- Stabilize (protect SLOs): enable rate-limits, switch to long-poll, open breakers per-tenant before global, serve stale data if allowed.
- Diagnose via run-of-show:
- Check golden signals dashboards (error %, p95/p99 latency, throughput).
- Jump to exemplar traces from SLO tiles.
- Inspect recent deploys (Argo/PR timestamps), feature flags, policy changes.
- Mitigate using scenario runbooks below.
- Communicate:
- Internal: every 15m or on change of state.
- External: initial notice ≤ 15m for SEV-1/2; updates every 30–60m.
- Recover to steady state; back out temporary overrides.
- Close with preliminary impact, customer list, follow-ups.
- Post-Incident Review within 3–5 business days (template at end).
Scenario Runbooks (diagnostics ➜ actions ➜ exit)¶
RB-S1 Resolve Latency / Error Spike¶
Trigger: p99 Resolve > 400 ms or error% > 1% (5m).
Diagnostics
- Dashboards: Resolve, Redis, CRDB, Gateway.
- Check `cache_hit_ratio`, `coalesce_ratio`, `waiter_pool_size`.
- Compare last deploy & policy changes.
Actions
- Raise in-flight HPA target temporarily (+25–50% pods); see the command sketch below.
- If Redis p95>3 ms or evictions>0: scale shards; enable SWR 2s.
- If hot tenant: lower tenant RPS; increase poll interval via policy obligation.
- If WS unstable: downgrade to long-poll.
Exit
- p99 ≤ 250–300 ms for ≥ 10m; error% ≤ 0.2%; roll back any temporary throttles gradually.
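A hedged command sketch for the scale-out and cache checks above (namespace, HPA name, Redis host, and port are placeholders for your environment):

```bash
# Temporarily raise headroom on the Registry HPA (~+50% pods); revert once stable.
kubectl -n prod-ecs-green get hpa registry -o jsonpath='{.spec.maxReplicas}{"\n"}'
kubectl -n prod-ecs-green patch hpa registry --type merge -p '{"spec":{"maxReplicas":18}}'

# Redis health: client-observed latency and evictions (Ctrl+C stops the latency sampler).
redis-cli -h "$REDIS_HOST" -p 6380 --tls --latency
redis-cli -h "$REDIS_HOST" -p 6380 --tls info stats | grep -E 'evicted_keys|keyspace_(hits|misses)'
```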
RB-S2 Propagation Lag / DLQ Growth¶
Trigger: propagation p95>8 s (15m) or DLQ>100 (10m).
Diagnostics
- Orchestrator consumer lag, bus quotas, `adapter_throttle_total`.
Actions
- Scale KEDA consumers; increase batch sizes (see the sketch below).
- Replay DLQ after confirming transient errors.
- If adapter throttled (Azure/AWS): lower batch per binding, increase backoff to 60 s.
- Limit publish waves (canary ➜ regional).
Exit
- Lag p95 ≤ 5 s; DLQ drained to steady baseline; no re-enqueue loops.
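A hedged sketch of the consumer scale-out and DLQ checks (resource group, namespace, queue, and ScaledObject names are placeholders):

```bash
# Raise the KEDA ceiling for the Orchestrator consumers.
kubectl -n prod-ecs-green get scaledobject
kubectl -n prod-ecs-green patch scaledobject orchestrator-consumer \
  --type merge -p '{"spec":{"maxReplicaCount":30}}'

# Backlog and DLQ depth on the change-propagation queue.
az servicebus queue show \
  --resource-group rg-ecs-prod-weu --namespace-name sb-ecs-weu --name config-changes \
  --query "{active:countDetails.activeMessageCount, dlq:countDetails.deadLetterMessageCount}"
```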
RB-S3 PDP / AuthZ Degradation¶
Trigger: PDP decision p99>20 ms, timeouts at edge, spike in denied decisions.
Diagnostics
- PDP latency and cache hits, policy bundle updates, gateway `ext_authz` errors.
Actions
- Scale PDP; warm cache by preloading effective policies.
- If PDP unreachable: edge uses last-good TTL ≤ 60 s; deny admin ops.
- Roll back recent policy bundle if regression suspected.
Exit
- p95 ≤ 5 ms; no authn/z timeouts; revert temp TTLs.
RB-S4 Region Impairment¶
Trigger: AFD probes failing for region; rising cross-zone errors.
Actions
- Fail traffic to healthy region via AFD; pause deploys.
- Switch affected tenants to read-only if their CRDB home region is down.
- Consider temporary re-pin of tenant data (follow CRDB playbook).
Exit
- Region restored; route traffic back to the local region; resume deploys after smoke tests.
RB-S5 Adapter / Provider Throttling¶
Trigger: adapter_throttle_total>0, provider 429s.
Actions
- Reduce per-binding concurrency & batch size; exponential backoff (up to 60 s).
- Mark binding degraded; notify tenant; queue ops for replay post-clear.
Exit
- No throttles for 30m; backlog cleared; re-enable normal concurrency.
RB-S6 Security Signal (Break-Glass / Suspicious Access)¶
Trigger: break-glass token used; abnormal policy denials; spike in auth failures.
Actions
- IC includes Security On-Call; elevate to SEV-1 if needed.
- Freeze non-essential changes; enable stricter PDP deny-by-default for admin ops.
- Export evidence pack; verify audit hash chain; rotate credentials if necessary.
Exit
- Root cause identified, evidence packaged, access reviewed & restored.
Hotfix Flow (code defects)¶
When to hotfix: Reproducible bug in latest release that materially impacts SLOs and cannot be mitigated by config/policy.
Steps
- Branch from the last prod tag: `release/vX.Y.Z-hotfix-N` (see the sketch after these steps).
- Minimal, targeted change + unit tests; bump patch version.
- CI: build ➜ SBOM ➜ sign ➜ push digest.
- Stage to staging with canary (5% ➜ 25% ➜ 50%); auto SLO analysis gates.
- Change window: If closed, IC invokes PDP override (audited).
- Promote to prod-green via Argo Rollouts canary; bake 10–15m.
- Flip traffic to green; keep blue warm for immediate rollback.
- Open follow-up PR to main with the same patch; prevent drift.
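A hedged sketch of the branch-and-promote mechanics (tag, commit SHA, and chart path are placeholders; CI performs the build/SBOM/sign/push on the pushed branch):

```bash
git fetch --tags origin
git checkout -b release/v1.9.0-hotfix-1 v1.9.0      # branch from the last prod tag
git cherry-pick <fix-commit-sha>                    # minimal, targeted change + tests

# Bump the patch version (chart path is an assumption about repo layout).
sed -i 's/^version: 1\.9\.0$/version: 1.9.1/' charts/registry/Chart.yaml
git commit -am "fix(registry): hotfix v1.9.1 (INC-####)"
git push -u origin release/v1.9.0-hotfix-1

# Open the hotfix PR; the same patch must also be PR'd to main to prevent drift.
gh pr create --fill --title "Hotfix v1.9.1"         # base branch per your branching model
```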
Backout
- Rollouts auto-rollback on SLO gate failure.
- Manual: set weights to 0/100 (green/blue), revert env manifest to previous digest.
Config Rollback Drill (blue/green alias)¶
Prereqs
- Current alias `prod-current` ➜ Blue snapshot `v1.8.3`.
- Candidate Green `v1.9.0` already validated.
Drill (quarterly, scripted)
- Announce start in #ops; open mock incident ticket.
- Flip alias: `ecsctl alias set --tenant t-123 --env prod --alias prod-current --version v1.9.0`
- Verify:
  - Registry emits ConfigPublished; Orchestrator fan-out started.
  - Sample SDK resolves show the new ETag; propagation p95 ≤ 5 s.
- Synthetic checks: canary services run health probes against critical keys.
- Rollback: `ecsctl alias set --tenant t-123 --env prod --alias prod-current --version v1.8.3`
- Confirm re-propagation; no errors.
- Record RTO for both directions; attach evidence (events, traces, timings); a timing sketch follows the success criteria.
- Reset to production state; close ticket with drill metrics.
Success criteria
- Cutover and rollback each ≤ 2 minutes end-to-end; zero 5xx spikes; no policy denials.
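A hedged timing helper for the drill, assuming an illustrative resolve endpoint, sample key, and a bearer token in `$TOKEN`; it measures how long the alias flip takes to surface as a new ETag:

```bash
url="https://api.ecs.connectsoft.io/v1/tenants/t-123/envs/prod/resolve/features/new-checkout"  # illustrative
etag() { curl -fsS -H "Authorization: Bearer $TOKEN" -D - -o /dev/null "$url" | awk 'tolower($1)=="etag:"{print $2}'; }

old=$(etag)
start=$(date +%s)
ecsctl alias set --tenant t-123 --env prod --alias prod-current --version v1.9.0
until [ "$(etag)" != "$old" ]; do sleep 1; done   # poll until the new ETag is visible to clients
echo "cutover propagation: $(( $(date +%s) - start ))s"
```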
RTO / RPO Objectives¶
| Scenario | Service/Data | RTO | RPO | Notes |
|---|---|---|---|---|
| Single pod/node loss | All stateless | < 2m | 0 | HPA + PDB recover |
| AZ loss | All | < 5m | 0 | Multi-AZ; zone spread |
| Region impairment | Resolve (read) | < 5m | 0 | AFD failover; long-poll reconnect |
| Region impairment | Writes (tenant home region pinned) | < 15m | ≤ 5m* | *If temporary re-pin of tenant rows |
| Redis shard failover | Cache | < 30s | 0 | Serve stale (SWR), bypass Redis |
| PDP degraded | Decisions | < 5m | 0 | Last-good cache ≤ 60s |
| CRDB restore (logical) | Snapshots/audit | ≤ 60m | ≤ 5m | Point-in-time + hourly export |
| Object store loss (warm tier) | Audit exports | ≤ 4h | ≤ 24h | Rebuild from hot + backups |
Communications Templates¶
Initial External (SEV-1/2)
We’re investigating degraded performance impacting configuration reads for some tenants starting at HH:MM UTC. Mitigations are in progress. Next update in 30 minutes. Reference: INC-####.
Update
Mitigation active (traffic shifted to Region X). Metrics improving; monitoring continues. ETA to resolution N minutes.
Resolved
Incident INC-#### resolved at HH:MM UTC. Root cause: …. We’ll publish a full review within 5 business days.
Post-Incident Review (PIR) Template¶
- Summary: what/when/who/impact scope.
- Customer impact: symptoms, duration, tenants affected.
- Timeline: detection ➜ declare ➜ mitigation ➜ recovery.
- Root cause analysis: technical + organizational.
- What worked / didn’t: detection, runbooks, tooling.
- Action items: owners, due dates (prevent/mitigate/detect).
- Evidence pack: dashboards, traces, logs, audit export.
- Policy updates: SLO/SLA, change windows, guardrails.
Toolbelt (quick refs)¶
- Rollouts & traffic
  - `kubectl argo rollouts get rollout registry -n prod-ecs-green`
  - `kubectl argo rollouts promote registry -n prod-ecs-green`
- Envoy/Gateway
  - `kubectl get httproute -A | grep registry`
  - Adjust weights via Helm values PR or emergency patch (IC approval required).
- ECSCTL
  - `ecsctl alias set …` (cutover/rollback)
  - `ecsctl refresh broadcast --tenant t-123 --env prod --prefix features/`
  - `ecsctl audit export --tenant t-123 --from … --to …`
- KEDA & Bus
  - `kubectl get scaledobject -A`
  - `az servicebus queue show … --query messageCount`
All emergency patches must be reconciled back to Git within the incident window.
Drills & Readiness¶
- Monthly: Config rollback drill (per major tenant).
- Quarterly: Region failover gameday; Redis shard failover; adapter throttling simulation.
- Semi-annual: Full disaster recovery restore test (CRDB PITR + audit verify).
- Track drill RTO/RPO and MTTR trends on Ops dashboard.
Acceptance Criteria (engineering hand-off)¶
- On-call rota, playbooks, and comms templates published in the Ops runbook repo.
- Pager integration wired to SLO burn-rate and key symptom alerts.
- “Big Red Button” actions scripted: alias flip, WS➜poll downgrade, tenant rate-limit override.
- Drill automation scripts (`ecsctl`, Helm helpers) committed and documented.
- PIR template enforced; incidents cannot be closed without action items and owners.
- RTO/RPO objectives encoded in DR test plans with last measured values.
Solution Architect Notes¶
- Favor per-tenant containment (rate limits, breakers, bindings) to preserve global SLOs.
- Bias toward mitigation over diagnosis in the first 10 minutes; restore service, then dig deep.
- Continue enriching runbooks with direct links to dashboards and ready-to-run commands for your platform.
- Measure runbook MTTA (time-to-action) during drills; shorten with automation and safe defaults.
Business Continuity & DR — geo-replication, failover orchestration, drills, compliance evidence¶
Objectives¶
Design and prove a business-continuity strategy that keeps ECS available through AZ/region failures while preserving data integrity, tenant isolation, and regulatory evidence. Define geo-replication, orchestrated failover/failback, regular drills, and auditable proof of meeting RTO/RPO.
Continuity Posture at a Glance¶
| Layer | Strategy | RTO | RPO | Notes |
|---|---|---|---|---|
| API/Gateway | Active-active across ≥2 regions (AFD + Envoy) | ≤ 5 min | 0 | Health-based routing; sticky to nearest allowed region |
| Config Registry data (CRDB) | Multi-region cluster, REGIONAL BY ROW (tenant home region) + PITR | ≤ 15 min (with re-pin) | ≤ 5 min* | *0 if home region up; ≤5 min on temporary re-pin |
| Cache (Redis) | Regional clusters, cache as disposable + (Ent: geo-replication) | ≤ 30 s | 0 | Serve stale (SWR) on shard fail; warm on failover |
| Events (Service Bus) | Premium namespaces per geo + DR alias pairing | ≤ 10 min | ≤ 1 min | Alias flip; DLQ preserved |
| Studio/Static | Multi-origin CDN (blue/green buckets) | ≤ 5 min | 0 | Immutable assets |
| Audit & Exports | Hot in CRDB + nightly Parquet to regional object store | ≤ 60 min | ≤ 24 h (cold) | Hot data meets app RPO; archive meets compliance |
| Policy Bundles | Multi-region PDP cache + signed bundles | ≤ 5 min | 0 | Edge “last-good” TTL ≤ 60s |
Geo-Replication Topology¶
flowchart LR
AFD[Azure Front Door + WAF] --> RG1[Envoy/Gateway - Region A]
AFD --> RG2[Envoy/Gateway - Region B]
subgraph Region A
REG1[Registry]-->CRDB[(CockroachDB)]
PDP1[PDP]
ORC1[Orchestrator]-->ASB1[(Service Bus A)]
WS1[WS/LP Bridge]
REDIS1[(Redis A)]
end
subgraph Region B
REG2[Registry]-->CRDB
PDP2[PDP]
ORC2[Orchestrator]-->ASB2[(Service Bus B)]
WS2[WS/LP Bridge]
REDIS2[(Redis B)]
end
ASB1 <--DR alias--> ASB2
CRDB --- Multi-Region Replication --- CRDB
Key design points
- Stateless services are active in all regions; sessionless by design.
- CRDB hosts a single logical cluster spanning regions; tenant rows are homed to a region for write locality; reads are global (see the sketch below).
- Redis is regional. On failover, caches are rebuilt; no cross-region consistency needed.
- Service Bus uses Geo-DR alias; producers/consumers bind via alias, not direct namespace name.
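A hedged sketch of the row-homing setup behind this design, wrapped in a `cockroach sql` call (database, table, column, and region names are illustrative):

```bash
cockroach sql --url "$CRDB_URL" <<'SQL'
-- Regions must match the cluster's node localities.
ALTER DATABASE ecs SET PRIMARY REGION "westeurope";
ALTER DATABASE ecs ADD REGION "northeurope";

-- Home each tenant's rows to its region for write locality; reads remain global.
-- Assumes a home_region column of type crdb_internal_region on the table.
ALTER TABLE config_items SET LOCALITY REGIONAL BY ROW AS home_region;
SQL
```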
Failover Orchestration (Region)¶
Decision Ladder¶
- Detect: AFD health probes red OR SLO burn-rate breach OR operator declares.
- Decide: IC invokes RegionFailoverPlan (automated policy: partial or full).
- Act: Route, Data, Events steps (below) in order.
- Verify: SLOs green, data path healthy, backlog drained.
- Communicate: status page + tenant comms.
- Recover/Failback: after root cause fixed and consistency verified.
Orchestration Steps (automated runbook)¶
sequenceDiagram
participant Mon as Monitor/SLO
participant IC as Incident Cmd
participant AFD as Azure Front Door
participant GW as Envoy/Gateway
participant BUS as Service Bus DR
participant REG as Registry
participant PDP as Policy
participant CRDB as CockroachDB
Mon-->>IC: Region A unhealthy / SLO burn
IC->>AFD: Disable Region A origins; 100% to Region B
IC->>BUS: Flip Geo-DR alias to Namespace B
IC->>REG: Set tenant mode = read-only for homeRegion=A
alt Extended outage > X min (policy)
IC->>CRDB: Re-pin affected tenants to Region B (scripted)
REG-->>PDP: Emit TenantRehomed obligations
end
IC->>REG: Trigger warm-up (keyheads) + Refresh broadcast
Mon-->>IC: SLOs recovered
IC->>Comms: Resolved update; start failback plan (separate window)
Controls
- Read-only mode: protects writes for tenants whose home region is down; SDKs continue reads via long-poll.
- Tenant re-pin (optional): migrate leaseholders/zone configs for the tenant partitions to Region B; audited and reversible.
Data Protection & Recovery¶
- Backups: CRDB full weekly + incremental hourly; PITR ≥ 7 days.
- Restore drills: Quarterly logical restores into staging namespace; verify checksums and lineage hashes.
- Audit chain: hash-chained audit events verified post-restore to prove integrity.
- Object store: nightly Parquet exports (signed manifest for Ent); regional buckets aligned to residency.
Eventing Continuity (Service Bus)¶
- Producer/consumer endpoints use DR alias name.
- Failover flips the alias from Namespace A → B; consumers reconnect automatically (see the sketch below).
- Ordering: not guaranteed across flip; idempotency keys protect replays.
- DLQ in active namespace is exported pre-flip and re-enqueued post-flip.
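A hedged sketch of the alias flip (resource group, namespace, and alias names are illustrative); the failover is initiated against the secondary namespace, and the alias keeps producer/consumer connection strings stable:

```bash
# Promote the secondary namespace behind the Geo-DR alias.
az servicebus georecovery-alias fail-over \
  --resource-group rg-ecs-neu \
  --namespace-name sb-ecs-neu \
  --alias ecs-events

# Confirm the alias role switched; clients reconnect via the alias automatically.
az servicebus georecovery-alias show \
  --resource-group rg-ecs-neu --namespace-name sb-ecs-neu --alias ecs-events \
  --query role
```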
Redis Strategy¶
- Treat as ephemeral: no cross-region state transfer required.
- On failover:
- Serve stale for ≤ 2 s (SWR) during shard promotions.
- Fire pre-warm job: touch common keyheads (per tenant/app).
- Observe hit ratio; scale shards if p95 latency > 3 ms or evictions > 0.
Failback (Return to Steady State)¶
- Health gate: region green ≥ 60 min; root cause fixed.
- Data: if tenants re-pinned, choose stay or migrate back (off-hours).
- Traffic: gradually re-enable AFD weight to original distribution.
- Events: DR alias back to primary namespace; drain backlog; compare counts.
- Post-ops: remove temporary overrides; finalize incident & PIR.
DR Exercises & Drill Catalog¶
| Drill | Scope | Frequency | Success Criteria |
|---|---|---|---|
| AZ Evacuation | Evict a zone; validate PDB, no SLO breach | Quarterly | p95 latency steady; 0 error spikes |
| Region Failover | Full traffic shift + Bus alias flip | Quarterly | RTO ≤ 5 min; Propagation p95 ≤ 5 s |
| Tenant Re-Pin | Move sample tenants home region | Semi-annual | RPO ≤ 5 min; no cross-tenant impact |
| CRDB PITR Restore | Point-in-time restore of config set | Semi-annual | Data matches lineage hash; audit chain verifies |
| Redis Shard Loss | Kill primary; observe recovery | Quarterly | RTO ≤ 30 s; SWR served |
| Policy/PDP Degrade | Simulate outage | Quarterly | Edge uses last-good; admin ops denied |
Automation
- `ecsctl dr plan <region>` prints an executable plan with guardrails.
- Chaos Mesh/Envoy faults wired to rehearsal scripts.
- Drill evidence exported automatically (metrics, traces, manifests).
Runbooks (excerpts)¶
RB-DR1 Region Failover (operator-driven)¶
- Declare SEV-1, assign roles.
- `afdctl region disable --name weu` (AFD origin group)
- `ecsctl bus dr-flip --alias ecs-events --to neu`
- `ecsctl tenant set-mode --home weu --mode read-only`
- If outage > X min: `ecsctl tenant repin --tenant t-*`
- `ecsctl cache warmup --region neu --tenant-sample 100`
- Verify: Resolve p95, error%, propagation, PDP latency.
- Comms: status page update; every 30m until resolved.
RB-DR2 Failback¶
- Confirm region healthy 60m; run smoke tests.
- If re-pinned: schedule change window; repin back or keep.
- Flip AFD back; bus alias flip; validate metrics.
- Close incident; attach evidence pack.
Compliance Evidence (SOC 2 / ISO 27001)¶
Artifacts produced automatically per drill or real event
- DR Runbook Execution Log: timestamped steps, operator IDs, commands, and results.
- SLO Evidence: before/after p95/p99, error %, burn-rate graphs with exemplars.
- Data Integrity Proof: audit hash-chain verification report; CRDB backup manifest & PITR point.
- Change Records: AFD config deltas, Bus alias flips, tenant mode changes (all audited).
- Post-Incident Review: root cause, corrective actions, owners & due dates.
Controls mapping
- A.17 (ISO 27001) / CC7.4 (SOC 2): BC/DR plan, tested with evidence.
- A.12 / CC3.x: Backups & restorations validated and logged.
- A.5 / CC6.x: Roles & responsibilities during incidents clearly defined.
Observability for BC/DR¶
- Spans:
ecs.dr.failover.plan,ecs.dr.failover.execute,ecs.dr.failback.execute. - Metrics:
dr_failover_duration_ms,dr_events_replayed_total,tenants_rehomed_total,afd_origin_healthy. - Alerts:
- Region health red and p95 > 300 ms → Page.
- DR alias mismatch vs intended → Warn.
audit_chain_verify_failures_total > 0after restore → Page.
Guardrails & Safety Checks¶
- Dry-run for all DR commands (diff/preview).
- Two-person rule for tenant re-pin and bus alias flips.
- Change windows enforced for failback; failover exempt under SEV-1.
- All actions idempotent and audited with correlation IDs.
Acceptance Criteria (engineering hand-off)¶
- AFD, Envoy, Service Bus aliasing, and CRDB multi-region configured per topology.
- `ecsctl` DR subcommands implemented (disable/enable region, bus DR flip, tenant mode, re-pin, cache warmup) with dry-run.
- Runbooks RB-DR1/RB-DR2 published; drill automation in CI (staging).
- Quarterly region failover drill executed with recorded RTO/RPO; evidence pack generated and stored.
- CRDB PITR enabled; restore procedure validated and documented.
- Alerts and dashboards for DR state live; incident templates linked to BC/DR plan.
Solution Architect Notes¶
- Keep tenant re-pin rare and audited; prefer read-only until primary region stabilizes.
- Treat Redis as rebuildable; don’t pay cross-region cache tax unless a clear SLA needs it.
- For high-regulation tenants, offer enhanced continuity (dual-home with stricter quorum) as an Enterprise add-on.
- Rehearse communications as much as technology—clear, timely updates reduce incident impact for customers.
Readiness & Handover — quality gates, checklists, PRR, cutover plan, training notes for Eng/DevOps/Support¶
Objectives¶
Ensure ECS can be safely operated in production by codifying quality gates, pre-flight checklists, a Production Readiness Review (PRR), the cutover run-of-show, and role-specific training for Engineering, DevOps/SRE, and Support. Outputs are actionable, auditable, and aligned to ConnectSoft standards (Security-First, Observability-First, Clean Architecture, SaaS multi-tenant).
Quality Gates (build → deploy → operate)¶
| Gate | Purpose | Evidence / Automation | Blocker if Fails |
|---|---|---|---|
| Supply Chain | Provenance & integrity | Cosign signatures verified in admission; SBOM attached; image pulled by digest | ✅ |
| Security | Hardening & secrets | SAST/dep scan=Clean or accepted; container scan=No High/Critical; mTLS on; secret refs enforced | ✅ |
| Contracts | API/gRPC stability | Backward-compat tests; schema compatibility; policy DSL parser tests | ✅ |
| Performance | Meet p95/p99 | Resolve p99 ≤ targets; publish→refresh p95 ≤ SLO; capacity headroom ≥ 2× | ✅ |
| Resiliency | Degrade gracefully | Retry/circuit/bulkhead tests; chaos suite C-1…C-7 green | ✅ |
| Observability | Trace/metrics/logs ready | OTEL spans present; dashboards published; SLO alerts wired; exemplars linkable | ✅ |
| Compliance | Audit, PII, SoD | Audit hash chains; SoD enforced; retention set; evidence pack job runs | ✅ |
| FinOps | Cost guardrails | Usage meters emitting; budgets configured; anomaly jobs active | ⚠ (warn) |
| Ops Runbooks | Operability | Incident & DR runbooks in repo; drills executed within last 90d | ✅ |
| Docs & Support | Handover complete | Knowledge base articles; FAQ; escalation tree | ⚠ |
All ✅ gates are mandatory for promotion from staging → production.
Production Readiness Review (PRR)¶
Owner: Solution Architect (chair) + SRE + Security + Product. Format: single session with artifact walkthrough; outcomes recorded.
PRR Checklist (submit 48h before meeting)¶
- Architecture & Risks
- Service diagrams current; ADRs for key decisions
- Threat model updated; mitigations tracked
- Security & Compliance
- SBOM, signatures, provenance attached to images
- Pen-test/DAST findings triaged (no High open)
- Audit pipeline verified; daily manifest signing (if Ent)
- PII posture reviewed; DSAR playbook linked
- Performance & Scale
- Load test report (p99, throughput, coalescing ratio)
- HPA/KEDA policies reviewed; surge plan validated
- Redis sizing workbook & eviction alerts validated
- Resiliency & DR
- Chaos results (C-1..C-7) with pass/fail & action items
- Region failover drill RTO/RPO evidence
- Observability
- SLOs encoded; burn-rate alerts live
- Dashboards: SRE, Tenant Health, Security, Adapters
- Log redaction tests ✅
- Operations
- On-call rota & paging configured; escalation tree
- Runbooks: latency spike, DLQ growth, authZ degrade
- Hotfix flow rehearsed; rollback scripts (`ecsctl`) present
- Change Management
- Policy packs reviewed (change windows, approvals)
- CAB sign-off (if required)
- Documentation & Support
- Admin/tenant guides, SDK quickstarts, Studio user guide
- Support KB: top 20 issues + macros
- Go/No-Go
- Go-criteria met (see table)
- Backout plan validated
Go / No-Go Criteria¶
| Area | Go Criteria | Evidence |
|---|---|---|
| Availability | Resolve availability ≥ 99.95% in staging week | SLO metric export |
| Latency | Resolve p99 ≤ 150 ms (hit) / ≤ 400 ms (miss) | Perf report + dashboards |
| Security | 0 High/Critical vulns exploitable; mTLS enabled | Scan reports, config |
| DR | Region failover RTO ≤ 5 min; PITR demo ≤ 60 min | Drill logs + evidence pack |
| Obs | All golden dashboards present; alerts firing in test | Screenshots/links |
| Ops | On-call & comms templates finalized | Pager config + docs |
| Cost | Budgets & alerts active; forecast within plan | FinOps dashboard |
Pre-Flight Checklists (by domain)¶
Platform (once per region)¶
- AFD origins healthy; WAF policies applied
- Envoy routes with ext_authz, RLS, traffic splits
- AKS nodepools spread across 3 AZs; PDBs applied
- Argo CD and Rollouts healthy; image digests pinned
- KEDA & HPA metrics sources ready (Prometheus)
Data & Storage¶
- CockroachDB multi-region up; leaseholder locality by tenant
- PITR window configured; backup jobs green
- Redis cluster shards steady; no evictions; latency p95 < 3 ms
Eventing & Adapters¶
- Service Bus DR alias validated (flip test in staging)
- DLQs empty; replay tested
- Adapter credentials & quotas verified; watch cursors in sync
Security¶
- JWKS rotation tested; last rotation < 90d
- Secret ref resolution (`kvref://`) e2e test passes
- Admission policies (cosign verify, non-root) enforced
Observability¶
- OTEL collector pipelines tail-sampling active
- SLO burn-rate alerts mapped to PagerDuty
- Audit chain verification job succeeded in last 24h
Cutover Plan (run-of-show)¶
Timeline (example anchors)¶
- T-7d: Staging canary passes SLO gates; PRR complete; tenant comms drafted
- T-1d: Freeze non-essential changes; final data validation; backup checkpoint
- T-0: Production green namespace deployed; 5% → 25% → 50% canary with analysis
- T+30–60m: Flip traffic 100% to green; keep blue hot for rollback (2–24h window)
- T+24h: Post-cutover review; decommission blue after sign-off
Roles¶
- IC (lead), Ops Lead, Service SME, Comms Lead, Scribe
Steps (T-0 detailed)¶
- Announce start; open bridge & incident room (change record)
- Verify prerequisites (see Pre-Flight) and go/no-go check
- Argo Rollouts canary start; watch SLO analysis; hold at 50% for 10–15m
- Promote to 100%; observe p95/p99, error %, propagation lag
- Trigger config canary on limited tenants (`prod-next` alias) if planned
- Flip alias to `prod-current`; validate refresh events; sample SDK resolve shows the new ETag
- Hypercare window: monitor, keep blue ready; publish customer update
- Close with initial success metrics; schedule 24h review
Backout (any time)¶
- Service: set weights to 0/100 (green/blue), sync prev digest
- Config: revert alias pointer to last known good version
- Comms: issue backout notice; continue root-cause work
Handover Package (what Engineering delivers to Ops/Support)¶
| Artifact | Description |
|---|---|
| Service Runbook | Start/stop, health checks, dependencies, common faults |
| Resiliency Profile | Timeouts, retries, backoff, bulkheads, circuits (effective values) |
| SLO/SLA Sheet | SLIs, targets, alert policies, escalation paths |
| Dashboards Index | Links to Golden Signals, Tenant Health, Security, Adapters |
| Config Reference | Default values, schema refs, policy overlays per edition |
| Playbooks | Incident scenarios, DR steps, hotfix flow |
| Release Notes | Known issues, feature flags, migration notes |
| Test Evidence | Perf, chaos, DR drills, security scans |
| Compliance Evidence | Audit manifest sample, retention policy, PII posture |
| Contact Matrix | SMEs by component, backup ICs, vendor contacts |
All artifacts live in ops/runbooks, ops/evidence, and tenant-safe copies in Support KB.
Training Notes (role-based)¶
Engineering¶
- Goal: Own code in prod responsibly.
- Modules
- Architecture & domain model (2h)
- Resiliency & chaos toolkit (90m)
- Observability deep-dive: tracing taxonomies, SLOs (90m)
- Secure coding + secret refs (60m)
- Release & hotfix procedures (60m)
- Labs
- Break and fix: inject 200 ms Redis latency → verify degrade path
- Canary failure drill with Argo Rollouts
DevOps / SRE¶
- Goal: Operate at SLO; minimize MTTR.
- Modules
- DR & failover orchestration (2h)
- Capacity & autoscaling (HPA/KEDA) (90m)
- FinOps dashboards & budgets (60m)
- Compliance evidence generation (45m)
- Labs
- Region failover simulation; measure RTO/RPO
- DLQ growth → replay & recovery
Support / CSE¶
- Goal: First-line diagnosis & clear comms.
- Modules
- Studio & API basics; common errors (90m)
- Reading dashboards; when to escalate (60m)
- Tenant cost & quota coaching (45m)
- Macros & KB
- 304 vs 200 ETag explainer, rate-limit guidance, policy denial decoding, rollback steps
- Escalation
- SEV thresholds, IC paging, evidence to capture (tenant, route, `x-correlation-id`)
Day-0 / Day-1 / Day-2 Operations¶
- Day-0 (Cutover): execute plan; hypercare (24h); record metrics snapshot
- Day-1 (Stabilize): finalize blue tear-down; adjust HPA/KEDA; clear temporary overrides
- Day-2+ (Operate): weekly perf+cost review; monthly rollback drill; quarterly DR/chaos gameday
Checklists (copy-paste ready)¶
Go/No-Go (final 1h)
- All SLO tiles green 30m
- No High/Critical CVEs in diff since staging
- DR alias & AFD health verified
- Pager test fired to on-call
- Backout commands pasted & dry-run output saved
Post-Cutover (T+60m)
- Error% ≤ baseline; p95/p99 ≤ targets
- Propagation p95 ≤ 5s; DLQ normal
- Customer comms sent; status page updated
- Blue kept hot; timer set for 24h review
Acceptance Criteria (engineering hand-off)¶
- PRR completed with Go decision; artifacts archived.
- All Quality Gates enforced in CI/CD and verified in admission.
- Pre-flight checklists executed and recorded per region.
- Cutover plan executed with evidence bundle (metrics, traces, audit exports).
- Handover package delivered to Ops/Support; training sessions completed and tracked.
- Backout tested; RTO/RPO measured and logged.
Solution Architect Notes¶
- Keep runbooks short and command-first; link deep docs rather than embedding.
- Treat training as a product: measure retention via quarterly drills and refreshers.
- Automate evidence packs after PRR, cutover, and drills—compliance should be a by-product of good operations.