🛠 External Configuration Server¶
🎯 Vision¶
The External Configuration Server (ECS) will serve as a centralized, multi-tenant, secure, and dynamic configuration service for all microservices in the ConnectSoft AI Factory ecosystem. It allows runtime reconfiguration without redeployment, ensures tenant isolation, and enforces policy-driven overrides across editions and environments.
ECS transforms configuration from a static, file-driven activity into a policy-aware, observable, and externalized runtime service.
🌍 Problem Context¶
Traditional microservices suffer from:
- 📄 Config files baked into deployments (requiring restarts on change)
- 🌐 Scattered config sources (env vars, files, feature flags, secrets)
- ❌ No central observability (difficult to trace config changes)
- 🧩 Multi-tenant conflict (different tenants require different policies)
In a factory-scale system with 3000+ agents and microservices, this creates risk, inconsistency, and operational drag.
🧩 Purpose of ECS¶
ECS is designed to:
| Objective | ECS Responsibility |
|---|---|
| Centralize configuration | Single source of truth for runtime and environment variables |
| Enable dynamic updates | Push config changes at runtime without redeployment |
| Enforce multi-tenancy & editions | Per-tenant and per-edition overrides with RBAC enforcement |
| Ensure observability | Full audit trail, metrics, and tracing of configuration requests & mutations |
| Integrate with ecosystem | Provide APIs, gRPC, and event-driven notifications for microservices and agents |
🏗 Position in Architecture¶
ECS acts as an infrastructure microservice in the ConnectSoft ecosystem. It is consumed by:
- All domain microservices → for runtime configs
- Architect Agents → for blueprint-level config references
- DevOps Orchestrators → for rollout, environment management, and secrets injection
- QA Cluster → for validating edition/role-specific behavior under config variants
- Studio UI → for editing and visualizing tenant-specific config
🧬 High-Level Diagram – ECS in Factory¶
flowchart TD
subgraph Factory Microservices
A1[Service A] --> ECS
A2[Service B] --> ECS
A3[Service C] --> ECS
end
subgraph Core Infrastructure
ECS[External Configuration Server]
Vault[(Secrets / Key Vault)]
Bus[(Event Bus)]
end
subgraph Agents
ArchitectAgent --> ECS
DevOpsAgent --> ECS
QAAgent --> ECS
StudioUI --> ECS
end
ECS --> Vault
ECS --> Bus
📘 Principles¶
- DDD-Centric: Config treated as a domain object with aggregates (TenantConfig, EditionConfig).
- Clean Architecture: Ports/adapters for REST, gRPC, Event Bus.
- Event-Driven: Publish config changes as CloudEvents.
- Observability-First: All reads/writes are traceable with OpenTelemetry.
- Security-Hardened: Role-based access, integration with OpenIddict/Azure AD.
- Cloud-Native: Designed for AKS, scaling with caching + distributed persistence.
The External Configuration Server will become the backbone of runtime flexibility in the ConnectSoft AI Factory, enabling:
- 🔄 Safe, dynamic reconfiguration
- 🏢 Tenant- and edition-aware overrides
- 📊 Observability and auditability
- ⚡ Reliable integration via APIs, gRPC, and event streams
Without ECS, the Factory remains brittle, environment-dependent, and costly to scale across tenants.
🛠 Functional & Non-Functional Requirements¶
📋 Functional Requirements¶
The External Configuration Server (ECS) must deliver a comprehensive feature set to support the ConnectSoft Factory’s scale and principles.
| Area | Functional Requirement | Description / Notes |
|---|---|---|
| Configuration Management | CRUD for Config Objects | Create, update, delete, and retrieve configuration entities (TenantConfig, EditionConfig, ServiceConfig). |
| | Hierarchical Config Resolution | Resolve configuration in layers: Global → Edition → Tenant → Service. Supports overrides and fallbacks. |
| | Versioned Config | All configurations are versioned to allow rollbacks, diffs, and audit trails. |
| | Dynamic Updates | Config changes are pushed in real time via gRPC streams / Event Bus. |
| | Environment-Aware Config | Separate resolution paths for Dev, Test, Staging, Production. |
| Multi-Tenancy | Tenant Isolation | Each tenant's configuration is logically and physically isolated. |
| | Edition Overrides | Editions can apply overrides at the feature or property level. |
| Access & Security | Role-Based Access Control (RBAC) | Fine-grained policies: who can read, write, or override configs. |
| | Integration with Identity Providers | Supports OpenIddict and Azure AD for authentication and authorization. |
| | Secret Handling | Sensitive values delegated to Key Vault/Secrets Manager (not stored in ECS DB). |
| APIs & Interfaces | REST API | CRUD + query endpoints. |
| | gRPC Interface | High-performance communication for microservices. |
| | Event Streaming | Publish config changes as CloudEvents to Kafka/Service Bus/NATS. |
| | Admin UI / Studio Integration | Visualization and management of configs. |
| Observability | Full Audit Logging | Every read/write recorded with user/service identity and traceId. |
| | Metrics | Config read/write latency, cache hit/miss, per-tenant stats. |
| | Distributed Tracing | OpenTelemetry traces link config requests with service spans. |
📊 Non-Functional Requirements¶
| Category | Requirement | Target / Notes |
|---|---|---|
| Scalability | Horizontal scaling | ECS must handle thousands of config reads/sec across 3000+ services. |
| | Caching Layer | Configs cached in-memory + distributed cache (Redis) with invalidation on change. |
| Availability | High availability | ≥99.9% SLA, active-active across regions. |
| | Zero-downtime updates | Config server upgrades must not disrupt consumers. |
| Performance | Low-latency reads | <50 ms per config resolution under load. |
| | High throughput | Support ≥20K config resolutions/min per cluster. |
| Security | Data encryption | TLS 1.3 in transit, AES-256 at rest. |
| | Secret isolation | No secrets persisted; all resolved via Vault/Key Vault. |
| Reliability | Idempotency | Config change events deduplicated using semanticHash + traceId. |
| | Retry mechanisms | Client SDKs and server ensure at-least-once delivery. |
| Compliance & Audit | Full traceability | Retain config change history for ≥2 years. |
| | Policy enforcement | Built-in compliance checks (e.g., no empty values for critical configs). |
🧬 Diagram – Config Resolution Layers¶
flowchart TB
Global[Global Config] --> Edition[Edition Config]
Edition --> Tenant[Tenant Config]
Tenant --> Service[Service Config]
Service --> Final[Resolved Runtime Config]
Final -->|Delivered to| Microservice
Interpretation:
- Global defaults apply first.
- Editions override global.
- Tenants override editions.
- Services override tenants.
- Result = fully resolved, policy-compliant runtime config.
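A minimal C# sketch of this layered merge (illustrative types, not the ECS API): layers are applied lowest-precedence first, so a service-level value shadows tenant, edition, and global defaults key by key.

```csharp
using System.Collections.Generic;

public static class ConfigMerger
{
    // Lower-precedence layers first: Global -> Edition -> Tenant -> Service.
    // A null layer (e.g., no tenant override) is simply skipped.
    public static IReadOnlyDictionary<string, string> Resolve(
        params IReadOnlyDictionary<string, string>?[] layers)
    {
        var resolved = new Dictionary<string, string>();
        foreach (var layer in layers)
        {
            if (layer is null) continue;
            foreach (var pair in layer)
                resolved[pair.Key] = pair.Value; // closer scope overrides
        }
        return resolved;
    }
}

// Usage: var runtime = ConfigMerger.Resolve(global, edition, tenant, service);
```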
The External Configuration Server must support multi-level, versioned, observable, and secure configuration management at scale. Key requirements include:
- Functional: CRUD, versioning, overrides, real-time updates, multi-tenant enforcement.
- Non-Functional: High availability, low-latency reads, strong security, full audit trail.
These requirements form the acceptance criteria for the PRD, guiding architecture and implementation decisions.
🛠 Domain Model & DDD Aggregates¶
🧩 Domain-Driven Design Context¶
The ECS is not just a key-value store; it is a domain service where configuration is a first-class aggregate with policies, lifecycle, and traceability. We treat configuration as structured, multi-level domain objects that evolve over time.
🏗 Key Aggregates & Entities¶
| Aggregate / Entity | Description | Key Properties | Relationships |
|---|---|---|---|
| TenantConfig (Aggregate Root) | Represents the configuration context of a specific tenant. | TenantId, ConfigSet, Overrides, CreatedBy, UpdatedBy, Version | Contains multiple ServiceConfigs; inherits defaults from EditionConfig. |
| EditionConfig (Aggregate Root) | Captures configuration defaults for a specific product edition (e.g., Basic, Pro, Enterprise). | EditionId, ConfigSet, Overrides, PolicyRules, Version | Linked to many TenantConfigs; overrides GlobalConfig. |
| GlobalConfig (Aggregate Root) | Global system-wide defaults applied before edition or tenant overrides. | GlobalId, DefaultValues, Policies, Version | Parent to EditionConfig. |
| ServiceConfig (Entity) | Service-specific configuration for a tenant (e.g., BillingServiceConfig). | ServiceId, ConfigItems, Version | Child of TenantConfig. |
| ConfigItem (Value Object) | Atomic key-value pair with metadata. | Key, Value, Type, IsSecret, IsFeatureToggle | Immutable; part of ServiceConfig or higher config. |
| ConfigVersion (Entity) | Snapshot of configuration state at a point in time. | VersionId, Timestamp, ChangeLog, Hash | Belongs to any Config Aggregate for rollback. |
| PolicyRule (Entity) | Validation and enforcement rules (e.g., “must not be empty”, “allowed range”). | RuleId, Scope, Expression, Severity | Applied during config creation/updates. |
| AuditLog (Entity) | Records all reads/writes with full context. | LogId, Actor, Action, TraceId, Timestamp | Linked to TenantConfig/EditionConfig changes. |
📚 Relationships in ECS¶
erDiagram
GlobalConfig ||--o{ EditionConfig : defines
EditionConfig ||--o{ TenantConfig : appliesTo
TenantConfig ||--o{ ServiceConfig : contains
ServiceConfig ||--o{ ConfigItem : holds
TenantConfig ||--o{ ConfigVersion : snapshots
TenantConfig ||--o{ AuditLog : records
PolicyRule ||--o{ ConfigItem : validates
⚖️ Aggregate Responsibilities¶
- GlobalConfig – foundation of defaults across the ecosystem.
- EditionConfig – edition-specific overrides and policies.
- TenantConfig – tenant-specific configuration scope, isolation, and service configs.
- ServiceConfig – service-level overrides within a tenant.
- ConfigItem – immutable value objects ensuring integrity.
- ConfigVersion – ensures traceability, rollback, and deterministic replays.
- AuditLog – maintains compliance and accountability.
- PolicyRule – ensures configs remain valid and secure.
📘 Domain Events¶
ECS aggregates will emit domain events (later mapped to CloudEvents):
`ConfigCreated`, `ConfigUpdated`, `ConfigDeleted`, `ConfigVersioned`, `ConfigRollbackPerformed`, `PolicyViolationDetected`
These events feed event sourcing, observability, and downstream automation (QA, DevOps, Agents).
We’ve defined the domain model for ECS using DDD principles:
- Aggregates: GlobalConfig, EditionConfig, TenantConfig
- Supporting entities: ServiceConfig, ConfigVersion, PolicyRule, AuditLog
- Value objects: ConfigItem
- Domain events for traceability
This structure ensures clear boundaries, scalability, and auditability of configurations across 3000+ services.
🛠 High‑Level Architecture (HLA)¶
🎯 Architecture Goals & Constraints¶
Goals
- Low‑latency, highly available config reads for 3k+ services
- Deterministic resolution across Global → Edition → Tenant → Service
- Real‑time change propagation (streaming + events)
- Tenant/edition isolation with RBAC and full auditability
- Cloud‑native deployment (AKS), observability‑first, security‑first
Constraints
- No secrets at rest in ECS (resolve by reference to Key Vault)
- Clean Architecture + DDD boundaries
- Event‑driven integration (CloudEvents over MassTransit/Azure Service Bus)
- Idempotent writes and cache‑safe reads
🧱 Clean Architecture Layers (Ports & Adapters)¶
| Layer | Responsibilities | Examples (ECS) |
|---|---|---|
| Domain | Aggregates, policies, invariants, domain events | TenantConfig, EditionConfig, PolicyRule, ConfigVersion |
| Application | Use cases, orchestration, transactions | ResolveConfig, PublishChange, CreateSnapshot, Rollback |
| Interfaces (Adapters) | REST, gRPC, events, CLI/Admin UI adapter | Controllers, gRPC services, Event publisher/subscriber |
| Infrastructure | Persistence, cache, secrets, bus, telemetry | NHibernate/Azure SQL, Redis, Blob, Key Vault client, MassTransit |
🗺️ Component Topology (Logical)¶
graph TD
subgraph Clients
AppSDK["Client SDK (.NET/JS/Java)"]
Studio[Studio/Admin UI]
end
subgraph ECS Core
API["Config API (REST/gRPC)"]
AdminAPI[Admin API]
Resolver[Resolver Service]
Policy[Policy Engine]
Notifier[Change Notifier]
Streamer[Watch/Streamer]
Snapshotter[Snapshot & Archive Worker]
Projector[Audit Projector]
end
subgraph Data & Infra
SQL[(Azure SQL / PostgreSQL)]
Redis[(Redis Cache)]
Blob[(Blob Storage: Snapshots)]
KV[(Azure Key Vault)]
Bus[(Azure Service Bus / Kafka via MassTransit)]
OTEL[(OTel Exporters)]
end
AppSDK-->API
Studio-->AdminAPI
API-->Resolver
Resolver-->Policy
Resolver--hot read-->Redis
Resolver--strong read-->SQL
Resolver--secret refs-->KV
AdminAPI--writes-->SQL
AdminAPI--emit events-->Bus
Notifier--fanout-->Bus
Streamer--server push-->AppSDK
Snapshotter--store-->Blob
Projector--audit views-->SQL
API--telemetry-->OTEL
AdminAPI--telemetry-->OTEL
🔁 Core Execution Flows¶
1) Read/Resolve (hot path)¶
sequenceDiagram
participant S as Service (Client SDK)
participant API as Config API
participant R as Resolver
participant C as Redis Cache
participant DB as SQL Store
participant KV as Key Vault
S->>API: GET /v1/config/resolve?tenant&env&app&prefix
API->>R: Resolve(request, principal, scopes)
R->>C: TryGet(prefix, etag)
alt CacheHit
C-->>R: Resolved payload + etag
else Miss/Stale
R->>DB: Query hierarchical config (G→E→T→S)
R->>KV: Resolve secret references (if any)
R->>C: Set(prefix, resolved, etag, TTL)
end
R-->>API: Resolved config + etag
API-->>S: 200 OK (ETag, traceId)
Notes
- ETag for client‑side caching; If‑None‑Match supported.
- Secret values never stored in SQL/Redis — only resolved at read via Key Vault.
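On the client side, the conditional read can be as small as the following hedged sketch; the resolve route matches the sequence above, while the class and caching fields are illustrative assumptions.

```csharp
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public sealed class EcsReadClient
{
    private readonly HttpClient _http;
    private EntityTagHeaderValue? _etag; // last ETag returned for this prefix
    private string? _cachedPayload;      // last resolved config body

    public EcsReadClient(HttpClient http) => _http = http;

    // Conditional read: send If-None-Match; on 304 reuse the local copy.
    public async Task<string?> ResolveAsync(string tenant, string env, string app, string prefix)
    {
        var request = new HttpRequestMessage(HttpMethod.Get,
            $"/v1/config/resolve?tenant={tenant}&env={env}&app={app}&prefix={prefix}");
        if (_etag is not null)
            request.Headers.IfNoneMatch.Add(_etag);

        using var response = await _http.SendAsync(request);
        if (response.StatusCode == HttpStatusCode.NotModified)
            return _cachedPayload; // server confirmed our copy is current

        response.EnsureSuccessStatusCode();
        _etag = response.Headers.ETag;
        _cachedPayload = await response.Content.ReadAsStringAsync();
        return _cachedPayload;
    }
}
```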
2) Write/Publish (admin path)¶
sequenceDiagram
participant Admin as Admin UI/CLI
participant AAPI as Admin API
participant DB as SQL
participant Policy as Policy Engine
participant Bus as Event Bus
participant C as Redis
Admin->>AAPI: POST /v1/config/{scope} (If-Match: etag)
AAPI->>Policy: Validate (syntax, rules, scope, RBAC)
Policy-->>AAPI: OK/Violation
alt OK
AAPI->>DB: Upsert aggregate (new version)
AAPI->>Bus: Publish ConfigUpdated(cloudevent)
AAPI->>C: Invalidate affected prefixes
AAPI-->>Admin: 202 Accepted (versionId)
else Violation
AAPI-->>Admin: 400/409 with details
end
Notes
- Optimistic concurrency via ETag/Version.
- Idempotency key (semanticHash, traceId) to dedupe repeated submissions.
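One plausible way to derive the semanticHash half of that idempotency key is to hash a canonical, key-ordered rendering of the payload so identical submissions collide regardless of property order; the helper below is an assumption, not the ECS implementation.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public static class IdempotencyKey
{
    // Same keys/values in any property order -> same semanticHash.
    public static string SemanticHash(JsonElement payload)
    {
        var canonical = Canonicalize(payload);
        var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(bytes).ToLowerInvariant();
    }

    private static string Canonicalize(JsonElement e) => e.ValueKind switch
    {
        JsonValueKind.Object => "{" + string.Join(",",
            e.EnumerateObject()
             .OrderBy(p => p.Name, StringComparer.Ordinal)
             .Select(p => $"\"{p.Name}\":{Canonicalize(p.Value)}")) + "}",
        JsonValueKind.Array => "[" + string.Join(",",
            e.EnumerateArray().Select(Canonicalize)) + "]",
        _ => e.GetRawText() // strings, numbers, booleans, null
    };
}
```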
3) Watch/Streaming updates¶
- gRPC streaming or SSE/WebSocket.
- Server pushes diffs and new etag/version when relevant keys change.
- SDK applies delta and refreshes local cache; raises typed callbacks.
4) Snapshot & Rollback¶
- Snapshotter runs scheduled jobs or on‑demand to persist full resolved views to Blob.
- Rollback creates a new version from a prior snapshot (never mutates history).
- Projector maintains audit/read models for Studio timelines.
🧠 Tenancy, Editions & Scoping¶
- Scope model: Global → Edition → Tenant → Environment → Service → Tag(s)
- Deterministic precedence; last‑write wins at same level with version guards.
- All queries filtered by tenantId, edition, environment, and RBAC scopes.
- Policy Engine enforces forbidden prefixes and required keys per scope.
🚀 Performance & Caching Strategy¶
- Read‑optimized: Redis front‑cache for prefixes/namespaces; stale‑while‑revalidate.
- Write invalidation: precise key‑space eviction (prefix wildcards).
- Batched DB queries with computed overlay merge at the server (single roundtrip).
- Hot paths instrumented with P95 targets; background warmers for critical tenants.
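A compact sketch of the stale-while-revalidate read behavior, assuming one in-flight background refresh per key is sufficient; the cache type is illustrative, not the ECS implementation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Per-pod cache with stale-while-revalidate: expired entries are served
// immediately while a single background refresh repopulates them.
public sealed class SwrCache<TValue>
{
    private sealed record Entry(TValue Value, DateTimeOffset FreshUntil, Task? Refresh);

    private readonly ConcurrentDictionary<string, Entry> _entries = new();
    private readonly TimeSpan _ttl;

    public SwrCache(TimeSpan ttl) => _ttl = ttl;

    public async Task<TValue> GetAsync(string key, Func<string, Task<TValue>> load)
    {
        if (_entries.TryGetValue(key, out var entry))
        {
            if (DateTimeOffset.UtcNow < entry.FreshUntil)
                return entry.Value;                              // fresh hit

            if (entry.Refresh is null)                           // stale: start one refresh
                _entries[key] = entry with { Refresh = RefreshAsync(key, load) };
            return entry.Value;                                  // serve stale meanwhile
        }

        var value = await load(key);                             // cold miss: load inline
        _entries[key] = new Entry(value, DateTimeOffset.UtcNow + _ttl, null);
        return value;
    }

    // Error handling and single-flight races are simplified for the sketch.
    private async Task RefreshAsync(string key, Func<string, Task<TValue>> load)
    {
        var value = await load(key);
        _entries[key] = new Entry(value, DateTimeOffset.UtcNow + _ttl, null);
    }

    public void Invalidate(string key) => _entries.TryRemove(key, out _);
}
```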
🔐 Security Architecture¶
- AuthN: OIDC (OpenIddict/AAD).
- AuthZ: Tenant‑scoped RBAC (Admin/Operator/Reader/Auditor). Fine‑grained by key prefix and scope.
- Data: TLS 1.3, encryption at rest, allow‑list of output content types (JSON/YAML/INI).
- No secret persistence: store only references; resolve via Key Vault with managed identity.
- Signed CloudEvents to prevent tampering; event payload links to audit record.
📡 Observability & Audit¶
- OpenTelemetry traces on every request (`traceId`, `tenantId`, `actor`, `scope`, `etag`, `result`).
- Metrics: read latency, cache hit ratio, per‑tenant QPS, write throughput, policy violations, stream fan‑out lag.
- Structured logs with redaction and PII guards.
- Audit Projector provides timeline views: who changed what, when, why (reason/issue link).
🧩 Persistence Model¶
- Relational store (Azure SQL/PostgreSQL) for aggregates, versions, policies, audit projections.
- Blob for snapshots/archives and large documents.
- Redis for hot key‑spaces (read path).
- Event Bus for reactive consumers (cache invalidation, stream fan‑out, CI hooks).
🛠 Technology Choices (concrete)¶
- .NET 8 / ASP.NET Core
- NHibernate (relational persistence), FluentValidation for DTOs
- MassTransit + Azure Service Bus (events)
- Redis (StackExchange.Redis), Azure SQL/PG (primary store), Azure Blob, Key Vault
- gRPC for SDK channel; REST for admin and operational APIs
- OpenTelemetry + Serilog; Grafana/Prometheus/App Insights
🧯 Reliability Patterns¶
- Idempotent writes (semanticHash + traceId)
- Retry with jitter for transient DB/Bus ops; circuit breakers around KV/DB
- Backpressure on stream fan‑out; per‑tenant rate limits
- Dead‑letter topics for failed events; replay from snapshots
🧪 Failure Scenarios & Mitigations (examples)¶
| Scenario | Mitigation |
|---|---|
| Redis outage | Fallback to DB read (degraded latency), disable SWR, increase result TTL on recovery |
| Key Vault throttling | Cache positive resolutions with short TTL; exponential backoff; circuit to “secret‑unavailable” marker |
| Event Bus lag | Streamer polls on pull‑mode fallback; admin banner indicates lagged propagation |
| Hot‑spot tenant | Per‑tenant cache partitioning; prefix sharding; SDK exponential backoff |
☸️ Deployment Topology (AKS)¶
- Stateless API pods behind internal LB; HPA based on RPS/latency
- Streamer replicas scaled by subscription count
- Workers (Snapshotter/Projector) with KEDA triggers (queue depth/CRON)
- Zonal redundancy; rolling upgrades; PodDisruptionBudgets
- Helm charts with environment overlays; GitOps optional
📑 ADR Backlog (to be authored)¶
- ADR‑001: Hierarchical Resolution Strategy (server‑side overlay, determinism)
- ADR‑002: Secrets by Reference vs inline storage (KV integration)
- ADR‑003: Event Transport selection (ASB vs Kafka) and CloudEvents schema
- ADR‑004: Cache Topology (prefix caches, invalidation, SWR)
- ADR‑005: Streaming Protocol (gRPC vs SSE/WebSockets) and backpressure
- ADR‑006: Persistence (Azure SQL vs PostgreSQL) and NHibernate mappings
- ADR‑007: RBAC Model (tenant/environment/key‑prefix scopes)
- ADR‑008: Snapshot & Rollback semantics (append‑only versioning)
- ADR‑009: Observability Defaults (required spans, logs, metrics)
We defined the ECS high‑level architecture: clean layering, core components, execution flows, tenancy/security enforcement, observability, and cloud‑native deployment. This blueprint is ready for detailed APIs, schemas, and SDK contracts.
✅ ECS PRD: Functional & Non‑Functional Requirements (Ready for ADO Epics)¶
Scope: External Configuration Server (ECS) for multi‑tenant, edition‑aware, policy‑driven runtime configuration across ConnectSoft microservices (.NET, Clean Architecture, DDD, EDA).
Outcome: A complete PRD slice that we can decompose into Azure DevOps Epics/Features/Stories in the next cycle.
🧭 Product Scope & Boundaries¶
In‑scope
- Centralized configuration resolution and delivery for services (REST/gRPC/SDK).
- Multi‑tenant + multi‑edition overrides with policy enforcement (RBAC, ABAC).
- Environment layering (global → environment → edition → tenant → service → instance).
- Versioning, rollout, preview, audit, and rollback.
- Observability (metrics, logs, traces, audit trail) and event notifications.
- Integrations: Azure Service Bus (events), Azure Key Vault (secrets reference), Redis (edge cache).
Out‑of‑scope (v1)
- Secrets storage (managed by Key Vault; ECS stores references).
- Feature experimentation framework (flags supported; experiments later).
- UI Studio admin (initially minimal CRUD UI; advanced UX later).
🧩 Domain Model (DDD)¶
Aggregates
- `Tenant(TenantId, Name, Status, Editions[])`
- `Edition(EditionId, Name, PolicySet)`
- `ConfigBundle(BundleId, Scope, Keys[], Version, CreatedBy, CreatedAt)`
- `ConfigItem(Key, Value, Type, SchemaRef, Metadata)`
- `Policy(RBAC roles, ABAC conditions, constraints e.g., max TTL)`
- `ResolutionRequest(ServiceId, TenantId, EditionId, Environment, InstanceTags[])`
- `ResolutionResult(ResolvedMap, VersionGraph, SourceTrace[], ETag)`
Scopes & Precedence (highest wins)
- Instance (Pod/Slot)
- Service (Microservice)
- Tenant
- Edition
- Environment (dev/stage/prod)
- Global
Conflict resolution = closest scope wins, then latest effective version, then policy constraint.
🧠 Core Functional Requirements¶
1) Read/Resolve¶
- FR‑R1: Resolve configuration at runtime: `Resolve(serviceId, tenantId, editionId, env, tags[]) → map`
- FR‑R2: Conditional resolution with preview mode (no write/audit side‑effects).
- FR‑R3: Strong caching: ETag support; 304 semantics; per-scope TTL.
- FR‑R4: Watch/Subscribe for change events via gRPC streaming or SSE.
- FR‑R5: ResolveTrace: return provenance (which scopes/versions produced each key).
2) Write/Manage¶
- FR‑W1: CRUD for Bundles & Items per scope; batch upserts with schema validation.
- FR‑W2: Versioning on each change.
- FR‑W3: Rollback to prior version.
- FR‑W4: Policy aware writes (RBAC roles + ABAC e.g., “only Ops may change prod”).
- FR‑W5: Dry‑run validation (schema + policy + conflict preview).
3) Policy & Security¶
- FR‑P1: RBAC roles: `ECS.Admin`, `ECS.Editor`, `ECS.Auditor`, `ServiceAccount`.
- FR‑P2: ABAC conditions: environment, tenant, edition, label selectors.
- FR‑P3: Audit every read/write (who/what/when/from where, masked values).
- FR‑P4: Secrets as references: `kvref://{vault}/{secretName}@{version}`
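A hedged sketch of resolving such a reference at read time with the Azure Key Vault SDK; the parsing of the `kvref` scheme (vault name → vault URI, `name@version` path) is an assumption based on the format above.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

public static class KvRefResolver
{
    // Resolves "kvref://{vault}/{secretName}@{version}" at read time.
    // The ECS store only ever holds the reference string, never the value.
    public static async Task<string> ResolveAsync(string kvref)
    {
        var uri = new Uri(kvref); // scheme=kvref, host=vault, path=/secretName@version
        var vaultUri = new Uri($"https://{uri.Host}.vault.azure.net/");
        var nameAndVersion = uri.AbsolutePath.TrimStart('/').Split('@', 2);
        var name = nameAndVersion[0];
        var version = nameAndVersion.Length > 1 ? nameAndVersion[1] : null;

        var client = new SecretClient(vaultUri, new DefaultAzureCredential());
        KeyVaultSecret secret = await client.GetSecretAsync(name, version);
        return secret.Value;
    }
}
```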
4) Events & Integration¶
- FR‑E1: Emit CloudEvents: `ConfigChanged`, `PolicyChanged`, `BundlePublished`, `RollbackPerformed`.
- FR‑E2: Outbox pattern to Azure Service Bus; retries + idempotency keys.
- FR‑E3: SDK callback hooks for hot‑reload in services.
5) Observability¶
- FR‑O1: OTEL spans for resolve & write.
- FR‑O2: Metrics (p95 resolve latency, cache hit ratio, 4xx/5xx).
- FR‑O3: Structured logs with `traceId`, `tenantId`, `serviceId`, `scope`, `keyCount`.
🔒 Security & Compliance Requirements (NFR‑Security)¶
- NFR‑S1: OIDC (OpenIddict/Azure AD). Service‑to‑service via client credentials; scopes per operation.
- NFR‑S2: Data encryption at rest (DB) + TLS in transit.
- NFR‑S3: PII/secret redaction in logs & audit streams; schema flags: `sensitivity: pii|secret`.
- NFR‑S4: Multi‑tenant isolation in data partition & query filters; per‑tenant keys/ETags.
⚙️ Performance & Reliability (NFR‑Perf/Rel)¶
- NFR‑P1: p95 resolve ≤ 30 ms (hot path with Redis); cold ≤ 150 ms.
- NFR‑P2: Peak read QPS ≥ 20k (horizontally scalable API replicas).
- NFR‑R1: Availability SLO 99.95% for read APIs.
- NFR‑R2: Zero‑downtime publish; rolling upgrades; blue/green for storage migrations.
- NFR‑C1: Config size per resolution ≤ 512 KB (soft limit), item size ≤ 16 KB (hard).
🧪 Quality & Validation (NFR‑QA)¶
- NFR‑Q1: Contract tests for REST/gRPC (& SDK) with golden fixtures.
- NFR‑Q2: Chaos tests: cache node loss, bus outage, DB failover; system remains read‑available (stale‑ok).
- NFR‑Q3: Security tests for RBAC/ABAC bypass attempts; negative-path coverage for ≥ 95% of rules.
🌉 External Interfaces¶
REST (subset)¶
- `GET /v1/resolve?serviceId&tenantId&editionId&env&tags=...` → `200 {data, eTag, trace}`
- `GET /v1/watch?serviceId...` (SSE) → `event: ConfigChanged`
- `POST /v1/bundles/{scope}` (upsert; JSON Schema validation)
- `POST /v1/preview/resolve` (dry‑run)
- `POST /v1/rollback` (bundleId, targetVersion)
gRPC¶
- `Resolve()` unary; `Watch()` server stream with backoff hints.
Events (Azure Service Bus topics)¶
- `ecs.config.changed`
- `ecs.bundle.published`
- `ecs.policy.changed`
- `ecs.bundle.rolledback`
🧰 Client SDK (dotnet)¶
Package: ConnectSoft.Ecs.Client
Features:
- Typed options binding: `services.AddEcsConfig<TOptions>("namespace:keyPrefix")`
- Hot reload via `IOptionsMonitor` and gRPC watch
- Circuit‑breaker + per‑key caching with ETag
- Secrets resolver (`kvref://`) with Key Vault client
builder.Services.AddEcsClient(o =>
{
o.ServiceId = "billing-service";
o.Environment = "prod";
o.Edition = "enterprise";
o.TenantIdProvider = () => TenantContext.Current?.TenantId;
});
🏗️ High‑Level Architecture¶
flowchart LR
subgraph Clients
SvcA[Service A]:::svc -->|Resolve/Watch| API
SvcB[Service B]:::svc -->|Resolve/Watch| API
end
subgraph ECS
API[REST/gRPC API]:::core
Resolver[Resolver Engine]:::core
Policy[Policy Engine RBAC/ABAC]:::core
Cache[(Redis Edge Cache)]:::infra
Store[(Config DB)]:::infra
Outbox[Outbox]:::core
Bus[[Azure Service Bus]]:::infra
Audit[(Audit Store)]:::infra
KV[[Azure Key Vault]]:::infra
end
API --> Resolver --> Cache
Resolver <--> Store
Resolver --> Policy
Resolver --> KV
API --> Outbox --> Bus
API --> Audit
classDef core fill:#E0F2FE,stroke:#38BDF8,stroke-width:1.2px;
classDef infra fill:#F5F3FF,stroke:#8B5CF6,stroke-width:1.2px;
classDef svc fill:#ECFCCB,stroke:#84CC16,stroke-width:1.2px;
Storage options (pluggable via Clean Architecture):
- Primary: SQL (PostgreSQL/SQL Server) via NHibernate.
- Optional: Cosmos DB provider.
- Cache: Redis (clustered).
🔁 Resolution Algorithm (Deterministic)¶
1. Identify scope chain from request metadata.
2. Load latest effective versions for each scope (bundle snapshots).
3. Merge maps top‑down (global→env→edition→tenant→service→instance).
4. Apply policy constraints (deny/override/mask).
5. Expand secret references (Key Vault) if caller has scope (`allowInlineSecrets=false` by default).
6. Compute ETag (stable hash); return map + SourceTrace.
Edge cases:
- Missing keys → default value policy (deny/allow default).
- Conflicts → highest precedence wins; policy can block high precedence if violating constraints.
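FR‑R5's SourceTrace falls out of the same deterministic merge if each layer records its scope; in this illustrative C# sketch, the last entry in a key's trace is the winning scope.

```csharp
using System.Collections.Generic;

public sealed record ScopedLayer(string Scope, IReadOnlyDictionary<string, string> Items);

public static class TracedResolver
{
    // Merge lowest precedence first (Global ... Instance); record, per key,
    // every scope that supplied a value so ResolveTrace can show overrides.
    public static (Dictionary<string, string> Resolved,
                   Dictionary<string, List<string>> SourceTrace)
        Resolve(IEnumerable<ScopedLayer> layersLowToHigh)
    {
        var resolved = new Dictionary<string, string>();
        var trace = new Dictionary<string, List<string>>();
        foreach (var layer in layersLowToHigh)
        {
            foreach (var pair in layer.Items)
            {
                resolved[pair.Key] = pair.Value; // closest scope wins
                if (!trace.TryGetValue(pair.Key, out var sources))
                    trace[pair.Key] = sources = new List<string>();
                sources.Add(layer.Scope);        // last entry = winning scope
            }
        }
        return (resolved, trace);
    }
}
```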
📊 Observability Contract¶
Metrics
- `ecs_resolve_latency_ms` (p50/p95/p99)
- `ecs_cache_hit_ratio`
- `ecs_resolve_qps`
- `ecs_write_errors_total`
- `ecs_stream_clients_gauge`
Logs & Traces
- Enriched with `traceId`, `tenantId`, `serviceId`, `scope`, `bundleId`, `version`, redaction flags.
Audit
who, when, what, where (ip/user-agent), before/after diff (masked).
🚀 Delivery & Rollout¶
- Blue/Green API pods, Redis with keyspace notifications off (we push invalidate signals via Bus).
- Write path: transactional write → outbox → publish → cache invalidate (fan‑out key patterns).
- Hot Reload: streaming watchers receive `ConfigChanged` with ETag → services re‑resolve.
🧯 Failure Modes & Recovery¶
- Cache outage → fallback to DB (latency up; circuit policy).
- Bus outage → outbox retry; watchers retry with jitter.
- DB outage → serve last-known ETag from cache for `N` minutes (stale‑ok policy); emits degraded events.
📐 Acceptance Criteria (Representative)¶
- AC‑R‑01: Given a tenant + service + edition chain, when a config key exists in multiple scopes, then the service‑scoped value is returned and SourceTrace lists all overridden values.
- AC‑W‑02: A write with invalid schema returns `422` and does not create a new version; audit shows failure reason.
- AC‑P‑03: A user with `ECS.Editor` cannot write to `env=prod` without an `abac: env=prod` condition → `403`.
- AC‑E‑04: On successful publish, a `ConfigChanged` event is emitted and ≥95% of subscribed services observe the change within 3s (under nominal load).
- AC‑O‑05: p95 resolve latency ≤ 30 ms with warm cache under 5k RPS.
📦 Data Model (Simplified)¶
ConfigBundle:
  bundleId: guid
  scope:
    level: Global|Environment|Edition|Tenant|Service|Instance
    identifiers: { environment?, editionId?, tenantId?, serviceId?, instanceId? }
  version: int
  items: [ConfigItem]
  tags: [string]
  checksum: string
  createdBy: userId
  createdAt: datetime
  status: Draft|Published|Deprecated

ConfigItem:
  key: string
  type: string # string|int|bool|json|uri|kvref
  value: any
  schemaRef: uri?
  metadata: { sensitivity?: pii|secret, ttl?: int, description?: string }
🗺️ Azure DevOps Backlog Shape (Preview — detailed breakdown next cycle)¶
Epic A — Resolution & Delivery
- Feature A1: Resolve API + gRPC + ETag
- Feature A2: Watch/Subscribe streaming
- Feature A3: Redis caching + invalidation
Epic B — Authoring & Versioning
- Feature B1: Bundle/Item CRUD + batch upsert
- Feature B2: Versioning & rollback
- Feature B3: Preview/dry‑run + diff
Epic C — Policy & Security
- Feature C1: RBAC/ABAC engine
- Feature C2: Audit & redaction
- Feature C3: OIDC integration & scopes
Epic D — Events & SDK
- Feature D1: CloudEvents + Outbox to ASB
- Feature D2: .NET SDK (OptionsMonitor, KeyVault resolver)
- Feature D3: Sample integration in 2 reference services
Epic E — Observability & SLOs
- Feature E1: OTEL spans, metrics, logs
- Feature E2: Dashboards & SLO checks
- Feature E3: Chaos/resiliency tests
📌 Definitions of Done (per Feature)¶
- ✅ API/gRPC contract approved (OpenAPI/proto + breaking change check).
- ✅ Unit/integration/contract tests ≥ 85% line/branch in feature scope.
- ✅ OTEL + logs + metrics + audit events present & validated.
- ✅ Security checks (RBAC/ABAC/PII redaction) pass.
- ✅ Load tests meet SLOs (p95, error rate, cache hit).
- ✅ Docs: README, runbook, examples, SDK snippet.
🔗 Integration Points & APIs¶
🎯 Purpose¶
The External Configuration Server (ECS) must expose consistent, secure, and flexible APIs to integrate with all consumers in the ConnectSoft AI Factory ecosystem. These interfaces enable real-time retrieval, updates, and subscriptions to configuration values, ensuring both human and machine actors interact seamlessly.
📡 Integration Interfaces¶
| Interface Type | Purpose | Consumers |
|---|---|---|
| REST API | CRUD operations on tenant/edition configs, metadata management | Studio UI, DevOps tools, external systems |
| gRPC | High-performance, strongly typed API for runtime config resolution | Microservices, Agents (low-latency config fetch) |
| Event Bus | Publish/subscribe to configuration changes as CloudEvents v1.0 | Domain services, Architect Agents, QA Agents |
| SDKs/Libraries | Thin client libraries (C#, TypeScript, Python) for easy config injection | Factory microservices, test harnesses |
📘 API Contracts¶
REST¶
- `GET /config/{tenant}/{service}` → Retrieve active config
- `POST /config/{tenant}/{service}` → Create/update config
- `GET /history/{tenant}/{service}` → Fetch config audit trail
- `POST /rollback/{tenant}/{service}` → Rollback to prior version
gRPC¶
- `ResolveConfig(ResolveRequest) → ResolveResponse`
- `StreamConfigUpdates(SubscriptionRequest) → stream ConfigEvent`
Events (CloudEvents)¶
- Type: `factory.ecs.v1.config.updated`
- Attributes: `tenantId`, `editionId`, `serviceName`, `version`, `traceId`, `semanticHash`
- DataRef: points to signed configuration snapshot in blob store
🧩 SDK & Client Libraries¶
- C# SDK (`ECS.Client`)
  - `ConfigProvider.Resolve(service, tenant)`
  - `ConfigProvider.Subscribe(service, tenant, callback)`
- TypeScript SDK (for UI & frontend agents)
  - Hooks for live subscription updates
- Python SDK (for AI agents & ML services)
  - Easy access to tenant configs in training/inference pipelines
🏗 Diagram – ECS Integration Surfaces¶
flowchart LR
StudioUI -->|REST| ECS
DevOpsAgent -->|REST| ECS
MicroserviceA -->|gRPC| ECS
MicroserviceB -->|gRPC| ECS
ArchitectAgent -->|Events| ECS
QAAgent -->|Events| ECS
ECS -->|SDKs| ClientLibs[(C#/TS/Python SDKs)]
📘 Principles for Integration¶
- Idempotency: All API updates are idempotent (semanticHash-based).
- Traceability: Each API call returns a `traceId` for correlation.
- Security-First: All endpoints secured with OAuth2 scopes (e.g., `ecs.read`, `ecs.write`).
- Versioning: APIs versioned (`/v1/`, `vNext`) with strong backward compatibility.
- Polyglot Support: SDKs generated using OpenAPI/gRPC tooling for multiple languages.
➡️ With these integration points, ECS becomes an accessible backbone service, allowing any microservice, agent, or human operator to retrieve and react to configuration in real-time.
🏢 Multi-Tenancy & Edition Management¶
🎯 Purpose¶
The External Configuration Server (ECS) must support multi-tenancy and edition-aware configuration management as a first-class concern. This ensures that different customers (tenants) and their specific editions (pricing or functional tiers) can receive isolated, policy-driven configurations without conflict.
🌍 Multi-Tenancy Principles¶
- Isolation by Tenant → Configurations are stored, resolved, and audited per tenant boundary.
- Shared Infrastructure, Segregated Data → ECS runs as a shared service, but configuration data is tenant-scoped.
- RBAC-Scoped Access → Only authorized roles (e.g., Tenant Admin, Factory Operator) can mutate tenant-specific configs.
- Noisy Neighbor Protection → Rate limiting and quotas prevent one tenant from overloading ECS capacity.
🧩 Edition Awareness¶
- Base Business Model → Provides default configuration values (global, edition-agnostic).
- Edition Overrides → Editions (e.g., Free, Standard, Enterprise) override global values for functionality toggles, limits, or integrations.
- Tenant Overrides → Individual tenants may further override edition defaults, within defined policies.
- Policy Guardrails → Edition-level limits (e.g., “Enterprise supports unlimited API calls, Free has 10k/month”) are enforced automatically.
📘 Hierarchical Resolution Model¶
Config resolution follows a hierarchical override chain:
Example:
- Default: `MaxUsers = 100`
- Edition: Enterprise → `MaxUsers = 1000`
- Tenant ABC (Enterprise) → `MaxUsers = 1500` (if allowed by policy)
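A tiny sketch of the guardrail implied by "if allowed by policy": the tenant override is accepted only up to a hypothetical edition-defined ceiling.

```csharp
using System;

// Hypothetical guardrail: a tenant may raise a numeric limit above the
// edition default only up to a policy-defined ceiling.
public sealed record OverridePolicy(int MaxAllowed);

public static class OverrideGuard
{
    public static int ApplyTenantOverride(int editionValue, int? tenantValue, OverridePolicy policy)
    {
        if (tenantValue is null)
            return editionValue;                 // no override requested
        if (tenantValue.Value > policy.MaxAllowed)
            throw new InvalidOperationException(
                $"Tenant override {tenantValue} exceeds policy ceiling {policy.MaxAllowed}.");
        return tenantValue.Value;                // e.g. MaxUsers = 1500 for tenant ABC
    }
}
```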
🔐 Security & Governance¶
- Per-Tenant Encryption → Config values encrypted with tenant-specific keys in Azure Key Vault.
- Audit Scope → Every change linked to tenant, edition, actor, and `traceId`.
- Cross-Tenant Safety → ECS enforces absolute separation; no tenant can read another’s configuration.
🏗 Diagram – Multi-Tenant & Edition Config Flow¶
flowchart TD
Global[🌐 Global Config] --> Edition[📦 Edition Config]
Edition --> Tenant[🏢 Tenant Config]
Tenant --> Runtime[⚡ Runtime Override]
Runtime --> ECS[(External Config Server)]
ECS --> ServiceA[Service A]
ECS --> ServiceB[Service B]
📊 Example Use Case¶
| Tenant | Edition | Config Key | Value |
|---|---|---|---|
| Global | Default | FeatureX | OFF |
| All | Enterprise | FeatureX | ON |
| ABC | Enterprise | FeatureX | ON, custom rate 500/s |
| XYZ | Free | FeatureX | OFF |
📘 Principles for Multi-Tenant Management¶
- Edition-Aware Inheritance: Every tenant config inherits from its edition baseline.
- Policy-Driven Overrides: Tenants can override edition configs only within guardrails.
- Strict Isolation: No cross-tenant data visibility or leakage.
- Auditability: All tenant/edition config changes fully traceable.
- Dynamic Runtime Resolution: Overrides are resolved on every config fetch (no redeploys).
➡️ With this, ECS becomes SaaS-ready: one service, securely serving thousands of tenants with multiple editions, while preserving isolation, scalability, and flexibility.
💾 Persistence & Storage Design¶
🎯 Purpose¶
The External Configuration Server (ECS) requires a robust persistence layer to store, version, and retrieve configuration data across tenants, editions, and environments. This cycle defines how ECS persists configuration artifacts, ensures durability, and enables auditability at scale.
🏛 Core Principles¶
- Multi-Model Storage → Structured metadata in SQL; flexible config documents in NoSQL.
- Immutable Versioning → Every change creates a new version; rollback is always possible.
- Separation of Concerns → Config metadata (who/when/where) stored apart from config payloads (what).
- Tenant & Edition Isolation → Partitioning strategies prevent cross-tenant conflicts.
- Cloud-Native Durability → Built on Azure SQL + Cosmos DB + Key Vault with geo-redundancy.
🗂 Data Model¶
Entities¶
- ConfigItem
  - `ConfigId` (GUID)
  - `TenantId`
  - `EditionId`
  - `Environment` (Dev, QA, Prod)
  - `PayloadRef` (pointer to storage blob/NoSQL doc)
- ConfigVersion
  - `VersionId`
  - `ConfigId` (FK)
  - `Hash` (SHA-256 for integrity)
  - `CreatedAt`
  - `CreatedBy` (User/Service ID)
  - `ChangeReason`
- AuditLog
  - `EntryId`
  - `ConfigId`
  - `VersionId`
  - `Actor`
  - `Action` (Create, Update, Rollback)
  - `TraceId`
🗄 Storage Technologies¶
| Layer | Technology | Purpose |
|---|---|---|
| Metadata | Azure SQL | Strong consistency for IDs, references, indexes |
| Payload Storage | Azure Cosmos DB | Flexible JSON docs for config payloads |
| Secrets | Azure Key Vault | Encrypt sensitive values per-tenant |
| Backups/Archives | Azure Blob | Cold storage of snapshots & exports |
🔄 Versioning & History¶
- Every config update produces a new immutable version.
- Versions can be queried:
  - Latest only (default)
  - Specific version (for rollback/testing)
  - Range queries (audit & troubleshooting)
- ECS guarantees event-sourced persistence → “What changed, when, and by whom” is always reconstructable.
🗃 Partitioning & Indexing¶
- TenantId + EditionId = Primary partition key in Cosmos DB.
- Environment indexed for fast lookups.
- Hot configs cached in Redis for ultra-low-latency retrieval.
🔐 Security¶
- All payloads encrypted at rest (AES-256).
- Per-tenant encryption keys in Key Vault.
- Fine-grained RBAC on read/write at DB layer.
- Write operations only through ECS API (no direct DB access).
🏗 Diagram – Persistence Flow¶
flowchart TD
App[Client Service] --> ECS[(External Config Server)]
ECS --> SQL[(Azure SQL Metadata)]
ECS --> COSMOS[(Cosmos DB Payloads)]
ECS --> KeyVault[(Azure Key Vault)]
ECS --> Blob[(Azure Blob Snapshots)]
📊 Example¶
- Tenant: VetClinicCo
- Edition: Enterprise
- Env: Production
- Config: `MaxConcurrentJobs = 500`
Version history:
- v1 → 100 (default)
- v2 → 200 (edition override)
- v3 → 500 (tenant override, policy-approved)
All versions persist in Cosmos DB with metadata in SQL and secured values in Key Vault.
✅ Principles Recap¶
- Immutable Versioning – No destructive updates.
- Multi-Model Storage – SQL + NoSQL hybrid design.
- Per-Tenant Isolation – Partition + encryption per tenant.
- Audit-First – Every change logged with traceability.
- Cloud-Native Resilience – High availability + geo-redundant backup.
🔐 Security & Access Control¶
🎯 Goals¶
The External Configuration Server (ECS) must be secure by design, ensuring that only authenticated and authorized actors can access or modify configuration data. Security must address multi-tenancy, API surface hardening, and end-to-end trust across environments.
🧩 Security Requirements¶
| Requirement | Implementation Strategy |
|---|---|
| Authentication | OpenIddict / Azure AD integration for OAuth2 & OIDC. |
| Authorization (RBAC/ABAC) | Fine-grained role-based and attribute-based access per tenant, edition, and environment. |
| Tenant Isolation | Scoped tokens that prevent cross-tenant data access. Tenant ID enforced in every query. |
| Secure Communication | gRPC over mTLS + REST endpoints with TLS 1.3. |
| Secrets Handling | Integrate with Key Vault (Azure Key Vault or HashiCorp Vault) for secret material. |
| Policy-Driven Overrides | Apply policies that enforce configuration rules at runtime (e.g., edition cannot override tenant global). |
| Audit & Traceability | Every read/write secured with claims and logged with trace context. |
🔑 Authentication & Identity Flow¶
sequenceDiagram
participant Client as Microservice/Agent
participant ECS as External Config Server
participant IDP as Identity Provider (OpenIddict/Azure AD)
Client->>IDP: Request Token (Client Credentials / On-Behalf-Of)
IDP-->>Client: Access Token (JWT w/ claims)
Client->>ECS: Call API w/ Bearer Token
ECS->>IDP: Validate Token (signature, expiry, audience, tenant claims)
ECS-->>Client: Authorized Response (config values)
- Supported Grants: Client Credentials (service-to-service), Authorization Code (Studio UI), On-Behalf-Of (agent delegation).
- JWT Validation: Issuer, audience, expiry, tenant ID, roles.
🛡️ Authorization Model¶
Roles¶
- ConfigAdmin → Full control over tenant configs.
- ConfigEditor → Create/update configs within tenant scope.
- ConfigReader → Read-only access.
- SystemObserver → Access to audit/logs for compliance.
Enforcement¶
- RBAC enforced at API layer (per endpoint).
- ABAC used for policy enforcement (e.g., “Edition config overrides allowed only if flag enabled”).
🏢 Tenant Isolation¶
- Each request must carry a TenantID claim (in token).
- ECS enforces row-level isolation at persistence level (per-tenant schemas or partition keys).
- No cross-tenant visibility in APIs or events.
🔐 Secrets Handling¶
- ECS does not store secrets directly; instead, it stores references (URIs, Key Vault keys).
- Integration with:
  - Azure Key Vault for cloud deployments.
  - HashiCorp Vault for hybrid setups.
- ECS fetches secrets at runtime with least-privilege identity (Managed Identity in Azure).
🔏 Secure API Surface¶
- REST → secured via TLS 1.3 + OAuth2 Bearer tokens.
- gRPC → secured via mTLS (mutual certificate-based auth) + OAuth2.
- Rate limiting & throttling to prevent abuse.
- CORS restrictions for Studio UI.
⚖️ Policy-Driven Overrides¶
- Global → Edition → Tenant → Environment → Service hierarchy.
- Policies enforce:
  - No tenant override of system-critical keys.
  - Edition policies can restrict tenant-level overrides.
  - Environments can enforce stricter security keys (prod > dev).
📊 Audit & Compliance¶
- Every config read/write logged with:
  - Actor identity (user/service).
  - Tenant & edition.
  - Trace context (OpenTelemetry).
- Immutable audit logs → stored in append-only storage (e.g., Azure Blob immutability policies).
- Compliance: SOC2, GDPR (data access restrictions), ISO 27001 alignment.
✅ ECS gains end-to-end trust, fine-grained authorization, and policy-driven safeguards—making it resilient against misuse while enabling flexible tenant/edition operations.
📈 Observability & Monitoring¶
🎯 Goals¶
Give the External Configuration Server (ECS) complete, actionable visibility over read/write paths and propagation so teams can detect, debug, and prevent issues fast.
- Metrics: SLOs/SLA tracking for resolve latency, availability, cache hit ratio, event lag.
- Tracing: End-to-end spans across client → ECS → cache/DB → Key Vault → bus.
- Logging: Structured, privacy-safe logs with correlation.
- Audit Trails: Immutable evidence of who changed what, when, where, and why.
- Dashboards & Alerts: Grafana/App Insights views and guardrail alerts wired to on-call.
🔭 Signal Model (four pillars)¶
| Pillar | Purpose | Where |
|---|---|---|
| Metrics | SLOs, trends, capacity | Prometheus/App Insights Metrics |
| Traces | Causality & latency breakdown | OpenTelemetry → OTLP exporter |
| Logs | Forensic details & exceptions | Serilog → ELK/App Insights |
| Audit | Compliance-grade change evidence | Append-only store + query view |
🧩 Telemetry Architecture¶
flowchart LR
Client[(Service/SDK)] --OTEL ctx--> API[Config API]
API --> Resolver
API --> AdminAPI
Resolver --> Redis[(Redis)]
Resolver --> SQL[(SQL)]
Resolver --> KV[(Key Vault)]
AdminAPI --> Bus[[Event Bus]]
API & AdminAPI --> OTEL[OpenTelemetry Exporter]
OTEL --> Prom[Prometheus]
OTEL --> AI[App Insights]
Logs[Serilog/JSON] --> ELK[(ELK/App Insights Logs)]
AdminAPI --> Audit[(Immutable Audit Store + Projections)]
📊 Metrics (names, labels, targets)¶
Service-level
- `ecs_resolve_requests_total{tenant,env,service}`
- `ecs_resolve_latency_ms{quantile=50|95|99, tenant, env}` → p95 ≤ 30 ms (hot), ≤ 150 ms (cold)
- `ecs_resolve_errors_total{code}`; `rate(…[5m])` < 0.5%
- `ecs_cache_hit_ratio{tenant}` → ≥ 0.85
- `ecs_event_fanout_lag_ms{topic}` → p95 < 2000 ms
- `ecs_stream_clients_gauge{node}` (active watchers)
- `ecs_write_requests_total{op=create|update|rollback}`
- `ecs_write_policy_violations_total{policyId, severity}`
Infra
- `redis_ops_total{cmd}`, `redis_ttl_seconds_bucket`
- `sql_query_latency_ms{query}`
- `kv_request_latency_ms{op=getSecret}` (sampled)
- `bus_publish_latency_ms{topic}`
SLO rollups
- `slo_availability_ratio` (read API) → ≥ 99.95%
- `slo_freshness_ratio` (watchers receive change < 3s) → ≥ 95%
Store as Prometheus counters/histograms; mirror key KPIs into App Insights for product owners.
🧵 Tracing (OpenTelemetry)¶
- Trace root: `ResolveConfig` (read) / `WriteConfig` (write)
- Spans:
  - `cache.get(prefix)` / `cache.set(prefix)`
  - `sql.query(hierarchy)` / `sql.upsert(bundle)`
  - `kv.resolve(secretRef)` (redacted attributes)
  - `bus.publish(ConfigChanged)`
- Trace context: `trace_id`, `span_id`, `tenantId`, `env`, `serviceId`, `editionId`, `etag`, `version`
- Baggage (lightweight): `request_class=hot|cold`, `cache_hit=true|false`
Sampling
- Default 10% for hot read path; 100% for errors/slow spans (`p95+`); 100% for writes.
🧾 Logging (structured & safe)¶
- Format: JSON, Serilog.
- Required fields: `ts`, `level`, `message`, `traceId`, `spanId`, `tenantId`, `env`, `serviceId`, `actor`, `operation`, `status`, `durationMs`
- Redaction: never log secret values; mask values with `"<redacted>"`; log key paths and hashes only.
- Error taxonomy: `ValidationError` (422), `PolicyViolation` (403), `ConcurrencyConflict` (409), `UpstreamTimeout` (502), `InternalFailure` (500).
Example (write path)
{
"ts":"2025-08-21T18:25:43Z",
"level":"Information",
"message":"Config bundle published",
"traceId":"3f1f…",
"tenantId":"t-abc",
"env":"prod",
"operation":"PublishConfig",
"bundleId":"b-77b2",
"version":42,
"keysTouched":128,
"policyViolations":0,
"etag":"W/\"d41d8c\""
}
🧑⚖️ Audit Trails (immutable & queryable)¶
- What is audited: All writes (create/update/delete/rollback), policy decisions, and admin reads of sensitive scopes.
- Record: `auditId`, `actor`, `actorType` (user|service), `action`, `scope` (global|edition|tenant|service|instance), `beforeHash`, `afterHash`, `reason`, `ip`, `userAgent`, `traceId`, `ts`.
- Storage:
  - Append-only (e.g., Blob with immutability policy / WORM).
  - Projection into SQL table(s) for timelines and Studio queries.
- Retention: 24 months (configurable by tenant plan).
- Export: Signed CSV/JSON with evidence chain (hash of file published to audit ledger topic).
📺 Dashboards (Grafana/App Insights)¶
Operations (NOC)
- Read latency (p50/p95/p99) by env & tenant
- Error rate by route
- Cache hit ratio
- Event fan-out lag
- Stream clients trend
SRE
- DB query time heatmap; Redis saturation; Key Vault throttling; Bus publish latency
- Top tenants by QPS; hotspot keys/prefixes
- SLA/SLO burn-down (error budget)
Product/Compliance
- Changes by tenant/edition over time
- Policy violations & severities
- Audit export status
🚨 Alerts (examples)¶
| Condition | Expression (PromQL-ish) | Action |
|---|---|---|
| p95 resolve latency > 60ms for 10m | `hist_quantile(0.95, rate(ecs_resolve_latency_ms_bucket[10m])) > 60` | Page SRE (P2) |
| Error rate > 1% for 5m | `rate(ecs_resolve_errors_total[5m]) / rate(ecs_resolve_requests_total[5m]) > 0.01` | Page SRE (P2) |
| Cache hit < 70% for 15m | `avg_over_time(ecs_cache_hit_ratio[15m]) < 0.7` | Ticket + Slack (P3) |
| Event lag p95 > 5s for 5m | `hist_quantile(0.95, rate(ecs_event_fanout_lag_ms_bucket[5m])) > 5000` | Page on-call (P2) |
| Audit write failures > 0 | `increase(ecs_audit_write_errors_total[5m]) > 0` | Page SRE (P1) |
🧪 Observability DoD (per feature)¶
- OTEL spans present with correct attributes and parentage.
- Metrics emitted with required labels (`tenantId`, `env`, `serviceId`).
- Logs redact PII/secret content; include correlation IDs.
- Audit entries created for all write operations with before/after hashes.
- Dashboards panels updated; alert rules validated in staging chaos runs.
🛠 Runbook Snippets¶
Trace a slow read
- Locate alert → pick `traceId`.
- In traces, check spans: `cache.get` (miss?) → `sql.query` (slow?) → `kv.resolve` (throttled?).
- If cache misses spike, check for an invalidation storm from a recent publish.
- Mitigate: raise TTL temporarily, throttle publishers, enable SWR.
Missing change in consumers
- Check `ecs_event_fanout_lag_ms` & outbox backlog.
- Verify subscriber count; look for DLQ on the topic.
- If lag high: scale Notifier/Streamer; apply backpressure config.
🔒 Privacy & Compliance¶
- PII/Secrets: never stored in metrics/traces/logs; only hashes and key paths permitted.
- Data residency: metrics/logs stored regionally per tenant policy where required.
- Access: dashboards segmented by role; audit exports require elevated scope and MFA.
✅ Outcome¶
ECS now has operational telemetry, forensic visibility, and compliance-grade auditability. With dashboards and alerts, issues are caught early and traced precisely—supporting our SLOs and tenant commitments.
⚡ Eventing & Change Propagation¶
How ECS publishes configuration changes and guarantees safe, observable fan‑out across the factory.
🎯 Goals¶
- Make every configuration change a first‑class event (CloudEvents) with tenant/edition scope.
- Provide low‑latency fan‑out to thousands of services via Azure Service Bus (default) and Kafka (optional).
- Guarantee at‑least‑once delivery with consumer idempotency, retries, DLQ, and replay.
- Support versioned configurations (optimistic concurrency, ETags, semantic versions, snapshots, and deltas).
- Preserve traceability: `traceId`, `changeSetId`, `configVersion` on every hop via OpenTelemetry.
🧩 Event Model¶
Event Envelope (CloudEvents 1.0)¶
{
"specversion": "1.0",
"type": "com.connectsoft.ecs.ConfigurationChanged",
"source": "/ecs/tenants/{tenantId}/namespaces/{ns}/keys/{configKey}",
"id": "evt_{changeSetId}",
"time": "2025-08-21T11:25:43.123Z",
"datacontenttype": "application/json",
"subject": "tenant:{tenantId}|edition:{edition}|env:{environment}",
"traceparent": "00-{traceId}-{spanId}-01",
"dataschema": "https://schemas.connectsoft.dev/ecs/v1/events/configuration-changed.json",
"data": {
"tenantId": "t-001",
"edition": "enterprise",
"environment": "prod",
"namespace": "payments",
"configKey": "payment.retryPolicy",
"changeSetId": "cset_7f2a7c",
"configVersion": "2025.08.21+00023",
"operation": "Upsert", // Upsert | Delete | Patch
"payloadType": "Delta", // Snapshot | Delta
"etag": "W/\"b9-1e9b\"",
"checksum": "sha256:3be2e...",
"previousVersion": "2025.08.21+00022",
"changedBy": "user:alice@connectsoft.ai",
"reason": "Increase retries for PSP flakiness",
"ttlSeconds": 600
}
}
Why CloudEvents? Uniform schema for REST webhooks, Service Bus topics, Kafka topics, and gRPC streaming — minimizing adapters.
🛰️ Transports & Channels¶
| Transport | ECS Role | Default Topology | Notes |
|---|---|---|---|
| Azure Service Bus | Primary | Topic `ecs.config.changed.v1` → subscriptions per consumer group (service) | Delivery & retry semantics built‑in; rules for tenant/edition filters |
| Kafka | Optional | Topic `ecs.config.changed.v1` with tenant/edition partitions | High‑throughput; consumer offset storage |
| gRPC Server Streaming | Optional | `WatchConfig()` per (tenantId, namespace, selector) | For in‑cluster low‑latency watchers |
| Webhooks | Optional | Signed HTTP POST with CloudEvents envelope | For third‑party SaaS integrations |
Topic Partitioning
- Service Bus: use subscriptions with correlation filters on `tenantId`, `edition`, `namespace`, `configKey`.
- Kafka: partition by `hash(tenantId|namespace)` for locality; key = `tenantId|namespace|configKey`.
🔄 Publisher Flow (ECS)¶
sequenceDiagram
participant Client as Studio/API Client
participant ECS as ECS API
participant OUTBOX as ECS Outbox
participant BUS as Service Bus / Kafka
participant OTel as OTEL Exporter
Client->>ECS: PUT /configs/{tenant}/{ns}/{key} (If-Match: etag)
ECS->>ECS: Validate, authorize, persist (version + etag)
ECS->>OUTBOX: Append Outbox record (changeSetId, payload)
ECS->>OTel: Span: ecs.config.write
OUTBOX-->>BUS: Publish CloudEvent
BUS-->>Consumers: Deliver event to subscribers (at-least-once)
- Outbox pattern ensures atomic persistence + event publish (transactional outbox + background dispatcher).
- Idempotent publish via `(changeSetId, etag)` de‑duplication at the dispatcher.
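A bare-bones sketch of that dispatcher loop; the `IOutboxStore` and `IBusPublisher` abstractions are assumptions, and real retry, batching, and circuit-breaking are omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Outbox record persisted in the same transaction as the config version.
public sealed record OutboxRecord(Guid Id, string ChangeSetId, string ETag, string CloudEventJson);

public interface IOutboxStore
{
    Task<IReadOnlyList<OutboxRecord>> PendingAsync(int batchSize, CancellationToken ct);
    Task<bool> AlreadyDispatchedAsync(string changeSetId, string etag, CancellationToken ct);
    Task MarkDispatchedAsync(Guid id, CancellationToken ct);
}

public interface IBusPublisher
{
    Task PublishAsync(string cloudEventJson, CancellationToken ct);
}

public sealed class OutboxDispatcher
{
    private readonly IOutboxStore _store;
    private readonly IBusPublisher _bus;

    public OutboxDispatcher(IOutboxStore store, IBusPublisher bus) =>
        (_store, _bus) = (store, bus);

    public async Task RunAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            foreach (var record in await _store.PendingAsync(100, ct))
            {
                // De-dup on (changeSetId, etag): a crash between publish and
                // mark-dispatched must not produce a second distinct event.
                if (await _store.AlreadyDispatchedAsync(record.ChangeSetId, record.ETag, ct))
                {
                    await _store.MarkDispatchedAsync(record.Id, ct);
                    continue;
                }
                await _bus.PublishAsync(record.CloudEventJson, ct);
                await _store.MarkDispatchedAsync(record.Id, ct);
            }
            await Task.Delay(TimeSpan.FromMilliseconds(250), ct);
        }
    }
}
```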
🧠 Consumer Contract (SDK)¶
All ECS client SDKs (.NET first) expose a uniform consumption model:
public interface IEcsChangeSubscriber
{
Task OnConfigurationChangedAsync(ConfigurationChangeEvent evt, CancellationToken ct);
}
// Usage:
await ecsSubscriber.SubscribeAsync(new EcsSubscription
{
TenantId = "t-001",
Namespace = "payments",
Selector = "payment.*", // glob / regex
StartingVersion = "2025.08.21+00020"
});
Consumer Responsibilities (enforced by SDK defaults; a minimal idempotent-apply sketch follows this list):
- Idempotency: persist `lastAppliedVersion` per `(tenantId, namespace, configKey)`; skip if `evt.configVersion <= lastApplied`.
- Transactional Apply: update local cache + notify live options atomically; on failure, retry with exponential backoff.
- Rehydration: on startup, call `GET /snapshot?...since=lastAppliedVersion`, then process backlog from the bus.
- Backpressure: apply bounded queues; expose `ecs_consumer_backlog_size`.
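The idempotent-apply rule above, as a minimal sketch with illustrative store and event types; versions compare ordinally, which assumes the zero-padded monotonic format shown elsewhere (e.g., `2025.08.21+00023`).

```csharp
using System.Threading;
using System.Threading.Tasks;

public interface IVersionStore
{
    Task<string?> GetLastAppliedAsync(string key, CancellationToken ct);
    Task SetLastAppliedAsync(string key, string version, CancellationToken ct);
}

public sealed record ConfigurationChangeEvent(
    string TenantId, string Namespace, string ConfigKey, string ConfigVersion);

public sealed class IdempotentApplier
{
    private readonly IVersionStore _versions; // persists lastAppliedVersion per key

    public IdempotentApplier(IVersionStore versions) => _versions = versions;

    public async Task<bool> TryApplyAsync(ConfigurationChangeEvent evt, CancellationToken ct)
    {
        var key = $"{evt.TenantId}|{evt.Namespace}|{evt.ConfigKey}";
        var last = await _versions.GetLastAppliedAsync(key, ct);
        // Ordinal compare works for zero-padded monotonic versions.
        if (last is not null && string.CompareOrdinal(evt.ConfigVersion, last) <= 0)
            return false; // duplicate or out-of-order delivery: skip, no side effects

        await ApplyToLocalCacheAsync(evt, ct);
        await _versions.SetLastAppliedAsync(key, evt.ConfigVersion, ct);
        return true;
    }

    private Task ApplyToLocalCacheAsync(ConfigurationChangeEvent evt, CancellationToken ct)
        => Task.CompletedTask; // cache update + options rebind would go here
}
```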
🧷 Delivery Semantics & Reliability¶
At‑Least‑Once Delivery¶
- Service Bus: max delivery attempts (e.g., 10), dead‑letter after exhaustion.
- Kafka: enable.auto.commit=false, commit offsets after successful apply.
Idempotency Patterns¶
| Layer | Mechanism |
|---|---|
| Publisher | Outbox de‑dup on (changeSetId, etag) |
| Transport | Deterministic key for partitioning |
| Consumer | lastAppliedVersion store + checksum verification |
Retries¶
- Publisher: dispatcher retries with jitter, circuit‑breakers to BUS.
- Consumer: SDK retries transient errors; poison events → DLQ with dead‑letter reason (`checksum_mismatch`, `authorization_revoked`, `schema_incompatible`).
Replay¶
- Replay API: `GET /events/stream?tenantId=...&fromVersion=...&toVersion=...` (signed, time‑boxed).
- Use cases: disaster recovery, blue/green cutover, warm caches.
🧱 Versioning Strategy¶
| Concept | Purpose | Implementation |
|---|---|---|
| ETag / If‑Match | Optimistic concurrency | W/"b9-1e9b" round‑trips on write |
| Monotonic Version | Ordering | yyyy.MM.dd+build (e.g., 2025.08.21+00023) |
| Semantic Version (optional) | Compatibility signals | 1.4.0 attached as configSemVer |
| Snapshots | Full state at point‑in‑time | GET /snapshot?tenantId&namespace |
| Deltas | Network efficiency | Event payloadType: Delta with patch ops (RFC6902 JSON Patch) |
| Schema Evolution | Safe change | Avro/JSON Schema; dataschema URI versioned; compat checks at publish |
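For `payloadType: Delta`, applying the RFC 6902 operations to a locally cached document can be done with `Microsoft.AspNetCore.JsonPatch`; the sketch below is one option, not necessarily the ECS SDK's mechanism.

```csharp
using System.Dynamic;
using Microsoft.AspNetCore.JsonPatch;
using Newtonsoft.Json;

public static class DeltaApplier
{
    // Deserialize the cached config as a dynamic object, then apply the
    // RFC 6902 operations carried in the ConfigurationChanged event.
    public static string ApplyDelta(string currentConfigJson, string jsonPatch)
    {
        var target = JsonConvert.DeserializeObject<ExpandoObject>(currentConfigJson)!;
        var patch = JsonConvert.DeserializeObject<JsonPatchDocument>(jsonPatch)!;
        patch.ApplyTo(target);
        return JsonConvert.SerializeObject(target);
    }
}
```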
Server Rules
- Breaking change detected? Require a `force=true&ack=...` header + elevated RBAC role + audit record.
- Automatically emit compat warning events when consumers with an old `consumerSchemaVersion` are detected (telemetry handshake).
🧳 Change Types¶
- `ConfigurationChanged` (delta/snapshot)
- `ConfigurationDeleted`
- `NamespacePolicyChanged` (RBAC/edition rules)
- `SecretRotated` (metadata only; values are out‑of‑band via Vault references)
- `BulkConfigurationChanged` (batch changes carry a list of keys and a single `changeSetId`)
🔐 Security & Integrity (Event Path)¶
- mTLS for gRPC; AMQP over TLS for Service Bus; SASL/SSL for Kafka.
- Claims in event metadata: `sub`, `roles`, `policyHash`.
- Signature: optional JWS detached signature of `data` for tamper detection (consumers validate against ECS JWKS).
- PII/Sensitive: events never include secret values — only Key Vault references & hashes.
🧠 In‑Process Hot Reload (Options Pattern)¶
- The ECS SDK plugs into the .NET `IOptionsMonitor<T>`, mapping config keys to strongly‑typed options.
- When an event arrives, the SDK rebinds options and triggers `OnChange` callbacks with debounce to prevent thrashing.
services.Configure<RetryPolicyOptions>(builder => builder
.BindFromEcs("payments", "payment.retryPolicy"));
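On the consuming side, reacting to a rebind goes through the standard `IOptionsMonitor<T>.OnChange` callback; a small sketch, where the options type and property are hypothetical:

```csharp
using System;
using Microsoft.Extensions.Options;

public sealed class RetryPolicyOptions
{
    public int MaxRetries { get; set; }
}

public sealed class RetryPolicyWatcher
{
    // Fires after the ECS SDK rebinds options from a ConfigurationChanged
    // event; burst debouncing happens upstream in the SDK.
    public RetryPolicyWatcher(IOptionsMonitor<RetryPolicyOptions> monitor)
    {
        monitor.OnChange(options =>
            Console.WriteLine($"Retry policy reloaded: maxRetries={options.MaxRetries}"));
    }
}
```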
📊 Observability for Eventing¶
- Spans: `ecs.publish`, `ecs.dispatch`, `ecs.consume`, `ecs.apply`.
- Metrics:
  - `ecs_events_published_total`, `ecs_events_consumed_total`
  - `ecs_consumer_lag_seconds` (per service/tenant)
  - `ecs_replay_requests_total`
  - `ecs_dropped_events_total` (with reason)
- Logs: structured with `changeSetId`, `configVersion`, `tenantId`, `namespace`, `configKey`.
🗺️ Routing & Filtering¶
Subscription Filters (Service Bus):
- SQL filters on `user.properties.tenantId`, `edition`, `namespace`, `configKey`.
- Rule examples:
  - `tenantId = 't-001' AND namespace = 'payments'`
  - `edition IN ('enterprise','pro') AND configKey LIKE 'feature.%'`
Kafka Consumers:
- Use predicate filter inside consumer callback (cheap check) before apply.
🧰 Failure Scenarios & Handling¶
| Scenario | Behavior |
|---|---|
| Consumer offline | On startup → snapshot + replay from last offset/version |
| Schema mismatch | Send to DLQ with schema_incompatible; emit CompatibilityAlert |
| Long‑running apply | SDK offloads apply to worker; ack only after commit; backpressure increases |
| Tenant revocation | ECS publishes AccessRevoked; SDK drops subscriptions & clears cache |
🧪 Test Matrix (Factory‑grade)¶
- Contract tests: CloudEvents schema validation; signature verification.
- Resilience tests: injected duplicates/out‑of‑order delivery; ensured idempotent apply.
- Load tests: 10k events/min, 3k consumers; measure consumer lag & mean apply latency.
- Chaos: bus outages, partial partitions; verify snapshot + replay convergence.
🧱 Reference Diagrams¶
Fan‑Out Topology¶
flowchart LR
ECS[ECS Outbox Dispatcher] -->|CloudEvents| ASB[(Service Bus Topic)]
ECS -->|CloudEvents| KAFKA[(Kafka Topic)]
ASB --> A1[Service A - payments]
ASB --> A2[Service B - billing]
KAFKA --> A3[Service C - gateway]
A1 --> Cache1[(Local Config Cache)]
A2 --> Cache2[(Local Config Cache)]
A3 --> Cache3[(Local Config Cache)]
Replay & Convergence¶
sequenceDiagram
participant Svc as Service Consumer
participant ECS as ECS API
participant Bus as Bus Topic
Svc->>ECS: GET /snapshot?since=2025.08.21+00020
ECS-->>Svc: Snapshot (Version=...22)
Svc->>Bus: Resume subscription (offset/version=...22)
Bus-->>Svc: Events (...23, ...24)
Svc->>Svc: Apply in order, idempotent
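Convergence relies on idempotent, in-order apply. A minimal sketch (names illustrative, single-threaded apply loop assumed) that tracks a high‑water mark per `(tenant, namespace)` and drops duplicates or stale versions:

using System;
using System.Collections.Generic;

// Replayed or duplicated events are ignored once their configVersion is at
// or below the recorded high-water mark, so replay converges deterministically.
public sealed class IdempotentApplier
{
    private readonly Dictionary<(string Tenant, string Ns), long> _applied = new();

    public bool TryApply(string tenant, string ns, long configVersion, Action apply)
    {
        var key = (tenant, ns);
        var last = _applied.TryGetValue(key, out var v) ? v : -1L;
        if (configVersion <= last)
            return false;              // duplicate or out-of-date: no side effects

        apply();                        // commit the change to the local cache/options
        _applied[key] = configVersion;  // advance the high-water mark
        return true;
    }
}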
📦 Deliverables¶
- ECS Outbox Dispatcher (.NET Worker) with ASB/Kafka providers.
- CloudEvents Contracts + JSON Schema v1 (`configuration-changed.json` etc.).
- .NET SDK: `SubscribeAsync`, `IOptionsMonitor` binding, idempotent store, snapshot + replay.
- gRPC Watch Service: `WatchConfig(WatchRequest) -> stream ConfigurationChangeEvent`.
- Ops Dashboards: consumer lag, publish rates, DLQ drilldowns.
- Conformance Tests: duplication, ordering, replay convergence.
🧱 Backlog → Azure DevOps (Epics/Features/Tasks)¶
Epic: ECS Eventing Backbone (ASB/Kafka)
- Feature: Outbox storage + dispatcher
  - Task: Outbox table & transactional write (NHibernate)
  - Task: Dispatcher with de‑dup by `(changeSetId, etag)`
  - Task: Retry policy & DLQ integration
- Feature: CloudEvents contracts & validators
  - Task: JSON Schema & contract tests
  - Task: JWS signing & JWKS rotation
- Feature: Service Bus topology as code
  - Task: IaC for topic, subscriptions, filters
  - Task: SLOs & alarms (publish/consume error rate)
Epic: ECS Consumer SDK (.NET)
- Feature: Subscription API + Options binding
  - Task: `IEcsChangeSubscriber` + host extensions
  - Task: Idempotency store (pluggable: memory/redis/sql)
  - Task: Snapshot & replay client
- Feature: Observability hooks
  - Task: OTEL spans & metrics
  - Task: Structured logs with `changeSetId`
Epic: gRPC Watch & Webhooks
- Feature: `WatchConfig` streaming service
  - Task: mTLS, auth interceptors, backpressure
- Feature: Webhook sender
  - Task: HMAC/JWS signing, retry with exponential backoff
Epic: Testing & Chaos
- Feature: Contract & resilience suite
  - Task: Duplicate/out‑of‑order injection tests
  - Task: Replay convergence E2E
  - Task: Load test harness & baseline SLOs
✅ Acceptance Criteria¶
- A config write results in a CloudEvent published to the bus within < 250 ms p50.
- Consumers that are offline converge using snapshot + replay with zero drift (checksums match).
- At‑least‑once guaranteed; duplicates handled without side effects (idempotent apply proven in tests).
- Versioning: monotonic `configVersion` increases are enforced; an ETag is required on updates.
- Telemetry shows publish/consume/lag metrics per tenant and service; DLQ entries carry actionable reasons.
🔭 Notes & Next Steps¶
- Align policy‑driven overrides with event filters (tenant/edition rules → subscription rules).
- Add a Config Rollback Event (`ConfigurationRolledBack`) with automatic replay to a target version.
- Prepare a multi‑region event replication plan (ASB geo‑disaster recovery / Kafka MirrorMaker).
ECS becomes a reactive, version‑aware, and verifiably reliable configuration backbone for the entire ConnectSoft ecosystem.
🚀 Caching & Performance¶
🎯 Goals¶
Design a caching and performance strategy that delivers sub‑30ms p95 resolve latency at factory scale, while preserving consistency, tenancy isolation, and deterministic resolution.
- In‑memory + distributed cache (Redis) with precise invalidation
- Snapshot lifecycle for fast cold‑start and replay
- Scalable read path (high QPS) and controlled write path (safe propagation)
🧱 Cache Architecture (multi‑tier)¶
| Tier | Scope | What it stores | TTL / Consistency | Purpose |
|---|---|---|---|---|
| L0 – Process cache | Per API pod | Resolved blobs by `(tenantId, env, namespace, selector)` + ETag | Short TTL (e.g., 3–10s) + SWR | Nanosecond access for hot prefixes |
| L1 – Redis | Cluster‑wide | Resolved blobs + prefix indexes; also recent snapshots | TTL 30–120s (per key) + explicit invalidation | Cross‑pod sharing, cuts DB pressure |
| L2 – DB | SQL/Cosmos | Authoritative bundles/versions | Strong read | Source of truth for cache misses |
Key prefixes
ecs:v1:resolved:{tenant}:{env}:{ns}:{selector} -> {etag, version, json}
ecs:v1:index:{tenant}:{env}:{ns} -> list of keys / etags
ecs:v1:snapshot:{tenant}:{env}:{ns}:{ver} -> frozen state (blob id/ref)
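Composing these keys deterministically matters for hit rate. A small, illustrative helper (class and method names are assumptions) that also applies selector compaction by trimming, lower‑casing, and sorting tags:

using System;
using System.Collections.Generic;
using System.Linq;

static class CacheKeys
{
    // Normalized selectors ensure two callers asking for the same logical
    // config land on the same Redis key.
    public static string NormalizeSelector(IEnumerable<string> tags) =>
        string.Join(",", tags.Select(t => t.Trim().ToLowerInvariant())
                             .OrderBy(t => t, StringComparer.Ordinal));

    public static string Resolved(string tenant, string env, string ns, IEnumerable<string> tags) =>
        $"ecs:v1:resolved:{tenant}:{env}:{ns}:{NormalizeSelector(tags)}";

    public static string Index(string tenant, string env, string ns) =>
        $"ecs:v1:index:{tenant}:{env}:{ns}";
}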
🔁 Read Path (hot/cold)¶
sequenceDiagram
participant S as Service SDK
participant API as ECS API
participant L0 as L0 Cache (memory)
participant R as Redis (L1)
participant DB as Store (SQL/Cosmos)
S->>API: Resolve(tenant, env, ns, selector, If-None-Match)
API->>L0: TryGet(etag)
alt Hit
L0-->>API: Resolved + etag
else Miss
API->>R: GET ecs:v1:resolved:...
alt Hit
R-->>API: Resolved + etag
API->>L0: Put
else Miss
API->>DB: Query bundles (G→E→T→S→tags)
API->>R: Set resolved + index
API->>L0: Put
end
end
API-->>S: 200/304 with etag
Techniques
- ETag + 304 to avoid payload transfer (a client sketch follows this list)
- SWR (stale‑while‑revalidate): serve slightly stale value while refreshing in background
- Selector compaction: normalize selectors (e.g., sorted tags) to maximize hit rate
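A hedged client-side sketch of the ETag + 304 technique; the `/resolve` route, query parameters, and class name are assumptions consistent with the sequence above:

using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public sealed class EcsResolveClient
{
    private readonly HttpClient _http;
    private string? _etag;
    private string? _cachedJson;

    public EcsResolveClient(HttpClient http) => _http = http;

    public async Task<string?> ResolveAsync(string tenant, string env, string ns, CancellationToken ct)
    {
        var request = new HttpRequestMessage(HttpMethod.Get,
            $"/resolve?tenantId={tenant}&env={env}&ns={ns}");
        if (_etag is not null)
            request.Headers.TryAddWithoutValidation("If-None-Match", _etag);

        using var response = await _http.SendAsync(request, ct);
        if (response.StatusCode == HttpStatusCode.NotModified)
            return _cachedJson; // 304: keep the locally cached resolved blob

        response.EnsureSuccessStatusCode();
        _etag = response.Headers.ETag?.Tag;
        _cachedJson = await response.Content.ReadAsStringAsync(ct);
        return _cachedJson;
    }
}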
🧨 Invalidation & Coherency¶
On publish:
- Write transaction commits new version
- Outbox → Event Bus emits `ConfigurationChanged`
- Cache invalidator:
  - Evicts `ecs:v1:resolved:*` keys whose prefix intersects the changed key/namespace
  - Bumps the namespace index to force L0 recalculation on the next read
Precision rules
- Maintain key→prefix map (stored in Redis) to target minimal eviction set
- Protect against stampedes with singleflight locks per key (a minimal sketch follows)
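A minimal per-key singleflight sketch (illustrative): concurrent callers for the same cache key share one in-flight resolve instead of stampeding the database.

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class SingleFlight<T>
{
    private readonly ConcurrentDictionary<string, Lazy<Task<T>>> _inFlight = new();

    public async Task<T> RunAsync(string key, Func<Task<T>> resolve)
    {
        // All concurrent callers for the same key observe the same Lazy, so
        // resolve runs exactly once per flight.
        var lazy = _inFlight.GetOrAdd(key,
            _ => new Lazy<Task<T>>(resolve, LazyThreadSafetyMode.ExecutionAndPublication));
        try
        {
            return await lazy.Value;
        }
        finally
        {
            _inFlight.TryRemove(key, out _); // allow a fresh resolve next time
        }
    }
}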
🧊 Snapshot Lifecycle¶
| Stage | Detail |
|---|---|
| Create | On schedule (e.g., hourly) or on demand; compute resolved state for (tenant, env, ns); store in Blob; index pointer in Redis ecs:v1:snapshot:* |
| Use | Cold start or replay → fetch latest snapshot pointer, hydrate L1/L0 quickly |
| Rotate | Keep last N (e.g., 72 hourly) per scope; delete older (configurable per edition) |
| Validate | Checksums of snapshot vs live resolution to detect drift |
| Promote | During incidents, mark a snapshot as fallback; read path returns snapshot when DB/Bus degraded |
API (read):
- `GET /snapshot?tenantId&env&ns[&version]` → blob ref + checksum
- The SDK can request a snapshot on boot, then subscribe to events
⚖️ Scaling Reads vs Writes¶
Reads (dominant):
- Scale API pods horizontally (HPA on RPS/latency)
- Redis clustering (hash slot spreading on `tenant|env|ns`)
- Hot‑tenant sharding: per‑tenant Redis db/index when necessary
- Compression (LZ4) for large resolved blobs to reduce network transfer
Writes (controlled):
- Gate through Admin API with If‑Match and idempotency keys
- Serialize heavy publish waves per namespace (per‑prefix semaphore)
- Outbox throughput scaling with workers; batch invalidations
🧪 Performance Targets¶
- p95 resolve latency: ≤ 30 ms (hot cache), ≤ 150 ms (cold)
- Cache hit ratio: ≥ 85% overall; ≥ 95% for top 20 namespaces
- Publish → visible (watchers): ≤ 3 s p95
- Redis ops: p50 < 2 ms; zero timeouts up to the 99.9th percentile
- DB read QPS reduction: > 90% vs no‑cache baseline
🧰 Protection & Resilience¶
- Thundering herd guard: per‑key singleflight + jittered backoff
- Adaptive TTLs: increase TTL when bus lag detected to protect DB
- Read‑degrade mode: serve last known snapshot on DB outage window
- Rate limiting: per‑tenant read QPS caps; publish rate caps
- Circuit breakers: around Redis/DB; fallback chain L0→snapshot
👩🔧 Tuning Playbook¶
- Low hit ratio?
  - Check selector cardinality; introduce prefix bucketing or coarser namespaces
  - Enable response compression for large maps
- High Redis CPU?
  - Increase sharding; switch to key hashing on `tenant|ns`
  - Raise the L0 TTL modestly (5→10s) for the hottest endpoints
- Stampedes after publish?
  - Stagger invalidation; use incremental delta apply for L0 refresh
  - Limit concurrent cold resolvers with a token bucket
🗺️ Diagram — Cache & Snapshot Topology¶
flowchart LR
API[Config API]:::svc -- L0 get/set --> L0[(Process Cache)]
API -- get/set --> R[(Redis L1 Cluster)]
API -- miss --> DB[(SQL/Cosmos)]
Snap[Snapshotter]:::wrk -- resolved->blob --> Blob[(Snapshots)]
Snap -- indexes --> R
Pub[Publisher]:::wrk -- events --> Bus[(Event Bus)]
Inv[Invalidator]:::wrk -- targeted evict --> R
R -- hydrate --> L0
classDef svc fill:#E0F2FE,stroke:#38BDF8;
classDef wrk fill:#FFE4E6,stroke:#FB7185;
✅ Acceptance Criteria¶
- AC‑1: Writes invalidate only affected prefixes; unrelated namespaces retain ≥ 95% hit ratio.
- AC‑2: Under 5k RPS sustained, p95 read latency ≤ 30 ms (hot), error rate < 0.5%.
- AC‑3: During DB outage drills (5 min), API serves from snapshots with no 5xx spikes, and emits degraded mode metric.
- AC‑4: Snapshot/restore produces byte‑identical resolved maps to live resolution for the same version.
- AC‑5: Cache stampede tests show singleflight success (no >3x concurrent DB hits for same key).
🔧 Backlog → Azure DevOps¶
Epic: L0/L1 Cache & Invalidation
- Feature: L0 in‑proc cache + SWR
- Feature: Redis schema & prefix strategy
- Feature: Precision invalidation worker
- Tasks: key maps, singleflight, compression, TTL policy
Epic: Snapshot Lifecycle
- Feature: Snapshotter worker & checksums
- Feature: Fallback mode & API
- Tasks: Blob storage model, retention policy, drift checker
Epic: Performance & Scale
- Load test harness (Locust/K6)
- Cache hit & latency dashboards
- Auto‑tuning hooks (adaptive TTL, backoff)
📌 Notes¶
- Consider optional edge cache (sidecar or local Redis) for ultra‑low latency per node.
- For very large tenants, introduce namespace partitioning and hierarchical snapshots (per sub‑namespace).
ECS’s read path becomes fast, predictable, and resilient, and the write path safely fans out with precise cache coherence.
🛡️ Resiliency & Fault Tolerance¶
Goal: ensure ECS continues to serve safe, correct, and timely configuration under partial failures, regional incidents, and dependency degradation — without violating tenant isolation, security policies, or observability guarantees.
🎯 Objectives¶
- Survive transient and regional faults with graceful degradation (read-mostly mode).
- Protect dependent services via bounded retries, circuit breakers, bulkheads.
- Fail fast on unsafe paths; serve stale-but-safe configs when allowed by policy.
- Prove resilience with chaos drills, SLAs/SLOs, and automated failover runbooks.
🧨 Failure Model (ECS)¶
| Surface | Typical Faults | Primary Mitigations |
|---|---|---|
| Read path (GET config) | Redis miss, origin DB latency, network partitions | Local snapshot, Redis cluster, hedged reads, timeouts |
| Write path (PUT/PATCH) | DB leader loss, quorum fail, version conflict | Optimistic concurrency, idempotent writes, queue-backed commit |
| Eventing (change notifications) | Service Bus/Kafka outage, consumer lag | Outbox + retry, backfill from ledger, idempotent subscribers |
| AuthN/Z | IdP outage, token validation latency | Token cache, STS fallback keys, mTLS pinning |
| Secrets | Key Vault throttling | Per-tenant secret cache, exponential backoff, jitter |
| Region | AZ/region failure | Active–active reads, active–standby writes, DNS/AFD failover |
🧱 Resilience Architecture¶
flowchart LR
Client((Service)) -->|gRPC/REST| Edge[ECS API Gateway]
Edge --> CB{Circuit\nBreaker}
CB --> L1[(In-Proc Snapshot Cache)]
L1 -->|miss| L2[(Redis Cluster)]
L2 -->|miss| Origin[(Cosmos DB / Postgres Primary)]
Origin --> Ledger[(Change Ledger / Event Outbox)]
Ledger --> Bus[(Service Bus / Kafka)]
Bus --> Subscribers[[Microservices/Agents]]
subgraph Regional Pair
Origin---OriginReplica[(Geo-Replicated Read)]
Redis[(Redis)]---RedisReplica[(Geo-Replica)]
end
- Reads: L1 (in-proc) → L2 (Redis) → Origin (Cosmos/PG). Serve stale snapshot within TTL if origin slow.
- Writes: Single-writer per partition (tenant/namespace). Outbox pattern persists change, publishes event.
- Change propagation: At-least-once from outbox → bus; consumers are idempotent (version vector, ETag).
🗂️ Fallback & Degradation Policies¶
| Scenario | ECS Behavior | Notes |
|---|---|---|
| Origin slow (>p95 threshold) | Stale-OK read from Redis or local snapshot if `policy.allowStale=true` and TTL valid | Emit `degraded_mode=true` metric, add `X-Config-Stale-Age` |
| Redis unavailable | Skip L2, fall back to local snapshot; increase hedged read to origin | Shorter timeouts to avoid threadpool exhaustion |
| Bus outage | Writes commit to origin + outbox table; background publisher replays when bus recovers | Consumers dedupe by (tenantId, key, version) |
| Tenant-scoped incident | Trip per-tenant breaker; serve last good tenant snapshot; block writes for tenant | Prevents blast radius |
| Global auth outage | Honor cached token validations (bounded TTL); keep mTLS; reduce JWK rotations temporarily | Never bypass RBAC |
⚙️ .NET Resilience Profile (Polly)¶
// Resilience profile for calls to the ECS origin: a hard 300 ms timeout,
// three retries with exponential backoff + jitter, and a circuit breaker
// that opens for 30 s after five consecutive handled failures.
services.AddHttpClient("EcsOrigin")
    .AddPolicyHandler(Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromMilliseconds(300)))
    .AddPolicyHandler(Policy<HttpResponseMessage>.Handle<Exception>()
        .WaitAndRetryAsync(3, retry => TimeSpan.FromMilliseconds(50 * Math.Pow(2, retry)) + Jitter()))
    .AddPolicyHandler(Policy<HttpResponseMessage>.Handle<Exception>()
        .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)));

// Random jitter (0–40 ms) decorrelates retry storms across replicas.
static TimeSpan Jitter() => TimeSpan.FromMilliseconds(Random.Shared.Next(0, 40));
- Bulkheads per endpoint to cap concurrent origin calls (sketch below).
- Hedged requests (optional): fire a second read to replica after p95 latency.
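A minimal bulkhead sketch using Polly's bulkhead policy; the limits are illustrative, not tuned values. It can be attached per named client via `.AddPolicyHandler(bulkhead)` alongside the handlers above.

using System.Net.Http;
using Polly;

// Caps concurrency for one endpoint: at most 16 in-flight origin calls,
// with up to 32 queued before additional callers are rejected fast.
var bulkhead = Policy.BulkheadAsync<HttpResponseMessage>(
    maxParallelization: 16,
    maxQueuingActions: 32);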
🧾 Config Snapshot Lifecycle¶
- Acquire: On a successful read from origin, build `ConfigSnapshot { tenantId, ns, version, etag, data, capturedAt }` (typed sketch below).
- Store: L1 (memory) + L2 (Redis, key: `cfg:{tenant}:{ns}:{version}`) with TTL & size guard.
- Serve: Prefer the latest `version`; if origin fails, use the highest valid snapshot within the policy TTL.
- Invalidate: On change event → purge L1, update L2; add `etag` to prevent stale overwrite.
- Audit: Record snapshot usage (fresh vs stale) with `traceId`, `tenantId`.
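A hedged sketch of the snapshot shape named in the Acquire step; property types are assumptions, and `Data` could equally be a raw JSON blob.

using System;
using System.Collections.Generic;

// Immutable snapshot of one (tenant, namespace) at a given version.
public sealed record ConfigSnapshot(
    string TenantId,
    string Ns,
    long Version,
    string Etag,
    IReadOnlyDictionary<string, string> Data,
    DateTimeOffset CapturedAt);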
🔁 Idempotency & Versioning¶
- Writes: Require `If-Match: <ETag>` or `version`. On conflict → `409` with a pointer to the latest version (client sketch below).
- Events: Include `(aggregateId, version)`; consumers ignore events already applied.
- Replay: On recovery, the publisher replays the outbox by `createdAt` and `notified=false`.
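A hedged client sketch of the optimistic-concurrency write using `If-Match`, surfacing `409` conflicts to the caller; the `/config/{key}` route and class name are assumptions.

using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class EcsWriter
{
    public static async Task<bool> TryUpdateAsync(
        HttpClient http, string key, string etag, string json, CancellationToken ct)
    {
        var request = new HttpRequestMessage(HttpMethod.Put, $"/config/{key}")
        {
            Content = new StringContent(json, Encoding.UTF8, "application/json")
        };
        request.Headers.TryAddWithoutValidation("If-Match", etag);

        using var response = await http.SendAsync(request, ct);
        if (response.StatusCode == HttpStatusCode.Conflict)
            return false; // 409: re-read the latest version + ETag, then retry with backoff

        response.EnsureSuccessStatusCode();
        return true;
    }
}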
🧪 Chaos Testing Scenarios¶
| Area | Fault Injection | Expected Outcome |
|---|---|---|
| Cache | Kill Redis primary; network partition | Stale reads from L1; perf dip within SLO; no 5xx spikes |
| Origin | 500/timeout storm; leader failover | Stale-OK reads; write breaker trips → read-only mode banners |
| Bus | Drop topic; throttle | Outbox backlog grows; no lost events; catch-up within RTO |
| Auth | JWK endpoint down | Cached keys honored; no auth bypass |
| Region | Simulate regional fail | Traffic manager to secondary reads; write drain & promote in ≤ RTO |
Schedule GameDays monthly; record hypotheses, metrics, and remediations.
📈 SLOs & SLA Envelope (Proposed)¶
| Dimension | Target | Notes |
|---|---|---|
| Availability (reads) | 99.99% monthly | With stale-OK serving |
| Availability (writes) | 99.9% monthly | May block during conflict/region failover |
| p95 read latency | ≤ 20ms (cache hit), ≤ 120ms (origin) | Per-tenant |
| Event propagation | ≤ 3s p95 | From commit to first consumer delivery |
| RPO | ≤ 30s | Config ledger + cross-region replication |
| RTO (regional) | ≤ 15 min | Promote secondary; re-point writers |
| Error budget (reads) | 4m 22s / month | Tracked in SRE dashboard |
Contracts: SLA doc states credit policy if monthly availability below target; per-tenant SLOs are observable (dashboards & reports).
🌍 Failover Architecture¶
- Reads: Active–Active (multi-region replicas for origin + Redis replica); DNS/AFD/Envoy locality.
- Writes: Active–Standby per partition (tenant/namespace). Single-writer enforced by lease (Cosmos) or advisory lock (PG).
- Data:
  - Cosmos DB: multi-region write (optional) with conflict resolver = highest version.
  - PostgreSQL: logical replication; promote with pg_auto_failover; ensure write fences during switchover.
- Secrets: Key Vault geo-redundant, soft-delete + purge protection; client caches secrets with TTL.
sequenceDiagram
participant Client
participant ECS
participant Redis_Primary
participant DB_Primary
participant DB_Secondary
Client->>ECS: GET /config
ECS->>Redis_Primary: TryGet
alt miss/timeout
ECS->>DB_Primary: Read
DB_Primary--xECS: timeout
ECS->>DB_Secondary: Hedged Read
ECS-->>Client: 200 (stale-ok), X-Config-Stale-Age
else hit
ECS-->>Client: 200 (fresh)
end
🧩 Read/Write Mode Matrix¶
| Mode | Reads | Writes | Trigger | Recovery Signal |
|---|---|---|---|---|
| Normal | Fresh | Allowed | Healthy deps | N/A |
| Degraded | Stale-OK | Allowed | Origin latency > threshold | Latency normalizes |
| Read-Only | Stale-OK | Blocked | Origin unavailable; version conflicts | DB healthy + catch-up complete |
| Failover | Fresh via secondary | Allowed after promote | Region incident | Health gates + leader elected |
🔔 Telemetry & Alerts (Resilience Signals)¶
- Counters: `config_stale_served_total{tenantId}`, `outbox_backlog_size`, `breaker_open_total{scope}`.
- Gauges: `snapshot_age_seconds`, `publish_lag_seconds`.
- SLO: burn-rate alerts at 2%/1h and 5%/6h on read availability.
- Events: `DegradedModeEntered`, `ReadOnlyModeEntered`, `FailoverStarted`, `FailoverCompleted`.
🧰 Ops Runbooks (abridged)¶
- Degraded Mode: Identify hotspot tenants → increase per-tenant TTL; ensure Redis is healthy; verify breaker state.
- Write Conflicts: Inspect conflict keys; return `409` with the latest ETag; advise client retry with backoff.
- Bus Backlog: Scale publishers/partitions; verify outbox replay; validate consumer idempotency.
- Regional Failover: Freeze writers; promote the secondary; run consistency checks; unfreeze by partition.
🔐 Safety & Compliance¶
- Stale reads respect edition/tenant policy overlays; never cross-tenant.
- No secret material in snapshots; secrets are references resolved at call time with cache TTL.
- All degraded/read-only decisions are audited (`who`, `why`, `since`).
✅ Deliverables¶
- Resilience ADRs: cache-first reads, outbox + at-least-once, active–standby writes.
- .NET resilience profile (Polly) and configuration schema (typed options sketch below):
  `resilience: { allowStale: true, staleTtlSec: 120, readTimeoutMs: 300, retries: 3, breaker: { failCount: 5, breakSec: 30 } }`
- Chaos plan & scripts (fault maps), SLO dashboards, runbooks.
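A hedged sketch of strongly typed options mirroring that schema; the class and property names are assumptions, intended to be bound from a `resilience` configuration section.

// Mirrors the resilience schema above for use with the Options pattern.
public sealed class ResilienceOptions
{
    public bool AllowStale { get; set; } = true;
    public int StaleTtlSec { get; set; } = 120;
    public int ReadTimeoutMs { get; set; } = 300;
    public int Retries { get; set; } = 3;
    public BreakerOptions Breaker { get; set; } = new();

    public sealed class BreakerOptions
    {
        public int FailCount { get; set; } = 5;
        public int BreakSec { get; set; } = 30;
    }
}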
🔜 Epics / Azure DevOps (seed)¶
- ECS-RES-01: Implement snapshot cache (L1/L2) with TTL & audit.
- ECS-RES-02: Outbox publisher + idempotent consumer SDK.
- ECS-RES-03: Circuit breaker/bulkhead/hedged reads middleware.
- ECS-RES-04: Degraded/Read-only mode controller + health endpoints.
- ECS-RES-05: Chaos suite & monthly GameDay pipeline.
- ECS-RES-06: Multi-region failover automation + runbooks.
- ECS-RES-07: SLO burn-rate alerts & resilience dashboard.
With these guards, ECS remains predictable under pressure, protecting downstream services while providing observability and control to ops and agents.
🔄 Deployment, CI/CD & DevOps Enablement¶
🎯 Goals¶
Provide a repeatable, secure, observable path to deliver ECS across dev → test → staging → prod, with immutable packaging, automated rollouts/rollbacks, and zero‑downtime upgrades.
📦 Packaging (Artifacts & IaC)¶
| Artifact Type | Tooling | Purpose | Notes |
|---|---|---|---|
| Container Image | Dockerfile → ACR | Runtime unit for API/Workers | Signed, SBOM attached, Trivy‑scanned |
| K8s Charts | Helm | ECS components (API, Streamer, Workers) | Values overlays per env/tenant |
| Azure Infra | Bicep / Terraform / Pulumi | AKS, ACR, Key Vault, SQL/PG, Redis, Service Bus | Policy‑as‑code (Azure Policy/OPA) |
| Migrations | EF Core / Flyway | DB schema & data migrations | Forward‑only + rollback plan |
| Release Bundle | .tar.gz | Helm chart + values + migration scripts + release notes | SemVer tag (e.g., `ecs-1.6.3`) |
Helm values overlays
/deploy/helm/values-dev.yaml
/deploy/helm/values-test.yaml
/deploy/helm/values-staging.yaml
/deploy/helm/values-prod.yaml
🔁 Dev/Test/Prod Rollout¶
flowchart TD
A[Commit: src + IaC] --> B[CI: Build & Scan]
B --> C[Push: ACR + Chart Repo]
C --> D[CD: Dev Deploy]
D --> E[Smoke + Contract Tests]
E --> F[Test/Staging Deploy]
F --> G[Perf & Chaos Gates]
G --> H[Manual Approval]
H --> I[Prod: Blue/Green or Canary]
I --> J[Post‑deploy Verification + Auto Rollback if failing]
J --> K[Tag + Audit + Release Notes]
Rollout strategies
- Blue/Green for API pods (Envoy/AGIC switch on health).
- Canary (e.g., 10% → 50% → 100%) guarded by SLO checks (p95 latency, error rate, consumer lag).
- KEDA scales workers by outbox depth / schedule (snapshotter).
🧬 Versioning & Migration Automation¶
Semantic Versioning
- `MAJOR.MINOR.PATCH` for ECS; backward‑compatible APIs within MINOR.
- Configuration schema versions tracked in the repo (JSON Schema URIs).
DB Migrations
- EF Core/Flyway run as pre‑install Helm hooks:
  - `hook: pre-install, pre-upgrade` → `migration-job`
  - Idempotent; write a migration ledger to SQL.
- On failure: abort upgrade, auto‑rollback to previous chart, publish incident event.
Config Schema Evolution
- Contracts validated at publish time; compatibility checks (a breaking change requires `force=true` + elevated RBAC).
- Data backfills via a worker job (post‑upgrade hook), observable via `ecs_backfill_pending`.
Zero‑Downtime Policy
- API pods roll with `maxUnavailable=0`.
- Sticky read cache retained; consumers use ETag + watchers to avoid reload storms.
🧪 CI/CD Pipelines (Azure DevOps YAML — excerpt)¶
stages:
- stage: CI
  jobs:
  - job: build
    steps:
    - task: Docker@2
      inputs: { command: buildAndPush, repository: ecs/api, tags: $(Build.BuildNumber) }
    - task: HelmInstaller@1
    - script: helm lint deploy/helm/ecs
    - task: TrivyScan@1
    - task: PublishBuildArtifacts@1

- stage: CD_Dev
  dependsOn: CI
  jobs:
  - deployment: dev
    environment: ecs-dev
    strategy:
      runOnce:
        deploy:
          steps:
          - script: helm upgrade --install ecs deploy/helm/ecs -f deploy/helm/values-dev.yaml --set image.tag=$(Build.BuildNumber)
          - script: ./ops/smoke.sh https://ecs-dev.internal

- stage: CD_Prod
  dependsOn: CD_Staging
  jobs:
  - deployment: prod
    # Manual approval is configured as a check on the ecs-prod environment.
    environment: ecs-prod
    strategy:
      canary:
        increments: [10, 50, 100]
        deploy:
          steps:
          - script: helm upgrade --install ecs ...
          - script: ./ops/verify-slo.sh --latency-p95 30 --errors 0.5
        on:
          failure:
            steps:
            - script: ./ops/rollback.sh
🔒 Security & Policy Gates (shift‑left)¶
- Image & chart scanning (Trivy/Grype) — block on HIGH/CRITICAL.
- IaC checks (Checkov/OPA) — deny insecure networking, public KV, missing TLS.
- Secrets: all values sourced from Key Vault references; no secrets in values files.
- Sign & verify: Cosign images; Helm chart provenance (`.prov`).
📊 Release Observability¶
- Pipeline emits deployment events with `traceId`, version, and change set.
- Golden signals checked during canary: `ecs_resolve_latency_ms` p95, 5xx rate, `ecs_event_fanout_lag_ms`.
- Auto‑rollback if thresholds are breached over a 5–10 minute window.
- Release notes generated from commits + PR labels (features/fixes/breaking).
🗂️ Repo & Environment Layout (suggested)¶
/src/ecs-api
/src/ecs-workers
/sdk/dotnet
/deploy/helm/ecs
/deploy/bicep|tf|pulumi
/ops/scripts (smoke, verify-slo, rollback, snapshot-restore)
/migrations (db, schema)
/docs/hld
✅ Acceptance Criteria¶
- AC‑1: One‑button pipeline promotes dev → test → staging → prod with signed artifacts and policy gates.
- AC‑2: Zero‑downtime upgrade verified (no 5xx spikes > 0.5% and p95 ≤ targets during rollout).
- AC‑3: Failed canary auto‑rolls back to last healthy release; audit entries recorded.
- AC‑4: Migrations are idempotent, logged, and observable; rollback plan documented and tested.
- AC‑5: All secrets consumed via Key Vault references; pipeline blocks on any hardcoded secret detection.
🔧 Backlog → Azure DevOps¶
Epic: Packaging & IaC
- Feature: Helm chart + env overlays
- Feature: Bicep/Terraform/Pulumi modules for AKS, Redis, SQL/PG, ASB, KV
- Task: Chart provenance + image signing
Epic: Pipelines & Gates
- Feature: CI build/scan/sign
- Feature: CD with canary/blue‑green + auto‑rollback
- Feature: SLO verifiers & deployment events
Epic: Migrations & Schema Evolution
- Feature: Migration runner (hooks) + ledger
- Feature: Config schema compat checker
- Feature: Backfill job + observability
Epic: Ops Tooling
- Feature: Smoke/verify/rollback scripts
- Feature: Release notes generator
- Feature: Disaster‑recovery playbook (snapshot restore)