
🛠 External Configuration Server

🎯 Vision

The External Configuration Server (ECS) will serve as a centralized, multi-tenant, secure, and dynamic configuration service for all microservices in the ConnectSoft AI Factory ecosystem. It allows runtime reconfiguration without redeployment, ensures tenant isolation, and enforces policy-driven overrides across editions and environments.

ECS transforms configuration from a static, file-driven activity into a policy-aware, observable, and externalized runtime service.


🌍 Problem Context

Traditional microservices suffer from:

  • 📄 Config files baked into deployments (requiring restarts on change)
  • 🌐 Scattered config sources (env vars, files, feature flags, secrets)
  • 📊 No central observability (difficult to trace config changes)
  • 🧩 Multi-tenant conflict (different tenants require different policies)

In a factory-scale system with 3000+ agents and microservices, this creates risk, inconsistency, and operational drag.


🧩 Purpose of ECS

ECS is designed to:

| Objective | ECS Responsibility |
| --- | --- |
| Centralize configuration | Single source of truth for runtime and environment variables |
| Enable dynamic updates | Push config changes at runtime without redeployment |
| Enforce multi-tenancy & editions | Per-tenant and per-edition overrides with RBAC enforcement |
| Ensure observability | Full audit trail, metrics, and tracing of configuration requests & mutations |
| Integrate with ecosystem | Provide APIs, gRPC, and event-driven notifications for microservices and agents |

🏗 Position in Architecture

ECS acts as an infrastructure microservice in the ConnectSoft ecosystem. It is consumed by:

  • All domain microservices → for runtime configs
  • Architect Agents → for blueprint-level config references
  • DevOps Orchestrators → for rollout, environment management, and secrets injection
  • QA Cluster → for validating edition/role-specific behavior under config variants
  • Studio UI → for editing and visualizing tenant-specific config

🧬 High-Level Diagram – ECS in Factory

flowchart TD
    subgraph Factory Microservices
        A1[Service A] --> ECS
        A2[Service B] --> ECS
        A3[Service C] --> ECS
    end

    subgraph Core Infrastructure
        ECS[External Configuration Server]
        Vault[(Secrets / Key Vault)]
        Bus[(Event Bus)]
    end

    subgraph Agents
        ArchitectAgent --> ECS
        DevOpsAgent --> ECS
        QAAgent --> ECS
        StudioUI --> ECS
    end

    ECS --> Vault
    ECS --> Bus

📘 Principles

  1. DDD-Centric: Config treated as a domain object with aggregates (TenantConfig, EditionConfig).
  2. Clean Architecture: Ports/adapters for REST, gRPC, Event Bus.
  3. Event-Driven: Publish config changes as CloudEvents.
  4. Observability-First: All reads/writes are traceable with OpenTelemetry.
  5. Security-Hardened: Role-based access, integration with OpenIddict/Azure AD.
  6. Cloud-Native: Designed for AKS, scaling with caching + distributed persistence.

The External Configuration Server will become the backbone of runtime flexibility in the ConnectSoft AI Factory, enabling:

  • 🔄 Safe, dynamic reconfiguration
  • 🏢 Tenant- and edition-aware overrides
  • 📊 Observability and auditability
  • ⚡ Reliable integration via APIs, gRPC, and event streams

Without ECS, the Factory remains brittle, environment-dependent, and costly to scale across tenants.


🛠 Functional & Non-Functional Requirements

📋 Functional Requirements

The External Configuration Server (ECS) must deliver a comprehensive feature set to support the ConnectSoft Factory’s scale and principles.

| Area | Functional Requirement | Description / Notes |
| --- | --- | --- |
| Configuration Management | CRUD for Config Objects | Create, update, delete, and retrieve configuration entities (TenantConfig, EditionConfig, ServiceConfig). |
| | Hierarchical Config Resolution | Resolve configuration in layers: Global → Edition → Tenant → Service. Supports overrides and fallbacks. |
| | Versioned Config | All configurations are versioned to allow rollbacks, diffs, and audit trails. |
| | Dynamic Updates | Config changes are pushed in real time via gRPC streams / Event Bus. |
| | Environment-Aware Config | Separate resolution paths for Dev, Test, Staging, Production. |
| Multi-Tenancy | Tenant Isolation | Each tenant's configuration is logically and physically isolated. |
| | Edition Overrides | Editions can apply overrides at the feature or property level. |
| Access & Security | Role-Based Access Control (RBAC) | Fine-grained policies: who can read, write, or override configs. |
| | Integration with Identity Providers | Supports OpenIddict and Azure AD for authentication and authorization. |
| | Secret Handling | Sensitive values delegated to Key Vault/Secrets Manager (not stored in ECS DB). |
| APIs & Interfaces | REST API | CRUD + query endpoints. |
| | gRPC Interface | High-performance communication for microservices. |
| | Event Streaming | Publish config changes as CloudEvents to Kafka/Service Bus/NATS. |
| | Admin UI / Studio Integration | Visualization and management of configs. |
| Observability | Full Audit Logging | Every read/write recorded with user/service identity and traceId. |
| | Metrics | Config read/write latency, cache hit/miss, per-tenant stats. |
| | Distributed Tracing | OpenTelemetry traces link config requests with service spans. |

📊 Non-Functional Requirements

| Category | Requirement | Target / Notes |
| --- | --- | --- |
| Scalability | Horizontal scaling | ECS must handle thousands of config reads/sec across 3000+ services. |
| | Caching layer | Configs cached in-memory + distributed cache (Redis) with invalidation on change. |
| Availability | High availability | ≥99.9% SLA, active-active across regions. |
| | Zero-downtime updates | Config server upgrades must not disrupt consumers. |
| Performance | Low-latency reads | <50 ms per config resolution under load. |
| | High throughput | Support ≥20K config resolutions/min per cluster. |
| Security | Data encryption | TLS 1.3 in transit, AES-256 at rest. |
| | Secret isolation | No secrets persisted; all resolved via Vault/Key Vault. |
| Reliability | Idempotency | Config change events deduplicated using semanticHash + traceId. |
| | Retry mechanisms | Client SDKs and server ensure at-least-once delivery. |
| Compliance & Audit | Full traceability | Retain config change history for ≥2 years. |
| | Policy enforcement | Built-in compliance checks (e.g., no empty values for critical configs). |

🧬 Diagram – Config Resolution Layers

flowchart TB
    Global[Global Config] --> Edition[Edition Config]
    Edition --> Tenant[Tenant Config]
    Tenant --> Service[Service Config]
    Service --> Final[Resolved Runtime Config]

    Final -->|Delivered to| Microservice

Interpretation:

  • Global defaults apply first.
  • Editions override global.
  • Tenants override editions.
  • Services override tenants.
  • Result = fully resolved, policy-compliant runtime config.
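
The merge is easiest to see as a map overlay; a minimal sketch, assuming each layer resolves to a string-keyed map (type and method names here are illustrative, not the ECS implementation):

using System.Collections.Generic;

static class ConfigOverlay
{
    // Apply layers in precedence order: Global → Edition → Tenant → Service.
    public static Dictionary<string, string> Resolve(params IReadOnlyDictionary<string, string>[] layers)
    {
        var resolved = new Dictionary<string, string>();
        foreach (var layer in layers)
            foreach (var (key, value) in layer)
                resolved[key] = value;   // the closer scope overwrites the inherited value
        return resolved;
    }
}

// Usage: var runtime = ConfigOverlay.Resolve(global, edition, tenant, service);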

The External Configuration Server must support multi-level, versioned, observable, and secure configuration management at scale. Key requirements include:

  • Functional: CRUD, versioning, overrides, real-time updates, multi-tenant enforcement.
  • Non-Functional: High availability, low-latency reads, strong security, full audit trail.

These requirements form the acceptance criteria for the PRD, guiding architecture and implementation decisions.


🛠 Domain Model & DDD Aggregates

🧩 Domain-Driven Design Context

The ECS is not just a key-value store; it is a domain service where configuration is a first-class aggregate with policies, lifecycle, and traceability. We treat configuration as structured, multi-level domain objects that evolve over time.


🏗 Key Aggregates & Entities

| Aggregate / Entity | Description | Key Properties | Relationships |
| --- | --- | --- | --- |
| TenantConfig (Aggregate Root) | Represents the configuration context of a specific tenant. | TenantId, ConfigSet, Overrides, CreatedBy, UpdatedBy, Version | Contains multiple ServiceConfigs; inherits defaults from EditionConfig. |
| EditionConfig (Aggregate Root) | Captures configuration defaults for a specific product edition (e.g., Basic, Pro, Enterprise). | EditionId, ConfigSet, Overrides, PolicyRules, Version | Linked to many TenantConfigs; overrides GlobalConfig. |
| GlobalConfig (Aggregate Root) | Global system-wide defaults applied before edition or tenant overrides. | GlobalId, DefaultValues, Policies, Version | Parent to EditionConfig. |
| ServiceConfig (Entity) | Service-specific configuration for a tenant (e.g., BillingServiceConfig). | ServiceId, ConfigItems, Version | Child of TenantConfig. |
| ConfigItem (Value Object) | Atomic key-value pair with metadata. | Key, Value, Type, IsSecret, IsFeatureToggle | Immutable; part of ServiceConfig or higher config. |
| ConfigVersion (Entity) | Snapshot of configuration state at a point in time. | VersionId, Timestamp, ChangeLog, Hash | Belongs to any Config Aggregate for rollback. |
| PolicyRule (Entity) | Validation and enforcement rules (e.g., "must not be empty", "allowed range"). | RuleId, Scope, Expression, Severity | Applied during config creation/updates. |
| AuditLog (Entity) | Records all reads/writes with full context. | LogId, Actor, Action, TraceId, Timestamp | Linked to TenantConfig/EditionConfig changes. |

📚 Relationships in ECS

erDiagram
    GlobalConfig ||--o{ EditionConfig : defines
    EditionConfig ||--o{ TenantConfig : appliesTo
    TenantConfig ||--o{ ServiceConfig : contains
    ServiceConfig ||--o{ ConfigItem : holds
    TenantConfig ||--o{ ConfigVersion : snapshots
    TenantConfig ||--o{ AuditLog : records
    PolicyRule ||--o{ ConfigItem : validates

⚖️ Aggregate Responsibilities

  • GlobalConfig – foundation of defaults across the ecosystem.
  • EditionConfig – edition-specific overrides and policies.
  • TenantConfig – tenant-specific configuration scope, isolation, and service configs.
  • ServiceConfig – service-level overrides within a tenant.
  • ConfigItem – immutable value objects ensuring integrity.
  • ConfigVersion – ensure traceability, rollback, and deterministic replays.
  • AuditLog – maintain compliance and accountability.
  • PolicyRule – ensure configs remain valid and secure.

📘 Domain Events

ECS aggregates will emit domain events (later mapped to CloudEvents):

  • ConfigCreated
  • ConfigUpdated
  • ConfigDeleted
  • ConfigVersioned
  • ConfigRollbackPerformed
  • PolicyViolationDetected

These events feed event sourcing, observability, and downstream automation (QA, DevOps, Agents).


We’ve defined the domain model for ECS using DDD principles:

  • Aggregates: GlobalConfig, EditionConfig, TenantConfig
  • Supporting entities: ServiceConfig, ConfigVersion, PolicyRule, AuditLog
  • Value objects: ConfigItem
  • Domain events for traceability

This structure ensures clear boundaries, scalability, and auditability of configurations across 3000+ services.


🛠 High‑Level Architecture (HLA)

🎯 Architecture Goals & Constraints

Goals

  • Low‑latency, highly available config reads for 3k+ services
  • Deterministic resolution across Global → Edition → Tenant → Service
  • Real‑time change propagation (streaming + events)
  • Tenant/edition isolation with RBAC and full auditability
  • Cloud‑native deployment (AKS), observability‑first, security‑first

Constraints

  • No secrets at rest in ECS (resolve by reference to Key Vault)
  • Clean Architecture + DDD boundaries
  • Event‑driven integration (CloudEvents over MassTransit/Azure Service Bus)
  • Idempotent writes and cache‑safe reads

🧱 Clean Architecture Layers (Ports & Adapters)

| Layer | Responsibilities | Examples (ECS) |
| --- | --- | --- |
| Domain | Aggregates, policies, invariants, domain events | TenantConfig, EditionConfig, PolicyRule, ConfigVersion |
| Application | Use cases, orchestration, transactions | ResolveConfig, PublishChange, CreateSnapshot, Rollback |
| Interfaces (Adapters) | REST, gRPC, events, CLI/Admin UI adapter | Controllers, gRPC services, Event publisher/subscriber |
| Infrastructure | Persistence, cache, secrets, bus, telemetry | NHibernate/Azure SQL, Redis, Blob, Key Vault client, MassTransit |

🗺️ Component Topology (Logical)

graph TD
  subgraph Clients
    AppSDK["Client SDK (.NET/JS/Java)"]
    Studio[Studio/Admin UI]
  end

  subgraph ECS Core
    API["Config API (REST/gRPC)"]
    AdminAPI[Admin API]
    Resolver[Resolver Service]
    Policy[Policy Engine]
    Notifier[Change Notifier]
    Streamer[Watch/Streamer]
    Snapshotter[Snapshot & Archive Worker]
    Projector[Audit Projector]
  end

  subgraph Data & Infra
    SQL[(Azure SQL / PostgreSQL)]
    Redis[(Redis Cache)]
    Blob[(Blob Storage: Snapshots)]
    KV[(Azure Key Vault)]
    Bus[(Azure Service Bus / Kafka via MassTransit)]
    OTEL[(OTel Exporters)]
  end

  AppSDK-->API
  Studio-->AdminAPI
  API-->Resolver
  Resolver-->Policy
  Resolver--hot read-->Redis
  Resolver--strong read-->SQL
  Resolver--secret refs-->KV
  AdminAPI--writes-->SQL
  AdminAPI--emit events-->Bus
  Notifier--fanout-->Bus
  Streamer--server push-->AppSDK
  Snapshotter--store-->Blob
  Projector--audit views-->SQL
  API--telemetry-->OTEL
  AdminAPI--telemetry-->OTEL

🔁 Core Execution Flows

1) Read/Resolve (hot path)

sequenceDiagram
  participant S as Service (Client SDK)
  participant API as Config API
  participant R as Resolver
  participant C as Redis Cache
  participant DB as SQL Store
  participant KV as Key Vault

  S->>API: GET /v1/config/resolve?tenant&env&app&prefix
  API->>R: Resolve(request, principal, scopes)
  R->>C: TryGet(prefix, etag)
  alt CacheHit
    C-->>R: Resolved payload + etag
  else Miss/Stale
    R->>DB: Query hierarchical config (G→E→T→S)
    R->>KV: Resolve secret references (if any)
    R->>C: Set(prefix, resolved, etag, TTL)
  end
  R-->>API: Resolved config + etag
  API-->>S: 200 OK (ETag, traceId)

Notes

  • ETag for client‑side caching; If‑None‑Match supported.
  • Secret values never stored in SQL/Redis — only resolved at read via Key Vault.
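
On the client, the ETag flow might look like this sketch (host, query values, and payload handling are placeholders; the endpoint shape follows the diagram above):

using System;
using System.Net;
using System.Net.Http;

var http = new HttpClient { BaseAddress = new Uri("https://ecs.internal") }; // placeholder host
string? lastKnownEtag = null;   // remembered from the previous successful read

var request = new HttpRequestMessage(HttpMethod.Get,
    "/v1/config/resolve?tenant=t-001&env=prod&app=billing&prefix=payment");
if (lastKnownEtag is not null)
    request.Headers.TryAddWithoutValidation("If-None-Match", lastKnownEtag);

var response = await http.SendAsync(request);
if (response.StatusCode == HttpStatusCode.NotModified)
{
    // 304: the locally cached, already-resolved config is still current.
}
else if (response.IsSuccessStatusCode)
{
    lastKnownEtag = response.Headers.ETag?.ToString();   // keep for the next request
    var payload = await response.Content.ReadAsStringAsync();
}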

2) Write/Publish (admin path)

sequenceDiagram
  participant Admin as Admin UI/CLI
  participant AAPI as Admin API
  participant DB as SQL
  participant Policy as Policy Engine
  participant Bus as Event Bus
  participant C as Redis

  Admin->>AAPI: POST /v1/config/{scope} (If-Match: etag)
  AAPI->>Policy: Validate (syntax, rules, scope, RBAC)
  Policy-->>AAPI: OK/Violation
  alt OK
    AAPI->>DB: Upsert aggregate (new version)
    AAPI->>Bus: Publish ConfigUpdated(cloudevent)
    AAPI->>C: Invalidate affected prefixes
    AAPI-->>Admin: 202 Accepted (versionId)
  else Violation
    AAPI-->>Admin: 400/409 with details
  end

Notes

  • Optimistic concurrency via ETag/Version.
  • Idempotency key (semanticHash, traceId) to dedupe repeated submissions.
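
One plausible derivation of the semanticHash half of that key, assuming it is a stable hash over a canonicalized payload (the PRD does not pin down the canonicalization, so this is a sketch):

using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

static string SemanticHash(object payload)
{
    // Illustrative canonicalization: a real de-duplication key would also pin down
    // key ordering, whitespace, and number formatting before hashing.
    var json = JsonSerializer.Serialize(payload);
    var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(json));
    return Convert.ToHexString(bytes).ToLowerInvariant();
}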

3) Watch/Streaming updates

  • gRPC streaming or SSE/WebSocket.
  • Server pushes diffs and new etag/version when relevant keys change.
  • SDK applies delta and refreshes local cache; raises typed callbacks.

4) Snapshot & Rollback

  • Snapshotter runs scheduled jobs or on‑demand to persist full resolved views to Blob.
  • Rollback creates a new version from a prior snapshot (never mutates history).
  • Projector maintains audit/read models for Studio timelines.

🧠 Tenancy, Editions & Scoping

  • Scope model: Global → Edition → Tenant → Environment → Service → Tag(s)
  • Deterministic precedence; last‑write wins at same level with version guards.
  • All queries filtered by tenantId, edition, environment, and RBAC scopes.
  • Policy Engine enforces forbidden prefixes and required keys per scope.

🚀 Performance & Caching Strategy

  • Read‑optimized: Redis front‑cache for prefixes/namespaces; stale‑while‑revalidate.
  • Write invalidation: precise key‑space eviction (prefix wildcards).
  • Batched DB queries with computed overlay merge at the server (single roundtrip).
  • Hot paths instrumented with P95 targets; background warmers for critical tenants.

🔐 Security Architecture

  • AuthN: OIDC (OpenIddict/AAD).
  • AuthZ: Tenant‑scoped RBAC (Admin/Operator/Reader/Auditor). Fine‑grained by key prefix and scope.
  • Data: TLS 1.3, encryption at rest, allow‑list of output content types (JSON/YAML/INI).
  • No secret persistence: store only references; resolve via Key Vault with managed identity.
  • Signed CloudEvents to prevent tampering; event payload links to audit record.

📡 Observability & Audit

  • OpenTelemetry traces on every request (traceId, tenantId, actor, scope, etag, result).
  • Metrics: read latency, cache hit ratio, per‑tenant QPS, write throughput, policy violations, stream fan‑out lag.
  • Structured logs with redaction and PII guards.
  • Audit Projector provides timeline views: who changed what, when, why (reason/issue link).

🧩 Persistence Model

  • Relational store (Azure SQL/PostgreSQL) for aggregates, versions, policies, audit projections.
  • Blob for snapshots/archives and large documents.
  • Redis for hot key‑spaces (read path).
  • Event Bus for reactive consumers (cache invalidation, stream fan‑out, CI hooks).

🛠 Technology Choices (concrete)

  • .NET 8 / ASP.NET Core
  • NHibernate (relational persistence), FluentValidation for DTOs
  • MassTransit + Azure Service Bus (events)
  • Redis (StackExchange.Redis), Azure SQL/PG (primary store), Azure Blob, Key Vault
  • gRPC for SDK channel; REST for admin and operational APIs
  • OpenTelemetry + Serilog; Grafana/Prometheus/App Insights

🧯 Reliability Patterns

  • Idempotent writes (semanticHash + traceId)
  • Retry with jitter for transient DB/Bus ops; circuit breakers around KV/DB
  • Backpressure on stream fan‑out; per‑tenant rate limits
  • Dead‑letter topics for failed events; replay from snapshots

🧪 Failure Scenarios & Mitigations (examples)

| Scenario | Mitigation |
| --- | --- |
| Redis outage | Fall back to DB reads (degraded latency); disable SWR; raise result TTL on recovery |
| Key Vault throttling | Cache positive resolutions with short TTL; exponential backoff; circuit to a "secret-unavailable" marker |
| Event Bus lag | Streamer falls back to pull-mode polling; admin banner indicates lagged propagation |
| Hot-spot tenant | Per-tenant cache partitioning; prefix sharding; SDK exponential backoff |

☸️ Deployment Topology (AKS)

  • Stateless API pods behind internal LB; HPA based on RPS/latency
  • Streamer replicas scaled by subscription count
  • Workers (Snapshotter/Projector) with KEDA triggers (queue depth/CRON)
  • Zonal redundancy; rolling upgrades; PodDisruptionBudgets
  • Helm charts with environment overlays; GitOps optional

📑 ADR Backlog (to be authored)

  1. ADR‑001: Hierarchical Resolution Strategy (server‑side overlay, determinism)
  2. ADR‑002: Secrets by Reference vs inline storage (KV integration)
  3. ADR‑003: Event Transport selection (ASB vs Kafka) and CloudEvents schema
  4. ADR‑004: Cache Topology (prefix caches, invalidation, SWR)
  5. ADR‑005: Streaming Protocol (gRPC vs SSE/WebSockets) and backpressure
  6. ADR‑006: Persistence (Azure SQL vs PostgreSQL) and NHibernate mappings
  7. ADR‑007: RBAC Model (tenant/environment/key‑prefix scopes)
  8. ADR‑008: Snapshot & Rollback semantics (append‑only versioning)
  9. ADR‑009: Observability Defaults (required spans, logs, metrics)

We defined the ECS high‑level architecture: clean layering, core components, execution flows, tenancy/security enforcement, observability, and cloud‑native deployment. This blueprint is ready for detailed APIs, schemas, and SDK contracts.


✅ ECS PRD: Functional & Non‑Functional Requirements (Ready for ADO Epics)

Scope: External Configuration Server (ECS) for multi‑tenant, edition‑aware, policy‑driven runtime configuration across ConnectSoft microservices (.NET, Clean Architecture, DDD, EDA). Outcome: A complete PRD slice that we can decompose into Azure DevOps Epics/Features/Stories in the next cycle.


🧭 Product Scope & Boundaries

In‑scope

  • Centralized configuration resolution and delivery for services (REST/gRPC/SDK).
  • Multi‑tenant + multi‑edition overrides with policy enforcement (RBAC, ABAC).
  • Environment layering (global → environment → edition → tenant → service → instance).
  • Versioning, rollout, preview, audit, and rollback.
  • Observability (metrics, logs, traces, audit trail) and event notifications.
  • Integrations: Azure Service Bus (events), Azure Key Vault (secrets reference), Redis (edge cache).

Out‑of‑scope (v1)

  • Secrets storage (managed by Key Vault; ECS stores references).
  • Feature experimentation framework (flags supported; experiments later).
  • UI Studio admin (initially minimal CRUD UI; advanced UX later).

🧩 Domain Model (DDD)

Aggregates

  • Tenant (TenantId, Name, Status, Editions[])
  • Edition (EditionId, Name, PolicySet)
  • ConfigBundle (BundleId, Scope, Keys[], Version, CreatedBy, CreatedAt)
  • ConfigItem (Key, Value, Type, SchemaRef, Metadata)
  • Policy (RBAC roles, ABAC conditions, constraints e.g., max TTL)
  • ResolutionRequest (ServiceId, TenantId, EditionId, Environment, InstanceTags[])
  • ResolutionResult (ResolvedMap, VersionGraph, SourceTrace[], ETag)

Scopes & Precedence (highest wins)

  1. Instance (Pod/Slot)
  2. Service (Microservice)
  3. Tenant
  4. Edition
  5. Environment (dev/stage/prod)
  6. Global

Conflict resolution = closest scope wins, then latest effective version, then policy constraint.


🧠 Core Functional Requirements

1) Read/Resolve

  • FR‑R1: Resolve configuration at runtime: Resolve(serviceId, tenantId, editionId, env, tags[]) → map
  • FR‑R2: Conditional resolution with preview mode (no write/audit side‑effects).
  • FR‑R3: Strong caching: ETag support; 304 semantics; per-scope TTL.
  • FR‑R4: Watch/Subscribe for change events via gRPC streaming or SSE.
  • FR‑R5: ResolveTrace: return provenance (which scopes/versions produced each key).

2) Write/Manage

  • FR‑W1: CRUD for Bundles & Items per scope; batch upserts with schema validation.
  • FR‑W2: Versioning on each change; FR‑W3: Rollback to prior version.
  • FR‑W4: Policy aware writes (RBAC roles + ABAC e.g., “only Ops may change prod”).
  • FR‑W5: Dry‑run validation (schema + policy + conflict preview).

3) Policy & Security

  • FR‑P1: RBAC roles: ECS.Admin, ECS.Editor, ECS.Auditor, ServiceAccount.
  • FR‑P2: ABAC conditions: environment, tenant, edition, label selectors.
  • FR‑P3: Audit every read/write (who/what/when/from where, masked values).
  • FR‑P4: Secrets as references: kvref://{vault}/{secretName}@{version}
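
A sketch of resolving such a reference with the Azure Key Vault SDK; the kvref:// scheme is the PRD's, while the parsing and the vault-URI convention here are assumptions:

using System;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

// kvref://{vault}/{secretName}@{version} → fetch at read time; the value is never persisted.
static async Task<string> ResolveKvRefAsync(string kvref)
{
    var uri = new Uri(kvref);                            // e.g. kvref://myvault/db-password@v2
    var parts = uri.AbsolutePath.Trim('/').Split('@');   // ["db-password", "v2"]
    var client = new SecretClient(
        new Uri($"https://{uri.Host}.vault.azure.net"),
        new DefaultAzureCredential());                   // managed identity in Azure
    KeyVaultSecret secret = await client.GetSecretAsync(
        parts[0], version: parts.Length > 1 ? parts[1] : null);
    return secret.Value;
}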

4) Events & Integration

  • FR‑E1: Emit CloudEvents: ConfigChanged, PolicyChanged, BundlePublished, RollbackPerformed.
  • FR‑E2: Outbox pattern to Azure Service Bus; retries + idempotency keys.
  • FR‑E3: SDK callback hooks for hot‑reload in services.

5) Observability

  • FR‑O1: OTEL spans for resolve & write; FR‑O2: metrics (p95 resolve latency, cache hit ratio, 4xx/5xx).
  • FR‑O3: Structured logs with traceId, tenantId, serviceId, scope, keyCount.

🔒 Security & Compliance Requirements (NFR‑Security)

  • NFR‑S1: OIDC (OpenIddict/Azure AD). Service‑to‑service via client credentials; scopes per operation.
  • NFR‑S2: Data encryption at rest (DB) + TLS in transit.
  • NFR‑S3: PII/secret redaction in logs & audit streams; schema flags: sensitivity: pii|secret.
  • NFR‑S4: Multi‑tenant isolation in data partition & query filters; per‑tenant keys/ETags.

⚙️ Performance & Reliability (NFR‑Perf/Rel)

  • NFR‑P1: p95 resolve ≤ 30 ms (hot path with Redis); cold ≤ 150 ms.
  • NFR‑P2: Peak read QPS ≥ 20k (horizontally scalable API replicas).
  • NFR‑R1: Availability SLO 99.95% for read APIs.
  • NFR‑R2: Zero‑downtime publish; rolling upgrades; blue/green for storage migrations.
  • NFR‑C1: Config size per resolution ≤ 512 KB (soft limit), item size ≤ 16 KB (hard).

🧪 Quality & Validation (NFR‑QA)

  • NFR‑Q1: Contract tests for REST/gRPC (& SDK) with golden fixtures.
  • NFR‑Q2: Chaos tests: cache node loss, bus outage, DB failover; system remains read‑available (stale‑ok).
  • NFR‑Q3: Security tests for RBAC/ABAC bypass attempts; negative path coverage ≥ 95% rules.

🌉 External Interfaces

REST (subset)

  • GET /v1/resolve?serviceId&tenantId&editionId&env&tags=... → 200 {data, eTag, trace}
  • GET /v1/watch?serviceId... (SSE) → event: ConfigChanged
  • POST /v1/bundles/{scope} (upsert; JSON Schema validation)
  • POST /v1/preview/resolve (dry‑run)
  • POST /v1/rollback (bundleId, targetVersion)

gRPC

  • Resolve() unary; Watch() server stream with backoff hints.

Events (Azure Service Bus topics)

  • ecs.config.changed
  • ecs.bundle.published
  • ecs.policy.changed
  • ecs.bundle.rolledback

🧰 Client SDK (dotnet)

Package: ConnectSoft.Ecs.Client. Features:

  • Typed options binding: services.AddEcsConfig<TOptions>("namespace:keyPrefix")
  • Hot reload via IOptionsMonitor and gRPC watch
  • Circuit‑breaker + per‑key caching with ETag
  • Secrets resolver (kvref://) with Key Vault client
builder.Services.AddEcsClient(o =>
{
  o.ServiceId = "billing-service";
  o.Environment = "prod";
  o.Edition = "enterprise";
  o.TenantIdProvider = () => TenantContext.Current?.TenantId;
});
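
Consumption then follows the standard .NET options pattern; a sketch assuming the AddEcsConfig<TOptions> binding listed above, with BillingLimits as a hypothetical options type:

using Microsoft.Extensions.Options;

public sealed class BillingLimits          // hypothetical options type bound from ECS keys
{
    public int MaxConcurrentJobs { get; set; }
}

public sealed class BillingWorker
{
    private readonly IOptionsMonitor<BillingLimits> _limits;

    public BillingWorker(IOptionsMonitor<BillingLimits> limits) => _limits = limits;

    // CurrentValue reflects the latest resolved config after a watch-driven hot reload.
    public int CurrentLimit => _limits.CurrentValue.MaxConcurrentJobs;
}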

🏗️ High‑Level Architecture

flowchart LR
  subgraph Clients
    SvcA[Service A]:::svc -->|Resolve/Watch| API
    SvcB[Service B]:::svc -->|Resolve/Watch| API
  end

  subgraph ECS
    API[REST/gRPC API]:::core
    Resolver[Resolver Engine]:::core
    Policy[Policy Engine RBAC/ABAC]:::core
    Cache[(Redis Edge Cache)]:::infra
    Store[(Config DB)]:::infra
    Outbox[Outbox]:::core
    Bus[[Azure Service Bus]]:::infra
    Audit[(Audit Store)]:::infra
    KV[[Azure Key Vault]]:::infra
  end

  API --> Resolver --> Cache
  Resolver <--> Store
  Resolver --> Policy
  Resolver --> KV
  API --> Outbox --> Bus
  API --> Audit

  classDef core fill:#E0F2FE,stroke:#38BDF8,stroke-width:1.2px;
  classDef infra fill:#F5F3FF,stroke:#8B5CF6,stroke-width:1.2px;
  classDef svc fill:#ECFCCB,stroke:#84CC16,stroke-width:1.2px;
Hold "Alt" / "Option" to enable pan & zoom

Storage options (pluggable via Clean Architecture):

  • Primary: SQL (PostgreSQL/SQL Server) via NHibernate.
  • Optional: Cosmos DB provider.
  • Cache: Redis (clustered).

🔁 Resolution Algorithm (Deterministic)

  1. Identify scope chain from request metadata.
  2. Load latest effective versions for each scope (bundle snapshots).
  3. Merge maps top‑down (global→env→edition→tenant→service→instance).
  4. Apply policy constraints (deny/override/mask).
  5. Expand secret references (Key Vault) if caller has scope (allowInlineSecrets=false by default).
  6. Compute ETag (stable hash); return map + SourceTrace.

Edge cases:

  • Missing keys → default value policy (deny/allow default).
  • Conflicts → highest precedence wins; policy can block high precedence if violating constraints.
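
Step 6 can be sketched as a stable hash over the sorted, resolved map; the weak-ETag format below is an assumption, since the document does not fix one:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static string ComputeEtag(IReadOnlyDictionary<string, string> resolved)
{
    // Sort keys so the same logical map always produces the same ETag.
    var canonical = string.Join("\n",
        resolved.OrderBy(kv => kv.Key, StringComparer.Ordinal)
                .Select(kv => $"{kv.Key}={kv.Value}"));
    var hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
    return $"W/\"{Convert.ToHexString(hash)[..16].ToLowerInvariant()}\"";   // weak ETag form
}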

📊 Observability Contract

Metrics

  • ecs_resolve_latency_ms (p50/p95/p99)
  • ecs_cache_hit_ratio
  • ecs_resolve_qps
  • ecs_write_errors_total
  • ecs_stream_clients_gauge

Logs & Traces

  • Enriched with traceId, tenantId, serviceId, scope, bundleId, version, redaction flags.

Audit

  • who, when, what, where (ip/user-agent), before/after diff (masked).

🚀 Delivery & Rollout

  • Blue/Green API pods, Redis with keyspace notifications off (we push invalidate signals via Bus).
  • Write path: transactional write → outbox → publish → cache invalidate (fan‑out key patterns).
  • Hot Reload: streaming watchers receive ConfigChanged with ETag → services re‑resolve.

🧯 Failure Modes & Recovery

  • Cache outage → fallback to DB (latency up; circuit policy).
  • Bus outage → outbox retry; watchers retry with jitter.
  • DB outage → serve the last-known resolved config (by ETag) from cache for N minutes (stale‑ok policy); emit degraded events.

📐 Acceptance Criteria (Representative)

  • AC‑R‑01: Given a tenant + service + edition chain, when a config key exists in multiple scopes, then the service‑scoped value is returned and SourceTrace lists all overridden values.
  • AC‑W‑02: A write with invalid schema returns 422 and does not create a new version; audit shows failure reason.
  • AC‑P‑03: A user with ECS.Editor but no ABAC condition granting env=prod receives 403 when attempting to write prod config.
  • AC‑E‑04: On successful publish, a ConfigChanged event is emitted and ≥95% subscribed services observe change within 3s (under nominal load).
  • AC‑O‑05: p95 resolve latency ≤ 30 ms with warm cache under 5k RPS.

📦 Data Model (Simplified)

ConfigBundle:
  bundleId: guid
  scope:
    level: Global|Environment|Edition|Tenant|Service|Instance
    identifiers: { environment?, editionId?, tenantId?, serviceId?, instanceId? }
  version: int
  items: [ConfigItem]
  tags: [string]
  checksum: string
  createdBy: userId
  createdAt: datetime
  status: Draft|Published|Deprecated

ConfigItem:
  key: string
  type: string # string|int|bool|json|uri|kvref
  value: any
  schemaRef: uri?
  metadata: { sensitivity?: pii|secret, ttl?: int, description?: string }
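
For orientation, the same model sketched as C# records; field names follow the YAML above, but this is illustrative, not a committed schema:

using System;
using System.Collections.Generic;

public enum ScopeLevel { Global, Environment, Edition, Tenant, Service, Instance }
public enum BundleStatus { Draft, Published, Deprecated }

public sealed record ConfigItem(
    string Key,
    string Type,                                        // string|int|bool|json|uri|kvref
    object? Value,
    Uri? SchemaRef,
    IReadOnlyDictionary<string, string>? Metadata);     // sensitivity, ttl, description

public sealed record ConfigBundle(
    Guid BundleId,
    ScopeLevel Level,
    IReadOnlyDictionary<string, string> Identifiers,    // environment?, editionId?, tenantId?, ...
    int Version,
    IReadOnlyList<ConfigItem> Items,
    IReadOnlyList<string> Tags,
    string Checksum,
    string CreatedBy,
    DateTimeOffset CreatedAt,
    BundleStatus Status);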

🗺️ Azure DevOps Backlog Shape (Preview — detailed breakdown next cycle)

Epic A — Resolution & Delivery

  • Feature A1: Resolve API + gRPC + ETag
  • Feature A2: Watch/Subscribe streaming
  • Feature A3: Redis caching + invalidation

Epic B — Authoring & Versioning

  • Feature B1: Bundle/Item CRUD + batch upsert
  • Feature B2: Versioning & rollback
  • Feature B3: Preview/dry‑run + diff

Epic C — Policy & Security

  • Feature C1: RBAC/ABAC engine
  • Feature C2: Audit & redaction
  • Feature C3: OIDC integration & scopes

Epic D — Events & SDK

  • Feature D1: CloudEvents + Outbox to ASB
  • Feature D2: .NET SDK (OptionsMonitor, KeyVault resolver)
  • Feature D3: Sample integration in 2 reference services

Epic E — Observability & SLOs

  • Feature E1: OTEL spans, metrics, logs
  • Feature E2: Dashboards & SLO checks
  • Feature E3: Chaos/resiliency tests

📌 Definitions of Done (per Feature)

  • ✅ API/gRPC contract approved (OpenAPI/proto + breaking change check).
  • ✅ Unit/integration/contract tests ≥ 85% line/branch in feature scope.
  • ✅ OTEL + logs + metrics + audit events present & validated.
  • ✅ Security checks (RBAC/ABAC/PII redaction) pass.
  • ✅ Load tests meet SLOs (p95, error rate, cache hit).
  • ✅ Docs: README, runbook, examples, SDK snippet.

🔗 Integration Points & APIs

🎯 Purpose

The External Configuration Server (ECS) must expose consistent, secure, and flexible APIs to integrate with all consumers in the ConnectSoft AI Factory ecosystem. These interfaces enable real-time retrieval, updates, and subscriptions to configuration values, ensuring both human and machine actors interact seamlessly.


📡 Integration Interfaces

| Interface Type | Purpose | Consumers |
| --- | --- | --- |
| REST API | CRUD operations on tenant/edition configs, metadata management | Studio UI, DevOps tools, external systems |
| gRPC | High-performance, strongly typed API for runtime config resolution | Microservices, Agents (low-latency config fetch) |
| Event Bus | Publish/subscribe to configuration changes as CloudEvents v1.0 | Domain services, Architect Agents, QA Agents |
| SDKs/Libraries | Thin client libraries (C#, TypeScript, Python) for easy config injection | Factory microservices, test harnesses |

📘 API Contracts

REST

  • GET /config/{tenant}/{service} → Retrieve active config
  • POST /config/{tenant}/{service} → Create/update config
  • GET /history/{tenant}/{service} → Fetch config audit trail
  • POST /rollback/{tenant}/{service} → Rollback to prior version

gRPC

  • ResolveConfig(ResolveRequest) → ResolveResponse
  • StreamConfigUpdates(SubscriptionRequest) → stream ConfigEvent

Events (CloudEvents)

  • Type: factory.ecs.v1.config.updated
  • Attributes: tenantId, editionId, serviceName, version, traceId, semanticHash
  • DataRef: points to signed configuration snapshot in blob store

🧩 SDK & Client Libraries

  • C# ECS.Client
    • ConfigProvider.Resolve(service, tenant)
    • ConfigProvider.Subscribe(service, tenant, callback)
  • TypeScript SDK (for UI & frontend agents)
    • Hooks for live subscription updates
  • Python SDK (for AI agents & ML services)
    • Easy access to tenant configs in training/inference pipelines


🏗 Diagram – ECS Integration Surfaces

flowchart LR
    StudioUI -->|REST| ECS
    DevOpsAgent -->|REST| ECS
    MicroserviceA -->|gRPC| ECS
    MicroserviceB -->|gRPC| ECS
    ArchitectAgent -->|Events| ECS
    QAAgent -->|Events| ECS
    ECS -->|SDKs| ClientLibs[(C#/TS/Python SDKs)]

📘 Principles for Integration

  1. Idempotency: All API updates are idempotent (semanticHash-based).
  2. Traceability: Each API call returns a traceId for correlation.
  3. Security-First: All endpoints secured with OAuth2 scopes (e.g., ecs.read, ecs.write).
  4. Versioning: APIs versioned (/v1/, vNext) with strong backward compatibility.
  5. Polyglot Support: SDKs generated using OpenAPI/gRPC tooling for multiple languages.

➡️ With these integration points, ECS becomes an accessible backbone service, allowing any microservice, agent, or human operator to retrieve and react to configuration in real-time.


🏢 Multi-Tenancy & Edition Management

🎯 Purpose

The External Configuration Server (ECS) must support multi-tenancy and edition-aware configuration management as a first-class concern. This ensures that different customers (tenants) and their specific editions (pricing or functional tiers) can receive isolated, policy-driven configurations without conflict.


🌍 Multi-Tenancy Principles

  • Isolation by Tenant → Configurations are stored, resolved, and audited per tenant boundary.
  • Shared Infrastructure, Segregated Data → ECS runs as a shared service, but configuration data is tenant-scoped.
  • RBAC-Scoped Access → Only authorized roles (e.g., Tenant Admin, Factory Operator) can mutate tenant-specific configs.
  • Noisy Neighbor Protection → Rate limiting and quotas prevent one tenant from overloading ECS capacity.

🧩 Edition Awareness

  • Base Business Model → Provides default configuration values (global, edition-agnostic).
  • Edition Overrides → Editions (e.g., Free, Standard, Enterprise) override global values for functionality toggles, limits, or integrations.
  • Tenant Overrides → Individual tenants may further override edition defaults, within defined policies.
  • Policy Guardrails → Edition-level limits (e.g., “Enterprise supports unlimited API calls, Free has 10k/month”) are enforced automatically.

📘 Hierarchical Resolution Model

Config resolution follows a hierarchical override chain:

Global Default → Edition Config → Tenant Config → Runtime Override

Example:

  • Default: MaxUsers = 100
  • Edition: Enterprise → MaxUsers = 1000
  • Tenant ABC (Enterprise) → MaxUsers = 1500 (if allowed by policy)

🔐 Security & Governance

  • Per-Tenant Encryption → Config values encrypted with tenant-specific keys in Azure Key Vault.
  • Audit Scope → Every change linked to tenant, edition, actor, and traceId.
  • Cross-Tenant Safety → ECS enforces absolute separation—no tenant can read another’s configuration.

🏗 Diagram – Multi-Tenant & Edition Config Flow

flowchart TD
    Global[🌐 Global Config] --> Edition[📦 Edition Config]
    Edition --> Tenant[🏢 Tenant Config]
    Tenant --> Runtime[⚡ Runtime Override]

    Runtime --> ECS[(External Config Server)]
    ECS --> ServiceA[Service A]
    ECS --> ServiceB[Service B]

📊 Example Use Case

| Tenant | Edition | Config Key | Value |
| --- | --- | --- | --- |
| Global | Default | FeatureX | OFF |
| All | Enterprise | FeatureX | ON |
| ABC | Enterprise | FeatureX | ON, custom rate 500/s |
| XYZ | Free | FeatureX | OFF |

📘 Principles for Multi-Tenant Management

  1. Edition-Aware Inheritance: Every tenant config inherits from its edition baseline.
  2. Policy-Driven Overrides: Tenants can override edition configs only within guardrails.
  3. Strict Isolation: No cross-tenant data visibility or leakage.
  4. Auditability: All tenant/edition config changes fully traceable.
  5. Dynamic Runtime Resolution: Overrides are resolved on every config fetch (no redeploys).

➡️ With this, ECS becomes SaaS-ready: one service, securely serving thousands of tenants with multiple editions, while preserving isolation, scalability, and flexibility.


💾 Persistence & Storage Design

🎯 Purpose

The External Configuration Server (ECS) requires a robust persistence layer to store, version, and retrieve configuration data across tenants, editions, and environments. This cycle defines how ECS persists configuration artifacts, ensures durability, and enables auditability at scale.


🏛 Core Principles

  1. Multi-Model Storage → Structured metadata in SQL; flexible config documents in NoSQL.
  2. Immutable Versioning → Every change creates a new version; rollback is always possible.
  3. Separation of Concerns → Config metadata (who/when/where) stored apart from config payloads (what).
  4. Tenant & Edition Isolation → Partitioning strategies prevent cross-tenant conflicts.
  5. Cloud-Native Durability → Built on Azure SQL + Cosmos DB + Key Vault with geo-redundancy.

🗂 Data Model

Entities

  • ConfigItem
    • ConfigId (GUID)
    • TenantId
    • EditionId
    • Environment (Dev, QA, Prod)
    • PayloadRef (pointer to storage blob/NoSQL doc)
  • ConfigVersion
    • VersionId
    • ConfigId (FK)
    • Hash (SHA-256 for integrity)
    • CreatedAt
    • CreatedBy (User/Service ID)
    • ChangeReason
  • AuditLog
    • EntryId
    • ConfigId
    • VersionId
    • Actor
    • Action (Create, Update, Rollback)
    • TraceId

🗄 Storage Technologies

| Layer | Technology | Purpose |
| --- | --- | --- |
| Metadata | Azure SQL | Strong consistency for IDs, references, indexes |
| Payload Storage | Azure Cosmos DB | Flexible JSON docs for config payloads |
| Secrets | Azure Key Vault | Encrypt sensitive values per-tenant |
| Backups/Archives | Azure Blob | Cold storage of snapshots & exports |

🔄 Versioning & History

  • Every config update produces a new immutable version.
  • Versions can be queried:
    • Latest only (default)
    • Specific version (for rollback/testing)
    • Range queries (audit & troubleshooting)
  • ECS guarantees event-sourced persistence → “What changed, when, and by whom” is always reconstructable.

🗃 Partitioning & Indexing

  • TenantId + EditionId = Primary partition key in Cosmos DB.
  • Environment indexed for fast lookups.
  • Hot configs cached in Redis for ultra-low-latency retrieval.
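
A hedged sketch of writing a payload under that partition key with the Cosmos DB .NET SDK; the database/container names, the document shape, and the /partitionKey path are assumptions:

using System;
using Microsoft.Azure.Cosmos;

var client = new CosmosClient(
    accountEndpoint: "https://ecs.documents.azure.com",
    authKeyOrResourceToken: "<key>");                // placeholders
var container = client.GetContainer("ecs", "configPayloads");

var document = new
{
    id = Guid.NewGuid().ToString(),                  // ConfigId
    partitionKey = "tenant-abc|enterprise",          // synthetic key: TenantId + EditionId
    environment = "Production",                      // indexed for fast lookups
    payload = new { MaxConcurrentJobs = 500 }
};
await container.CreateItemAsync(document, new PartitionKey(document.partitionKey));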

🔐 Security

  • All payloads encrypted at rest (AES-256).
  • Per-tenant encryption keys in Key Vault.
  • Fine-grained RBAC on read/write at DB layer.
  • Write operations only through ECS API (no direct DB access).

🏗 Diagram – Persistence Flow

flowchart TD
    App[Client Service] --> ECS[(External Config Server)]
    ECS --> SQL[(Azure SQL Metadata)]
    ECS --> COSMOS[(Cosmos DB Payloads)]
    ECS --> KeyVault[(Azure Key Vault)]
    ECS --> Blob[(Azure Blob Snapshots)]

📊 Example

  • Tenant: VetClinicCo
  • Edition: Enterprise
  • Env: Production
  • Config: MaxConcurrentJobs = 500

Version history:

  • v1 → 100 (default)
  • v2 → 200 (edition override)
  • v3 → 500 (tenant override, policy-approved)

All versions persist in Cosmos DB with metadata in SQL and secured values in Key Vault.


✅ Principles Recap

  1. Immutable Versioning – No destructive updates.
  2. Multi-Model Storage – SQL + NoSQL hybrid design.
  3. Per-Tenant Isolation – Partition + encryption per tenant.
  4. Audit-First – Every change logged with traceability.
  5. Cloud-Native Resilience – High availability + geo-redundant backup.

🔐 Security & Access Control

🎯 Goals

The External Configuration Server (ECS) must be secure by design, ensuring that only authenticated and authorized actors can access or modify configuration data. Security must address multi-tenancy, API surface hardening, and end-to-end trust across environments.


🧩 Security Requirements

| Requirement | Implementation Strategy |
| --- | --- |
| Authentication | OpenIddict / Azure AD integration for OAuth2 & OIDC. |
| Authorization (RBAC/ABAC) | Fine-grained role-based and attribute-based access per tenant, edition, and environment. |
| Tenant Isolation | Scoped tokens that prevent cross-tenant data access; tenant ID enforced in every query. |
| Secure Communication | gRPC over mTLS + REST endpoints with TLS 1.3. |
| Secrets Handling | Integration with Key Vault (Azure Key Vault or HashiCorp Vault) for secret material. |
| Policy-Driven Overrides | Policies that enforce configuration rules at runtime (e.g., an edition cannot override tenant-global settings). |
| Audit & Traceability | Every read/write secured with claims and logged with trace context. |

🔑 Authentication & Identity Flow

sequenceDiagram
    participant Client as Microservice/Agent
    participant ECS as External Config Server
    participant IDP as Identity Provider (OpenIddict/Azure AD)

    Client->>IDP: Request Token (Client Credentials / On-Behalf-Of)
    IDP-->>Client: Access Token (JWT w/ claims)
    Client->>ECS: Call API w/ Bearer Token
    ECS->>IDP: Validate Token (signature, expiry, audience, tenant claims)
    ECS-->>Client: Authorized Response (config values)
  • Supported Grants: Client Credentials (service-to-service), Authorization Code (Studio UI), On-Behalf-Of (agent delegation).
  • JWT Validation: Issuer, audience, expiry, tenant ID, roles.
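
On the ECS side, that validation might be wired as in this ASP.NET Core sketch; the authority, audience, and tenant claim name are placeholders:

using Microsoft.AspNetCore.Authentication.JwtBearer;
using Microsoft.IdentityModel.Tokens;

var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddAuthentication(JwtBearerDefaults.AuthenticationScheme)
    .AddJwtBearer(options =>
    {
        options.Authority = "https://login.example.com/";   // OpenIddict / Azure AD issuer (placeholder)
        options.TokenValidationParameters = new TokenValidationParameters
        {
            ValidateIssuer = true,
            ValidateAudience = true,
            ValidAudience = "ecs-api",                       // placeholder audience
            ValidateLifetime = true
        };
    });

// Tenant claims are then enforced per request, e.g.:
// var tenantId = httpContext.User.FindFirst("tenantId")?.Value;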

🛡️ Authorization Model

Roles

  • ConfigAdmin → Full control over tenant configs.
  • ConfigEditor → Create/update configs within tenant scope.
  • ConfigReader → Read-only access.
  • SystemObserver → Access to audit/logs for compliance.

Enforcement

  • RBAC enforced at API layer (per endpoint).
  • ABAC used for policy enforcement (e.g., “Edition config overrides allowed only if flag enabled”).

🏢 Tenant Isolation

  • Each request must carry a TenantID claim (in token).
  • ECS enforces row-level isolation at persistence level (per-tenant schemas or partition keys).
  • No cross-tenant visibility in APIs or events.

🔐 Secrets Handling

  • ECS does not store secrets directly; instead, it stores references (URIs, Key Vault keys).
  • Integration with:
    • Azure Key Vault for cloud deployments.
    • HashiCorp Vault for hybrid setups.
  • ECS fetches secrets at runtime with least-privilege identity (Managed Identity in Azure).

🔏 Secure API Surface

  • REST → secured via TLS 1.3 + OAuth2 Bearer tokens.
  • gRPC → secured via mTLS (mutual certificate-based auth) + OAuth2.
  • Rate limiting & throttling to prevent abuse.
  • CORS restrictions for Studio UI.

⚖️ Policy-Driven Overrides

  • Global → Edition → Tenant → Environment → Service hierarchy.
  • Policies enforce:
    • No tenant override of system-critical keys.
    • Edition policies can restrict tenant-level overrides.
    • Environments can enforce stricter security keys (prod > dev).

📊 Audit & Compliance

  • Every config read/write logged with:
    • Actor identity (user/service).
    • Tenant & edition.
    • Trace context (OpenTelemetry).
  • Immutable audit logs → stored in append-only storage (e.g., Azure Blob immutability policies).
  • Compliance: SOC2, GDPR (data access restrictions), ISO 27001 alignment.

✅ ECS gains end-to-end trust, fine-grained authorization, and policy-driven safeguards—making it resilient against misuse while enabling flexible tenant/edition operations.


📈 Observability & Monitoring

🎯 Goals

Give the External Configuration Server (ECS) complete, actionable visibility over read/write paths and propagation so teams can detect, debug, and prevent issues fast.

  • Metrics: SLOs/SLA tracking for resolve latency, availability, cache hit ratio, event lag.
  • Tracing: End-to-end spans across client → ECS → cache/DB → Key Vault → bus.
  • Logging: Structured, privacy-safe logs with correlation.
  • Audit Trails: Immutable evidence of who changed what, when, where, and why.
  • Dashboards & Alerts: Grafana/App Insights views and guardrail alerts wired to on-call.

🔭 Signal Model (four pillars)

| Pillar | Purpose | Where |
| --- | --- | --- |
| Metrics | SLOs, trends, capacity | Prometheus / App Insights Metrics |
| Traces | Causality & latency breakdown | OpenTelemetry → OTLP exporter |
| Logs | Forensic details & exceptions | Serilog → ELK / App Insights |
| Audit | Compliance-grade change evidence | Append-only store + query view |

🧩 Telemetry Architecture

flowchart LR
  Client[(Service/SDK)] --OTEL ctx--> API[Config API]
  API --> Resolver
  API --> AdminAPI
  Resolver --> Redis[(Redis)]
  Resolver --> SQL[(SQL)]
  Resolver --> KV[(Key Vault)]
  AdminAPI --> Bus[[Event Bus]]
  API & AdminAPI --> OTEL[OpenTelemetry Exporter]
  OTEL --> Prom[Prometheus]
  OTEL --> AI[App Insights]
  Logs[Serilog/JSON] --> ELK[(ELK/App Insights Logs)]
  AdminAPI --> Audit[(Immutable Audit Store + Projections)]

📊 Metrics (names, labels, targets)

Service-level

  • ecs_resolve_requests_total{tenant,env,service}
  • ecs_resolve_latency_ms{quantile=50|95|99, tenant, env} → p95 ≤ 30 ms (hot), ≤ 150 ms (cold)
  • ecs_resolve_errors_total{code} → rate(…[5m]) < 0.5%
  • ecs_cache_hit_ratio{tenant} → ≥ 0.85
  • ecs_event_fanout_lag_ms{topic} → p95 < 2000 ms
  • ecs_stream_clients_gauge{node} (active watchers)
  • ecs_write_requests_total{op=create|update|rollback}
  • ecs_write_policy_violations_total{policyId, severity}

Infra

  • redis_ops_total{cmd}, redis_ttl_seconds_bucket
  • sql_query_latency_ms{query}
  • kv_request_latency_ms{op=getSecret} (sampled)
  • bus_publish_latency_ms{topic}

SLO rollups

  • slo_availability_ratio (read API) → ≥ 99.95%
  • slo_freshness_ratio (watchers receive change < 3s) → ≥ 95%

Store as Prometheus counters/histograms; mirror key KPIs into App Insights for product owners.


🧵 Tracing (OpenTelemetry)

  • Trace root: ResolveConfig (read) / WriteConfig (write)
  • Spans:
    • cache.get(prefix) / cache.set(prefix)
    • sql.query(hierarchy) / sql.upsert(bundle)
    • kv.resolve(secretRef) (redacted attributes)
    • bus.publish(ConfigChanged)
  • Trace context: trace_id, span_id, tenantId, env, serviceId, editionId, etag, version
  • Baggage (lightweight): request_class=hot|cold, cache_hit=true|false

Sampling

  • Default 10% for hot read path; 100% for errors/slow spans (p95+), 100% for writes.
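
In .NET these spans surface through System.Diagnostics, which OpenTelemetry exports; a sketch of the resolve root span with the attributes above (the source name and literal values are assumptions):

using System.Diagnostics;

var ecsTracing = new ActivitySource("ConnectSoft.Ecs");   // source name is an assumption

using (var activity = ecsTracing.StartActivity("ResolveConfig"))
{
    activity?.SetTag("tenantId", "t-001");
    activity?.SetTag("env", "prod");
    activity?.SetTag("serviceId", "billing-service");
    activity?.SetTag("etag", "W/\"d41d8c\"");
    activity?.SetTag("cache_hit", true);
    // Child spans (cache.get, sql.query, kv.resolve, bus.publish) nest under this root.
}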

🧾 Logging (structured & safe)

  • Format: JSON, Serilog.
  • Required fields: ts, level, message, traceId, spanId, tenantId, env, serviceId, actor, operation, status, durationMs
  • Redaction: never log secret values; mask values with "<redacted>"; log key paths and hashes only.
  • Error taxonomy: ValidationError (422), PolicyViolation (403), ConcurrencyConflict (409), UpstreamTimeout (502), InternalFailure (500).

Example (write path)

{
  "ts":"2025-08-21T18:25:43Z",
  "level":"Information",
  "message":"Config bundle published",
  "traceId":"3f1f…",
  "tenantId":"t-abc",
  "env":"prod",
  "operation":"PublishConfig",
  "bundleId":"b-77b2",
  "version":42,
  "keysTouched":128,
  "policyViolations":0,
  "etag":"W/\"d41d8c\""
}

🧑‍⚖️ Audit Trails (immutable & queryable)

  • What is audited: All writes (create/update/delete/rollback), policy decisions, and admin reads of sensitive scopes.
  • Record: auditId, actor, actorType(user|service), action, scope(global|edition|tenant|service|instance), beforeHash, afterHash, reason, ip, userAgent, traceId, ts.
  • Storage:
    • Append-only (e.g., Blob with immutability policy / WORM).
    • Projection into SQL table(s) for timelines and Studio queries.
  • Retention: 24 months (configurable by tenant plan).
  • Export: Signed CSV/JSON with evidence chain (hash of file published to audit ledger topic).

📺 Dashboards (Grafana/App Insights)

Operations (NOC)

  • Read latency (p50/p95/p99) by env & tenant
  • Error rate by route
  • Cache hit ratio
  • Event fan-out lag
  • Stream clients trend

SRE

  • DB query time heatmap; Redis saturation; Key Vault throttling; Bus publish latency
  • Top tenants by QPS; hotspot keys/prefixes
  • SLA/SLO burn-down (error budget)

Product/Compliance

  • Changes by tenant/edition over time
  • Policy violations & severities
  • Audit export status

🚨 Alerts (examples)

| Condition | Expression (PromQL-ish) | Action |
| --- | --- | --- |
| p95 resolve latency > 60 ms for 10 m | histogram_quantile(0.95, rate(ecs_resolve_latency_ms_bucket[10m])) > 60 | Page SRE (P2) |
| Error rate > 1% for 5 m | rate(ecs_resolve_errors_total[5m]) / rate(ecs_resolve_requests_total[5m]) > 0.01 | Page SRE (P2) |
| Cache hit < 70% for 15 m | avg_over_time(ecs_cache_hit_ratio[15m]) < 0.7 | Ticket + Slack (P3) |
| Event lag p95 > 5 s for 5 m | histogram_quantile(0.95, rate(ecs_event_fanout_lag_ms_bucket[5m])) > 5000 | Page on-call (P2) |
| Audit write failures > 0 | increase(ecs_audit_write_errors_total[5m]) > 0 | Page SRE (P1) |

🧪 Observability DoD (per feature)

  • OTEL spans present with correct attributes and parentage.
  • Metrics emitted with required labels (tenantId, env, serviceId).
  • Logs redact PII/secret content; include correlation IDs.
  • Audit entries created for all write operations with before/after hashes.
  • Dashboards panels updated; alert rules validated in staging chaos runs.

🛠 Runbook Snippets

Trace a slow read

  1. Locate alert → pick traceId.
  2. In traces, check spans: cache.get (miss?) → sql.query (slow?) → kv.resolve (throttled?).
  3. If cache misses spike, check invalidation storm from recent publish.
  4. Mitigate: raise TTL temporarily, throttle publishers, enable SWR.

Missing change in consumers

  1. Check ecs_event_fanout_lag_ms & outbox backlog.
  2. Verify subscriber count; look for DLQ on the topic.
  3. If lag high: scale Notifier/Streamer; apply backpressure config.

🔒 Privacy & Compliance

  • PII/Secrets: never stored in metrics/traces/logs; only hashes and key paths permitted.
  • Data residency: metrics/logs stored regionally per tenant policy where required.
  • Access: dashboards segmented by role; audit exports require elevated scope and MFA.

✅ Outcome

ECS now has operational telemetry, forensic visibility, and compliance-grade auditability. With dashboards and alerts, issues are caught early and traced precisely—supporting our SLOs and tenant commitments.


⚡ Eventing & Change Propagation

How ECS publishes configuration changes and guarantees safe, observable fan‑out across the factory.


🎯 Goals

  • Make every configuration change a first‑class event (CloudEvents) with tenant/edition scope.
  • Provide low‑latency fan‑out to thousands of services via Azure Service Bus (default) and Kafka (optional).
  • Guarantee at‑least‑once delivery with consumer idempotency, retries, DLQ, and replay.
  • Support versioned configurations (optimistic concurrency, ETags, semantic versions, snapshots, and deltas).
  • Preserve traceability: traceId, changeSetId, configVersion on every hop via OpenTelemetry.

🧩 Event Model

Event Envelope (CloudEvents 1.0)

{
  "specversion": "1.0",
  "type": "com.connectsoft.ecs.ConfigurationChanged",
  "source": "/ecs/tenants/{tenantId}/namespaces/{ns}/keys/{configKey}",
  "id": "evt_{changeSetId}",
  "time": "2025-08-21T11:25:43.123Z",
  "datacontenttype": "application/json",
  "subject": "tenant:{tenantId}|edition:{edition}|env:{environment}",
  "traceparent": "00-{traceId}-{spanId}-01",
  "dataschema": "https://schemas.connectsoft.dev/ecs/v1/events/configuration-changed.json",
  "data": {
    "tenantId": "t-001",
    "edition": "enterprise",
    "environment": "prod",
    "namespace": "payments",
    "configKey": "payment.retryPolicy",
    "changeSetId": "cset_7f2a7c",
    "configVersion": "2025.08.21+00023",
    "operation": "Upsert",     // Upsert | Delete | Patch
    "payloadType": "Delta",    // Snapshot | Delta
    "etag": "W/\"b9-1e9b\"",
    "checksum": "sha256:3be2e...",
    "previousVersion": "2025.08.21+00022",
    "changedBy": "user:alice@connectsoft.ai",
    "reason": "Increase retries for PSP flakiness",
    "ttlSeconds": 600
  }
}

Why CloudEvents? Uniform schema for REST webhooks, Service Bus topics, Kafka topics, and gRPC streaming — minimizing adapters.


🛰️ Transports & Channels

| Transport | ECS Role | Default Topology | Notes |
| --- | --- | --- | --- |
| Azure Service Bus | Primary | Topic ecs.config.changed.v1 → subscriptions per consumer group (service) | Delivery & retry semantics built-in; rules for tenant/edition filters |
| Kafka | Optional | Topic ecs.config.changed.v1 with tenant/edition partitions | High-throughput; consumer offset storage |
| gRPC Server Streaming | Optional | WatchConfig() per (tenantId, namespace, selector) | For in-cluster low-latency watchers |
| Webhooks | Optional | Signed HTTP POST with CloudEvents envelope | For third-party SaaS integrations |

Topic Partitioning

  • Service Bus: use subscriptions with correlation filters on tenantId, edition, namespace, configKey.
  • Kafka: partition by hash(tenantId|namespace) for locality; key = tenantId|namespace|configKey.

🔄 Publisher Flow (ECS)

sequenceDiagram
  participant Client as Studio/API Client
  participant ECS as ECS API
  participant OUTBOX as ECS Outbox
  participant BUS as Service Bus / Kafka
  participant OTel as OTEL Exporter

  Client->>ECS: PUT /configs/{tenant}/{ns}/{key} (If-Match: etag)
  ECS->>ECS: Validate, authorize, persist (version + etag)
  ECS->>OUTBOX: Append Outbox record (changeSetId, payload)
  ECS->>OTel: Span: ecs.config.write
  OUTBOX-->>BUS: Publish CloudEvent
  BUS-->>Consumers: Deliver event to all consumers (at-least-once)
  • Outbox pattern ensures atomic persistence + event publish (transactional outbox + background dispatcher).
  • Idempotent publish via (changeSetId, etag) de‑duplication at dispatcher.

🧠 Consumer Contract (SDK)

All ECS client SDKs (.NET first) expose a uniform consumption model:

public interface IEcsChangeSubscriber
{
    Task OnConfigurationChangedAsync(ConfigurationChangeEvent evt, CancellationToken ct);
}

// Usage:
await ecsSubscriber.SubscribeAsync(new EcsSubscription
{
    TenantId = "t-001",
    Namespace = "payments",
    Selector = "payment.*",             // glob / regex
    StartingVersion = "2025.08.21+00020"
});

Consumer Responsibilities (enforced by SDK defaults):

  • Idempotency: persist lastAppliedVersion per (tenantId, namespace, configKey); skip if evt.configVersion <= lastApplied (see the sketch after this list).
  • Transactional Apply: update local cache + notify live options atomically; on failure, retry with exponential backoff.
  • Rehydration: on startup, call GET /snapshot?...since=lastAppliedVersion, then process backlog from bus.
  • Backpressure: apply bounded queues; expose ecs_consumer_backlog_size.
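
A sketch of the idempotency rule above, assuming ConfigurationChangeEvent exposes the envelope fields from the event schema and using a hypothetical IVersionStore for lastAppliedVersion:

using System.Threading;
using System.Threading.Tasks;

// Hypothetical per-key version store backing the idempotency check.
public interface IVersionStore
{
    Task<string?> GetAsync(string key, CancellationToken ct);
    Task SetAsync(string key, string version, CancellationToken ct);
}

public sealed class IdempotentSubscriber : IEcsChangeSubscriber
{
    private readonly IVersionStore _store;

    public IdempotentSubscriber(IVersionStore store) => _store = store;

    public async Task OnConfigurationChangedAsync(ConfigurationChangeEvent evt, CancellationToken ct)
    {
        // Assumes the event exposes tenantId, namespace, configKey, configVersion.
        var key = $"{evt.TenantId}|{evt.Namespace}|{evt.ConfigKey}";
        var lastApplied = await _store.GetAsync(key, ct);

        // Monotonic versions such as "2025.08.21+00023" compare ordinally.
        if (lastApplied is not null && string.CompareOrdinal(evt.ConfigVersion, lastApplied) <= 0)
            return;                                     // duplicate or out-of-order: skip

        await ApplyToLocalCacheAsync(evt, ct);          // transactional apply (placeholder)
        await _store.SetAsync(key, evt.ConfigVersion, ct);
    }

    private static Task ApplyToLocalCacheAsync(ConfigurationChangeEvent evt, CancellationToken ct)
        => Task.CompletedTask;
}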

🧷 Delivery Semantics & Reliability

At‑Least‑Once Delivery

  • Service Bus: max delivery attempts (e.g., 10), dead‑letter after exhaustion.
  • Kafka: enable.auto.commit=false, commit offsets after successful apply.

Idempotency Patterns

| Layer | Mechanism |
| --- | --- |
| Publisher | Outbox de‑dup on (changeSetId, etag) |
| Transport | Deterministic key for partitioning |
| Consumer | lastAppliedVersion store + checksum verification |

Retries

  • Publisher: dispatcher retries with jitter, circuit‑breakers to BUS.
  • Consumer: SDK retries transient errors; poison events → DLQ with dead‑letter reason (checksum_mismatch, authorization_revoked, schema_incompatible).

Replay

  • Replay API: GET /events/stream?tenantId=...&fromVersion=...&toVersion=... (signed, time‑boxed).
  • Use cases: Disaster recovery, blue/green cutover, warm caches.

🧱 Versioning Strategy

| Concept | Purpose | Implementation |
| --- | --- | --- |
| ETag / If‑Match | Optimistic concurrency | W/"b9-1e9b" round‑trips on write |
| Monotonic Version | Ordering | yyyy.MM.dd+build (e.g., 2025.08.21+00023) |
| Semantic Version (optional) | Compatibility signals | 1.4.0 attached as configSemVer |
| Snapshots | Full state at point‑in‑time | GET /snapshot?tenantId&namespace |
| Deltas | Network efficiency | Event payloadType: Delta with patch ops (RFC 6902 JSON Patch) |
| Schema Evolution | Safe change | Avro/JSON Schema; dataschema URI versioned; compat checks at publish |

Server Rules

  • Breaking change detected? Require force=true&ack=... header + elevated RBAC role + audit record.
  • Automatically emit compat warning events when consumers with old consumerSchemaVersion are detected (telemetry handshake).

🧳 Change Types

  • ConfigurationChanged (delta/snapshot)
  • ConfigurationDeleted
  • NamespacePolicyChanged (RBAC/edition rules)
  • SecretRotated (metadata only; values are out‑of‑band via Vault references)
  • BulkConfigurationChanged (batch changes carry list of keys and a single changeSetId)

🔐 Security & Integrity (Event Path)

  • mTLS for gRPC; AMQP over TLS for Service Bus; SASL/SSL for Kafka.
  • Claims in event metadata: sub, roles, policyHash.
  • Signature: optional JWS detached signature of data for tamper detection (consumers validate against ECS JWKS).
  • PII/Sensitive: events never include secret values — only Key Vault references & hashes.

🧠 In‑Process Hot Reload (Options Pattern)

  • ECS SDK plugs into .NET IOptionsMonitor<T>, mapping config keys to strongly‑typed options.
  • When an event arrives, SDK rebinds options and triggers OnChange callbacks with debounce to prevent thrashing.
services.Configure<RetryPolicyOptions>(builder => builder
    .BindFromEcs("payments", "payment.retryPolicy"));

📊 Observability for Eventing

  • Spans: ecs.publish, ecs.dispatch, ecs.consume, ecs.apply.
  • Metrics:
    • ecs_events_published_total, ecs_events_consumed_total
    • ecs_consumer_lag_seconds (per service/tenant)
    • ecs_replay_requests_total
    • ecs_dropped_events_total (with reason)
  • Logs: structured with changeSetId, configVersion, tenantId, namespace, configKey.

🗺️ Routing & Filtering

Subscription Filters (Service Bus):

  • SQL filters on user.properties.tenantId, edition, namespace, configKey.
  • Rule examples:
      • tenantId = 't-001' AND namespace = 'payments'
      • edition IN ('enterprise','pro') AND configKey LIKE 'feature.%'

Kafka Consumers:

  • Apply a cheap predicate filter inside the consumer callback before the apply step (as in the consume loop sketched earlier).

🧰 Failure Scenarios & Handling

| Scenario | Behavior |
| --- | --- |
| Consumer offline | On startup → snapshot + replay from last offset/version |
| Schema mismatch | Send to DLQ with schema_incompatible; emit CompatibilityAlert |
| Long‑running apply | SDK offloads apply to a worker; ack only after commit; backpressure increases |
| Tenant revocation | ECS publishes AccessRevoked; SDK drops subscriptions & clears cache |

🧪 Test Matrix (Factory‑grade)

  • Contract tests: CloudEvents schema validation; signature verification.
  • Resilience tests: inject duplicates and out‑of‑order delivery; assert idempotent apply.
  • Load tests: 10k events/min, 3k consumers; measure consumer lag & mean apply latency.
  • Chaos: bus outages, partial partitions; verify snapshot + replay convergence.

🧱 Reference Diagrams

Fan‑Out Topology

flowchart LR
  ECS[ECS Outbox Dispatcher] -->|CloudEvents| ASB[(Service Bus Topic)]
  ECS -->|CloudEvents| KAFKA[(Kafka Topic)]
  ASB --> A1[Service A - payments]
  ASB --> A2[Service B - billing]
  KAFKA --> A3[Service C - gateway]
  A1 --> Cache1[(Local Config Cache)]
  A2 --> Cache2[(Local Config Cache)]
  A3 --> Cache3[(Local Config Cache)]
Hold "Alt" / "Option" to enable pan & zoom

Replay & Convergence

sequenceDiagram
  participant Svc as Service Consumer
  participant ECS as ECS API
  participant Bus as Bus Topic

  Svc->>ECS: GET /snapshot?since=2025.08.21+00020
  ECS-->>Svc: Snapshot (Version=...22)
  Svc->>Bus: Resume subscription (offset/version=...22)
  Bus-->>Svc: Events (...23, ...24)
  Svc->>Svc: Apply in order, idempotent
Hold "Alt" / "Option" to enable pan & zoom

📦 Deliverables

  • ECS Outbox Dispatcher (.NET Worker) with ASB/Kafka providers.
  • CloudEvents Contracts + JSON Schema v1 (configuration-changed.json etc.).
  • .NET SDK: SubscribeAsync, IOptionsMonitor binding, idempotent store, snapshot+replay.
  • gRPC Watch Service: WatchConfig(WatchRequest) -> stream ConfigurationChangeEvent.
  • Ops Dashboards: consumer lag, publish rates, DLQ drilldowns.
  • Conformance Tests: duplication, ordering, replay convergence.

🧱 Backlog → Azure DevOps (Epics/Features/Tasks)

Epic: ECS Eventing Backbone (ASB/Kafka)

  • Feature: Outbox storage + dispatcher
      • Task: Outbox table & transactional write (NHibernate)
      • Task: Dispatcher with de‑dup by (changeSetId, etag)
      • Task: Retry policy & DLQ integration
  • Feature: CloudEvents contracts & validators
      • Task: JSON Schema & contract tests
      • Task: JWS signing & JWKS rotation
  • Feature: Service Bus topology as code
      • Task: IaC for topic, subscriptions, filters
      • Task: SLOs & alarms (publish/consume error rate)

Epic: ECS Consumer SDK (.NET)

  • Feature: Subscription API + Options binding
      • Task: IEcsChangeSubscriber + host extensions
      • Task: Idempotency store (pluggable: memory/redis/sql)
      • Task: Snapshot & replay client
  • Feature: Observability hooks
      • Task: OTEL spans & metrics
      • Task: Structured logs with changeSetId

Epic: gRPC Watch & Webhooks

  • Feature: WatchConfig streaming service
      • Task: mTLS, auth interceptors, backpressure
  • Feature: Webhook sender
      • Task: HMAC/JWS signing, retry with exponential backoff

Epic: Testing & Chaos

  • Feature: Contract & resilience suite
      • Task: Duplicate/out‑of‑order injection tests
      • Task: Replay convergence E2E
      • Task: Load test harness & baseline SLOs

✅ Acceptance Criteria

  • A config write results in a CloudEvent published to the bus within 250 ms at p50.
  • Consumers that are offline converge using snapshot + replay with zero drift (checksums match).
  • At‑least‑once guaranteed; duplicates handled without side effects (idempotent apply proven in tests).
  • Versioning: a monotonic configVersion increase is enforced; ETag required on updates.
  • Telemetry shows publish/consume/lag metrics per tenant and service; DLQ has actionable reasons.

🔭 Notes & Next Steps

  • Align policy‑driven overrides with event filters (tenant/edition rules → subscription rules).
  • Add Config Rollback Event (ConfigurationRolledBack) with automatic replay to target version.
  • Prepare multi‑region event replication plan (ASB geo‑disaster recovery / Kafka MirrorMaker).

ECS becomes a reactive, version‑aware, and verifiably reliable configuration backbone for the entire ConnectSoft ecosystem.


🚀 Caching & Performance

🎯 Goals

Design a caching and performance strategy that delivers sub‑30ms p95 resolve latency at factory scale, while preserving consistency, tenancy isolation, and deterministic resolution.

  • In‑memory + distributed cache (Redis) with precise invalidation
  • Snapshot lifecycle for fast cold‑start and replay
  • Scalable read path (high QPS) and controlled write path (safe propagation)

🧱 Cache Architecture (multi‑tier)

| Tier | Scope | What it stores | TTL / Consistency | Purpose |
| --- | --- | --- | --- | --- |
| L0 – Process cache | Per API pod | Resolved blobs by (tenantId, env, namespace, selector) + ETag | Short TTL (e.g., 3–10 s) + SWR | Nanosecond access for hot prefixes |
| L1 – Redis | Cluster‑wide | Resolved blobs + prefix indexes; also recent snapshots | TTL 30–120 s (per key) + explicit invalidation | Cross‑pod sharing, cuts DB pressure |
| L2 – DB | SQL/Cosmos | Authoritative bundles/versions | Strong read | Source of truth for cache misses |

Key prefixes

ecs:v1:resolved:{tenant}:{env}:{ns}:{selector} -> {etag, version, json}
ecs:v1:index:{tenant}:{env}:{ns}              -> list of keys / etags
ecs:v1:snapshot:{tenant}:{env}:{ns}:{ver}     -> frozen state (blob id/ref)

🔁 Read Path (hot/cold)

sequenceDiagram
  participant S as Service SDK
  participant API as ECS API
  participant L0 as L0 Cache (memory)
  participant R as Redis (L1)
  participant DB as Store (SQL/Cosmos)
  S->>API: Resolve(tenant, env, ns, selector, If-None-Match)
  API->>L0: TryGet(etag)
  alt Hit
    L0-->>API: Resolved + etag
  else Miss
    API->>R: GET ecs:v1:resolved:...
    alt Hit
      R-->>API: Resolved + etag
      API->>L0: Put
    else Miss
      API->>DB: Query bundles (G→E→T→S→tags)
      API->>R: Set resolved + index
      API->>L0: Put
    end
  end
  API-->>S: 200/304 with etag
Hold "Alt" / "Option" to enable pan & zoom

Techniques

  • ETag + 304 to avoid payload transfer
  • SWR (stale‑while‑revalidate): serve slightly stale value while refreshing in background
  • Selector compaction: normalize selectors (e.g., sorted tags) to maximize hit rate
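
A sketch of selector compaction, assuming selectors are comma‑separated tag lists; normalizing (trim, lower‑case, de‑duplicate, sort) before composing the cache key means equivalent selectors land on the same entry:

using System;
using System.Linq;

public static class CacheKeys
{
    // "Region=EU, tier=gold" and "tier=gold,region=eu" map to one cache entry.
    public static string NormalizeSelector(string selector) =>
        string.Join(",",
            selector.Split(',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
                    .Select(t => t.ToLowerInvariant())
                    .Distinct()
                    .OrderBy(t => t, StringComparer.Ordinal));

    public static string Resolved(string tenant, string env, string ns, string selector) =>
        $"ecs:v1:resolved:{tenant}:{env}:{ns}:{NormalizeSelector(selector)}";
}

// CacheKeys.Resolved("t-001", "prod", "payments", "Tier=Gold, region=eu")
//   -> "ecs:v1:resolved:t-001:prod:payments:region=eu,tier=gold"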

🧨 Invalidation & Coherency

On publish:

  1. The write transaction commits the new version.
  2. Outbox → Event Bus emits ConfigurationChanged.
  3. The cache invalidator:
      • Evicts ecs:v1:resolved:* keys whose prefix intersects the changed key/namespace.
      • Bumps the namespace index to force L0 recalculation on the next read.

Precision rules

  • Maintain key→prefix map (stored in Redis) to target minimal eviction set
  • Protect against stampedes with singleflight locks per key
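
A minimal singleflight sketch (an implementation assumption, not the shipped worker): concurrent misses for the same key share one in‑flight resolution, so a publish‑triggered eviction cannot stampede the DB:

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class SingleFlight<T>
{
    private readonly ConcurrentDictionary<string, Lazy<Task<T>>> _inFlight = new();

    public async Task<T> RunAsync(string key, Func<Task<T>> resolve)
    {
        // Every concurrent caller for `key` observes the same Lazy, hence one resolve.
        var lazy = _inFlight.GetOrAdd(key,
            _ => new Lazy<Task<T>>(resolve, LazyThreadSafetyMode.ExecutionAndPublication));
        try
        {
            return await lazy.Value;
        }
        finally
        {
            // Remove after completion so a later miss triggers a fresh resolve.
            _inFlight.TryRemove(key, out _);
        }
    }
}

// Usage: await singleFlight.RunAsync(cacheKey, () => ResolveFromStoreAsync(cacheKey));
// ResolveFromStoreAsync is a hypothetical L2/DB resolver.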

🧊 Snapshot Lifecycle

| Stage | Detail |
| --- | --- |
| Create | On schedule (e.g., hourly) or on demand; compute resolved state for (tenant, env, ns); store in Blob; index pointer in Redis ecs:v1:snapshot:* |
| Use | Cold start or replay → fetch latest snapshot pointer, hydrate L1/L0 quickly |
| Rotate | Keep last N (e.g., 72 hourly) per scope; delete older (configurable per edition) |
| Validate | Checksum the snapshot vs live resolution to detect drift |
| Promote | During incidents, mark a snapshot as fallback; the read path returns the snapshot while DB/Bus are degraded |

API (read):

  • GET /snapshot?tenantId&env&ns[&version] → blob ref + checksum
  • SDK can request snapshot on boot, then subscribe to events
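
A boot‑sequence sketch combining both bullets, assuming the SubscribeAsync surface shown earlier plus hypothetical IEcsSnapshotClient / IEcsSubscriptionClient abstractions; the service hydrates from the snapshot first, then subscribes from the snapshot's version so no change is missed in between:

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

public sealed class ConfigBootstrapper : IHostedService
{
    private readonly IEcsSnapshotClient _snapshots;       // hypothetical snapshot client
    private readonly IEcsSubscriptionClient _subscriber;  // hypothetical subscription client

    public ConfigBootstrapper(IEcsSnapshotClient snapshots, IEcsSubscriptionClient subscriber)
        => (_snapshots, _subscriber) = (snapshots, subscriber);

    public async Task StartAsync(CancellationToken ct)
    {
        // 1. Hydrate the local cache from the latest snapshot.
        var snap = await _snapshots.GetLatestAsync("t-001", "prod", "payments", ct);
        await _snapshots.HydrateLocalCacheAsync(snap, ct);

        // 2. Subscribe to live events starting at the snapshot's version (no gap).
        await _subscriber.SubscribeAsync(new EcsSubscription
        {
            TenantId = "t-001",
            Namespace = "payments",
            Selector = "payment.*",
            StartingVersion = snap.Version
        });
    }

    public Task StopAsync(CancellationToken ct) => Task.CompletedTask;
}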

⚖️ Scaling Reads vs Writes

Reads (dominant):

  • Scale API pods horizontally (HPA on RPS/latency)
  • Redis clustering (hash slot spreading on tenant|env|ns)
  • Hot‑tenant sharding: per‑tenant Redis db/index when necessary
  • Compression (LZ4) for large resolved blobs to reduce network

Writes (controlled):

  • Gate through Admin API with If‑Match and idempotency keys
  • Serialize heavy publish waves per namespace (per‑prefix semaphore)
  • Outbox throughput scaling with workers; batch invalidations

🧪 Performance Targets

  • p95 resolve latency: ≤ 30 ms (hot cache), ≤ 150 ms (cold)
  • Cache hit ratio: ≥ 85% overall; ≥ 95% for top 20 namespaces
  • Publish → visible (watchers): ≤ 3 s p95
  • Redis ops: p50 < 2 ms; no timeouts up to the 99.9th percentile
  • DB read QPS reduction: > 90% vs no‑cache baseline

🧰 Protection & Resilience

  • Thundering herd guard: per‑key singleflight + jittered backoff
  • Adaptive TTLs: increase TTL when bus lag detected to protect DB
  • Read‑degrade mode: serve last known snapshot on DB outage window
  • Rate limiting: per‑tenant read QPS caps; publish rate caps
  • Circuit breakers: around Redis/DB; fallback chain L0→snapshot

👩‍🔧 Tuning Playbook

  • Low hit ratio?
      • Check selector cardinality; introduce prefix bucketing or coarser namespaces
      • Enable response compression for large maps
  • High Redis CPU?
      • Increase sharding; switch to key hashing on tenant|ns
      • Raise the L0 TTL modestly (5 → 10 s) for the hottest endpoints
  • Stampedes after publish?
      • Stagger invalidation; use incremental delta apply for the L0 refresh
      • Limit concurrent cold resolvers with a token bucket

🗺️ Diagram — Cache & Snapshot Topology

flowchart LR
  API[Config API]:::svc -- L0 get/set --> L0[(Process Cache)]
  API -- get/set --> R[(Redis L1 Cluster)]
  API -- miss --> DB[(SQL/Cosmos)]
  Snap[Snapshotter]:::wrk -- resolved->blob --> Blob[(Snapshots)]
  Snap -- indexes --> R
  Pub[Publisher]:::wrk -- events --> Bus[(Event Bus)]
  Inv[Invalidator]:::wrk -- targeted evict --> R
  R -- hydrate --> L0

  classDef svc fill:#E0F2FE,stroke:#38BDF8;
  classDef wrk fill:#FFE4E6,stroke:#FB7185;
Hold "Alt" / "Option" to enable pan & zoom

✅ Acceptance Criteria

  • AC‑1: Writes invalidate only affected prefixes; unrelated namespaces retain ≥ 95% hit ratio.
  • AC‑2: Under 5k RPS sustained, p95 read latency ≤ 30 ms (hot), error rate < 0.5%.
  • AC‑3: During DB outage drills (5 min), API serves from snapshots with no 5xx spikes, and emits degraded mode metric.
  • AC‑4: Snapshot/restore produces byte‑identical resolved maps to live resolution for the same version.
  • AC‑5: Cache stampede tests confirm singleflight works (no more than 3 concurrent DB hits for the same key).

🔧 Backlog → Azure DevOps

Epic: L0/L1 Cache & Invalidation

  • Feature: L0 in‑proc cache + SWR
  • Feature: Redis schema & prefix strategy
  • Feature: Precision invalidation worker
  • Tasks: key maps, singleflight, compression, TTL policy

Epic: Snapshot Lifecycle

  • Feature: Snapshotter worker & checksums
  • Feature: Fallback mode & API
  • Tasks: Blob storage model, retention policy, drift checker

Epic: Performance & Scale

  • Load test harness (Locust/K6)
  • Cache hit & latency dashboards
  • Auto‑tuning hooks (adaptive TTL, backoff)

📌 Notes

  • Consider optional edge cache (sidecar or local Redis) for ultra‑low latency per node.
  • For very large tenants, introduce namespace partitioning and hierarchical snapshots (per sub‑namespace).

ECS’s read path becomes fast, predictable, and resilient, and the write path safely fans out with precise cache coherence.


🛡️ Resiliency & Fault Tolerance

Goal: ensure ECS continues to serve safe, correct, and timely configuration under partial failures, regional incidents, and dependency degradation — without violating tenant isolation, security policies, or observability guarantees.


🎯 Objectives

  • Survive transient and regional faults with graceful degradation (read-mostly mode).
  • Protect dependent services via bounded retries, circuit breakers, bulkheads.
  • Fail fast on unsafe paths; serve stale-but-safe configs when allowed by policy.
  • Prove resilience with chaos drills, SLAs/SLOs, and automated failover runbooks.

🧨 Failure Model (ECS)

| Surface | Typical Faults | Primary Mitigations |
| --- | --- | --- |
| Read path (GET config) | Redis miss, origin DB latency, network partitions | Local snapshot, Redis cluster, hedged reads, timeouts |
| Write path (PUT/PATCH) | DB leader loss, quorum fail, version conflict | Optimistic concurrency, idempotent writes, queue-backed commit |
| Eventing (change notifications) | Service Bus/Kafka outage, consumer lag | Outbox + retry, backfill from ledger, idempotent subscribers |
| AuthN/Z | IdP outage, token validation latency | Token cache, STS fallback keys, mTLS pinning |
| Secrets | Key Vault throttling | Per-tenant secret cache, exponential backoff, jitter |
| Region | AZ/region failure | Active–active reads, active–standby writes, DNS/AFD failover |

🧱 Resilience Architecture

flowchart LR
  Client((Service)) -->|gRPC/REST| Edge[ECS API Gateway]
  Edge --> CB{Circuit\nBreaker}
  CB --> L1[(In-Proc Snapshot Cache)]
  L1 -->|miss| L2[(Redis Cluster)]
  L2 -->|miss| Origin[(Cosmos DB / Postgres Primary)]
  Origin --> Ledger[(Change Ledger / Event Outbox)]
  Ledger --> Bus[(Service Bus / Kafka)]
  Bus --> Subscribers[[Microservices/Agents]]

  subgraph Regional Pair
    Origin---OriginReplica[(Geo-Replicated Read)]
    Redis[(Redis)]---RedisReplica[(Geo-Replica)]
  end
Hold "Alt" / "Option" to enable pan & zoom
  • Reads: L1 (in-proc) → L2 (Redis) → Origin (Cosmos/PG). Serve stale snapshot within TTL if origin slow.
  • Writes: Single-writer per partition (tenant/namespace). Outbox pattern persists change, publishes event.
  • Change propagation: At-least-once from outbox → bus; consumers are idempotent (version vector, ETag).

🗂️ Fallback & Degradation Policies

| Scenario | ECS Behavior | Notes |
| --- | --- | --- |
| Origin slow (> p95 threshold) | Stale-OK read from Redis or local snapshot if policy.allowStale=true and TTL valid | Emit degraded_mode=true metric, add X-Config-Stale-Age |
| Redis unavailable | Skip L2, fall back to local snapshot; increase hedged reads to origin | Shorter timeouts to avoid threadpool exhaustion |
| Bus outage | Writes commit to origin + outbox table; background publisher replays when the bus recovers | Consumers dedupe by (tenantId, key, version) |
| Tenant-scoped incident | Trip per-tenant breaker; serve last good tenant snapshot; block writes for tenant | Prevents blast radius |
| Global auth outage | Honor cached token validations (bounded TTL); keep mTLS; reduce JWK rotations temporarily | Never bypass RBAC |

⚙️ .NET Resilience Profile (Polly)

services.AddHttpClient("EcsOrigin")
    // Outermost: retry transient faults with exponential backoff + jitter.
    .AddPolicyHandler(Policy<HttpResponseMessage>.Handle<Exception>()
        .WaitAndRetryAsync(3, retry => TimeSpan.FromMilliseconds(50 * Math.Pow(2, retry)) + Jitter()))
    // Breaker opens after 5 consecutive failures and stays open for 30 s.
    .AddPolicyHandler(Policy<HttpResponseMessage>.Handle<Exception>()
        .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)))
    // Innermost: 300 ms timeout per attempt, so retries are not starved by one slow call.
    .AddPolicyHandler(Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromMilliseconds(300)));

static TimeSpan Jitter() => TimeSpan.FromMilliseconds(Random.Shared.Next(0, 40));

  • Bulkheads per endpoint to cap concurrent origin calls.
  • Hedged requests (optional): fire a second read to a replica once the p95 latency elapses (sketch below).
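
A hedging sketch under stated assumptions: primary and replica are caller-supplied read delegates, and hedgeAfter stands in for the observed p95; the first successful response wins and the loser is cancelled:

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class HedgedReads
{
    public static async Task<HttpResponseMessage> HedgedReadAsync(
        Func<CancellationToken, Task<HttpResponseMessage>> primary,
        Func<CancellationToken, Task<HttpResponseMessage>> replica,
        TimeSpan hedgeAfter,                  // e.g., observed p95 latency
        CancellationToken ct)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);

        var first = primary(cts.Token);
        var completed = await Task.WhenAny(first, Task.Delay(hedgeAfter, cts.Token));
        if (completed == first && first.IsCompletedSuccessfully)
        {
            cts.Cancel();                     // fast path: no hedge needed
            return await first;
        }

        // Primary is slow (or failed): fire the hedge, take whichever finishes first.
        var second = replica(cts.Token);
        var winner = await Task.WhenAny(first, second);
        cts.Cancel();                         // cancel the losing request
        return await winner;                  // sketch: a faulted winner rethrows here
    }
}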

🧾 Config Snapshot Lifecycle

  1. Acquire: On successful read from origin, build ConfigSnapshot { tenantId, ns, version, etag, data, capturedAt }.
  2. Store: L1 (memory) + L2 (Redis, key: cfg:{tenant}:{ns}:{version}) with TTL & size guard.
  3. Serve: Prefer latest version; if origin fails, use highest valid within policy TTL.
  4. Invalidate: On change event → purge L1, update L2; add etag to prevent stale overwrite.
  5. Audit: Record snapshot usage (fresh vs stale) with traceId, tenantId.
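
A sketch of the snapshot type from step 1 and the serve rule from step 3; IsUsableUnder is a hypothetical policy check (the highest version wins while inside the stale TTL):

using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ConfigSnapshot(
    string TenantId,
    string Ns,
    string Version,        // monotonic, e.g. 2025.08.21+00023
    string ETag,
    IReadOnlyDictionary<string, string> Data,
    DateTimeOffset CapturedAt)
{
    // Usable as a stale fallback only while inside the policy TTL.
    public bool IsUsableUnder(TimeSpan staleTtl, DateTimeOffset now) =>
        now - CapturedAt <= staleTtl;
}

public static class SnapshotPolicy
{
    // Serve rule: prefer the highest usable version when the origin fails.
    public static ConfigSnapshot? PickFallback(
        IEnumerable<ConfigSnapshot> candidates, TimeSpan staleTtl, DateTimeOffset now) =>
        candidates.Where(s => s.IsUsableUnder(staleTtl, now))
                  .OrderByDescending(s => s.Version, StringComparer.Ordinal)
                  .FirstOrDefault();
}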

🔁 Idempotency & Versioning

  • Writes: Require If-Match: <ETag> or version; on conflict → 409 with the latest pointer (see the write sketch after this list).
  • Events: Include (aggregateId, version); consumers ignore if already applied.
  • Replay: On recovery, publisher replays outbox by createdAt and notified=false.
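
A client-side sketch of the conditional write, assuming the Admin API accepts If-Match and returns the current ETag on a 409; the route and the conflict exception type are illustrative:

using System;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public sealed class ConcurrencyConflictException : Exception
{
    public string LatestEtag { get; }
    public ConcurrencyConflictException(string latestEtag) => LatestEtag = latestEtag;
}

public static class EcsAdminClient
{
    public static async Task<string> PutConfigAsync(
        HttpClient http, string tenant, string ns, string key,
        string json, string currentEtag, CancellationToken ct)
    {
        var req = new HttpRequestMessage(HttpMethod.Put, $"/api/v1/config/{tenant}/{ns}/{key}")
        {
            Content = new StringContent(json, Encoding.UTF8, "application/json")
        };
        req.Headers.TryAddWithoutValidation("If-Match", currentEtag);

        var resp = await http.SendAsync(req, ct);
        if (resp.StatusCode == HttpStatusCode.Conflict)
        {
            // Lost the race: re-read, rebase the change, retry with the new ETag.
            var latest = resp.Headers.ETag?.Tag
                ?? throw new InvalidOperationException("409 without an ETag pointer");
            throw new ConcurrencyConflictException(latest);
        }
        resp.EnsureSuccessStatusCode();
        return resp.Headers.ETag!.Tag; // the new version's ETag, for the next write
    }
}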

🧪 Chaos Testing Scenarios

| Area | Fault Injection | Expected Outcome |
| --- | --- | --- |
| Cache | Kill Redis primary; network partition | Stale reads from L1; perf dip within SLO; no 5xx spikes |
| Origin | 500/timeout storm; leader failover | Stale-OK reads; write breaker trips → read-only mode banners |
| Bus | Drop topic; throttle | Outbox backlog grows; no lost events; catch-up within RTO |
| Auth | JWK endpoint down | Cached keys honored; no auth bypass |
| Region | Simulated regional failure | Traffic manager shifts to secondary reads; write drain & promote within RTO |

Schedule GameDays monthly; record hypotheses, metrics, and remediations.


📈 SLOs & SLA Envelope (Proposed)

| Dimension | Target | Notes |
| --- | --- | --- |
| Availability (reads) | 99.99% monthly | With stale-OK serving |
| Availability (writes) | 99.9% monthly | May block during conflict/region failover |
| p95 read latency | ≤ 20 ms (cache hit), ≤ 120 ms (origin) | Per tenant |
| Event propagation | ≤ 3 s p95 | From commit to first consumer delivery |
| RPO | ≤ 30 s | Config ledger + cross-region replication |
| RTO (regional) | ≤ 15 min | Promote secondary; re-point writers |
| Error budget (reads) | 4 m 22 s / month | Tracked in SRE dashboard |

Contracts: SLA doc states credit policy if monthly availability below target; per-tenant SLOs are observable (dashboards & reports).


🌍 Failover Architecture

  • Reads: Active–Active (multi-region replicas for origin + Redis replica); DNS/AFD/Envoy locality.
  • Writes: Active–Standby per partition (tenant/namespace). Single writer enforced by lease (Cosmos) or advisory lock (PG).
  • Data:
      • Cosmos DB: multi-region write (optional) with conflict resolver = highest version.
      • PostgreSQL: logical replication; promote with pg_auto_failover; ensure write fences during switchover.
  • Secrets: Key Vault geo-redundant, soft-delete + purge protection; clients cache secrets with TTL.

sequenceDiagram
  participant Client
  participant ECS
  participant Redis_Primary
  participant DB_Primary
  participant DB_Secondary

  Client->>ECS: GET /config
  ECS->>Redis_Primary: TryGet
  alt miss/timeout
    ECS->>DB_Primary: Read
    DB_Primary--xECS: timeout
    ECS->>DB_Secondary: Hedged Read
    ECS-->>Client: 200 (stale-ok), X-Config-Stale-Age
  else hit
    ECS-->>Client: 200 (fresh)
  end
Hold "Alt" / "Option" to enable pan & zoom

🧩 Read/Write Mode Matrix

| Mode | Reads | Writes | Trigger | Recovery Signal |
| --- | --- | --- | --- | --- |
| Normal | Fresh | Allowed | Healthy deps | N/A |
| Degraded | Stale-OK | Allowed | Origin latency > threshold | Latency normalizes |
| Read-Only | Stale-OK | Blocked | Origin unavailable; version conflicts | DB healthy + catch-up complete |
| Failover | Fresh via secondary | Allowed after promote | Region incident | Health gates + leader elected |

🔔 Telemetry & Alerts (Resilience Signals)

  • Counters: config_stale_served_total{tenantId}, outbox_backlog_size, breaker_open_total{scope}.
  • Gauges: snapshot_age_seconds, publish_lag_seconds.
  • SLO: burn-rate alerts 2%/1h and 5%/6h on read availability.
  • Events: DegradedModeEntered, ReadOnlyModeEntered, FailoverStarted, FailoverCompleted.

🧰 Ops Runbooks (abridged)

  1. Degraded Mode: identify hotspot tenants → increase per-tenant TTL; ensure Redis is healthy; verify breaker state.
  2. Write Conflicts: inspect conflicting keys; return 409 with the latest ETag; advise clients to retry with backoff.
  3. Bus Backlog: scale publishers/partitions; verify outbox replay; validate consumer idempotency.
  4. Regional Failover: freeze writers; promote the secondary; run consistency checks; unfreeze by partition.

🔐 Safety & Compliance

  • Stale reads respect edition/tenant policy overlays; never cross-tenant.
  • No secret material in snapshots; secrets are references resolved at call time with cache TTL.
  • All degraded/RO decisions are audited (who, why, since).

✅ Deliverables

  • Resilience ADRs: cache-first reads, outbox + at-least-once, active–standby writes.
  • .NET resilience profile (Polly) and configuration schema: resilience: { allowStale: true, staleTtlSec: 120, readTimeoutMs: 300, retries: 3, breaker: { failCount: 5, breakSec: 30 } } (binding sketch after this list)
  • Chaos plan & scripts (fault maps), SLO dashboards, runbooks.
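
A sketch of binding that schema to strongly typed options; the "Resilience" section name and the startup wiring shape are assumptions:

public sealed class ResilienceOptions
{
    public bool AllowStale { get; set; } = true;
    public int StaleTtlSec { get; set; } = 120;
    public int ReadTimeoutMs { get; set; } = 300;
    public int Retries { get; set; } = 3;
    public BreakerOptions Breaker { get; set; } = new();

    public sealed class BreakerOptions
    {
        public int FailCount { get; set; } = 5;
        public int BreakSec { get; set; } = 30;
    }
}

// Startup wiring (assumed section name):
// builder.Services.Configure<ResilienceOptions>(builder.Configuration.GetSection("Resilience"));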

🔜 Epics / Azure DevOps (seed)

  • ECS-RES-01: Implement snapshot cache (L1/L2) with TTL & audit.
  • ECS-RES-02: Outbox publisher + idempotent consumer SDK.
  • ECS-RES-03: Circuit breaker/bulkhead/hedged reads middleware.
  • ECS-RES-04: Degraded/Read-only mode controller + health endpoints.
  • ECS-RES-05: Chaos suite & monthly GameDay pipeline.
  • ECS-RES-06: Multi-region failover automation + runbooks.
  • ECS-RES-07: SLO burn-rate alerts & resilience dashboard.

With these guards, ECS remains predictable under pressure, protecting downstream services while providing observability and control to ops and agents.


🔄 Deployment, CI/CD & DevOps Enablement

🎯 Goals

Provide a repeatable, secure, observable path to deliver ECS across dev → test → staging → prod, with immutable packaging, automated rollouts/rollbacks, and zero‑downtime upgrades.


📦 Packaging (Artifacts & IaC)

| Artifact Type | Tooling | Purpose | Notes |
| --- | --- | --- | --- |
| Container Image | Dockerfile → ACR | Runtime unit for API/Workers | Signed, SBOM attached, Trivy-scanned |
| K8s Charts | Helm | ECS components (API, Streamer, Workers) | Values overlays per env/tenant |
| Azure Infra | Bicep / Terraform / Pulumi | AKS, ACR, Key Vault, SQL/PG, Redis, Service Bus | Policy-as-code (Azure Policy/OPA) |
| Migrations | EF Core / Flyway | DB schema & data migrations | Forward-only + rollback plan |
| Release Bundle | .tar.gz | Helm chart + values + migration scripts + release notes | SemVer tag (e.g., ecs-1.6.3) |

Helm values overlays

/deploy/helm/values-dev.yaml
/deploy/helm/values-test.yaml
/deploy/helm/values-staging.yaml
/deploy/helm/values-prod.yaml

🔁 Dev/Test/Prod Rollout

flowchart TD
  A[Commit: src + IaC] --> B[CI: Build & Scan]
  B --> C[Push: ACR + Chart Repo]
  C --> D[CD: Dev Deploy]
  D --> E[Smoke + Contract Tests]
  E --> F[Test/Staging Deploy]
  F --> G[Perf & Chaos Gates]
  G --> H[Manual Approval]
  H --> I[Prod: Blue/Green or Canary]
  I --> J[Post‑deploy Verification + Auto Rollback if failing]
  J --> K[Tag + Audit + Release Notes]
Hold "Alt" / "Option" to enable pan & zoom

Rollout strategies

  • Blue/Green for API pods (Envoy/AGIC switch on health).
  • Canary (e.g., 10% → 50% → 100%) guarded by SLO checks (p95 latency, error rate, consumer lag).
  • KEDA scales workers by outbox depth / schedule (snapshotter).

🧬 Versioning & Migration Automation

Semantic Versioning

  • MAJOR.MINOR.PATCH for ECS; backward‑compatible APIs within MINOR.
  • Configuration schema versions tracked in repo (JSON Schema URIs).

DB Migrations

  • EF Core/Flyway run as pre-install/pre-upgrade Helm hooks on a dedicated migration job (annotation "helm.sh/hook": pre-install,pre-upgrade).
  • Idempotent; write a migration ledger to SQL.
  • On failure: abort the upgrade, auto-rollback to the previous chart, publish an incident event.

Config Schema Evolution

  • Contracts validated at publish time; compatibility checks (breaking change requires force=true + elevated RBAC).
  • Data backfills via worker job (post‑upgrade hook), observable via ecs_backfill_pending.

Zero‑Downtime Policy

  • API pods roll with maxUnavailable=0.
  • Sticky read cache retained; consumers use ETag + watchers to avoid reload storms.

🧪 CI/CD Pipelines (Azure DevOps YAML — excerpt)

stages:
- stage: CI
  jobs:
  - job: build
    steps:
    - task: Docker@2
      inputs: { command: buildAndPush, repository: ecs/api, tags: $(Build.BuildNumber) }
    - task: HelmInstaller@1
    - script: helm lint deploy/helm/ecs
    - task: TrivyScan@1
    - task: PublishBuildArtifacts@1

- stage: CD_Dev
  dependsOn: CI
  jobs:
  - deployment: dev
    environment: ecs-dev
    strategy:
      runOnce:
        deploy:
          steps:
          - script: helm upgrade --install ecs deploy/helm/ecs -f deploy/helm/values-dev.yaml --set image.tag=$(Build.BuildNumber)
          - script: ./ops/smoke.sh https://ecs-dev.internal

- stage: CD_Prod
  dependsOn: CD_Staging
  jobs:
  - deployment: prod
    environment: ecs-prod   # manual approval is configured on this environment
    strategy:
      canary:
        increments: [10, 50, 100]
        deploy:
          steps:
          - script: helm upgrade --install ecs ...
          - script: ./ops/verify-slo.sh --latency-p95 30 --errors 0.5
        on:
          failure:
            steps:
            - script: ./ops/rollback.sh

🔒 Security & Policy Gates (shift‑left)

  • Image & chart scanning (Trivy/Grype) — block on HIGH/CRITICAL.
  • IaC checks (Checkov/OPA) — deny insecure networking, public KV, missing TLS.
  • Secrets: all values sourced from Key Vault references; no secrets in values files.
  • Sign & verify: Cosign images; Helm chart provenance (.prov).

📊 Release Observability

  • Pipeline emits deployment events with traceId, version, change set.
  • Golden signals checked during canary: ecs_resolve_latency_ms p95, 5xx rate, ecs_event_fanout_lag_ms.
  • Auto-rollback if thresholds are breached over a 5–10 minute window.
  • Release notes generated from commits + PR labels (features/fixes/breaking).

🗂️ Repo & Environment Layout (suggested)

/src/ecs-api
/src/ecs-workers
/sdk/dotnet
/deploy/helm/ecs
/deploy/bicep|tf|pulumi
/ops/scripts (smoke, verify-slo, rollback, snapshot-restore)
/migrations (db, schema)
/docs/hld

✅ Acceptance Criteria

  • AC‑1: One‑button pipeline promotes dev → test → staging → prod with signed artifacts and policy gates.
  • AC‑2: Zero‑downtime upgrade verified (no 5xx spikes > 0.5% and p95 ≤ targets during rollout).
  • AC‑3: Failed canary auto‑rolls back to last healthy release; audit entries recorded.
  • AC‑4: Migrations are idempotent, logged, and observable; rollback plan documented and tested.
  • AC‑5: All secrets consumed via Key Vault references; pipeline blocks on any hardcoded secret detection.

🔧 Backlog → Azure DevOps

Epic: Packaging & IaC

  • Feature: Helm chart + env overlays
  • Feature: Bicep/Terraform/Pulumi modules for AKS, Redis, SQL/PG, ASB, KV
  • Task: Chart provenance + image signing

Epic: Pipelines & Gates

  • Feature: CI build/scan/sign
  • Feature: CD with canary/blue‑green + auto‑rollback
  • Feature: SLO verifiers & deployment events

Epic: Migrations & Schema Evolution

  • Feature: Migration runner (hooks) + ledger
  • Feature: Config schema compat checker
  • Feature: Backfill job + observability

Epic: Ops Tooling

  • Feature: Smoke/verify/rollback scripts
  • Feature: Release notes generator
  • Feature: Disaster‑recovery playbook (snapshot restore)