🛠 External Configuration Server¶
🎯 Vision¶
The External Configuration Server (ECS) will serve as a centralized, multi-tenant, secure, and dynamic configuration service for all microservices in the ConnectSoft AI Factory ecosystem. It allows runtime reconfiguration without redeployment, ensures tenant isolation, and enforces policy-driven overrides across editions and environments.
ECS transforms configuration from a static, file-driven activity into a policy-aware, observable, and externalized runtime service.
🌍 Problem Context¶
Traditional microservices suffer from:
- 📄 Config files baked into deployments (requiring restarts on change)
- 🌐 Scattered config sources (env vars, files, feature flags, secrets)
- ❌ No central observability (difficult to trace config changes)
- 🧩 Multi-tenant conflict (different tenants require different policies)
In a factory-scale system with 3000+ agents and microservices, this creates risk, inconsistency, and operational drag.
🧩 Purpose of ECS¶
ECS is designed to:
| Objective | ECS Responsibility |
|---|---|
| Centralize configuration | Single source of truth for runtime and environment variables |
| Enable dynamic updates | Push config changes at runtime without redeployment |
| Enforce multi-tenancy & editions | Per-tenant and per-edition overrides with RBAC enforcement |
| Ensure observability | Full audit trail, metrics, and tracing of configuration requests & mutations |
| Integrate with ecosystem | Provide APIs, gRPC, and event-driven notifications for microservices and agents |
🏗 Position in Architecture¶
ECS acts as an infrastructure microservice in the ConnectSoft ecosystem. It is consumed by:
- All domain microservices → for runtime configs
- Architect Agents → for blueprint-level config references
- DevOps Orchestrators → for rollout, environment management, and secrets injection
- QA Cluster → for validating edition/role-specific behavior under config variants
- Studio UI → for editing and visualizing tenant-specific config
🧬 High-Level Diagram – ECS in Factory¶
flowchart TD
subgraph Factory Microservices
A1[Service A] --> ECS
A2[Service B] --> ECS
A3[Service C] --> ECS
end
subgraph Core Infrastructure
ECS[External Configuration Server]
Vault[(Secrets / Key Vault)]
Bus[(Event Bus)]
end
subgraph Agents
ArchitectAgent --> ECS
DevOpsAgent --> ECS
QAAgent --> ECS
StudioUI --> ECS
end
ECS --> Vault
ECS --> Bus
📘 Principles¶
- DDD-Centric: Config treated as a domain object with aggregates (TenantConfig, EditionConfig).
- Clean Architecture: Ports/adapters for REST, gRPC, Event Bus.
- Event-Driven: Publish config changes as CloudEvents.
- Observability-First: All reads/writes are traceable with OpenTelemetry.
- Security-Hardened: Role-based access, integration with OpenIddict/Azure AD.
- Cloud-Native: Designed for AKS, scaling with caching + distributed persistence.
The External Configuration Server will become the backbone of runtime flexibility in the ConnectSoft AI Factory, enabling:
- 🔄 Safe, dynamic reconfiguration
- 🏢 Tenant- and edition-aware overrides
- 📊 Observability and auditability
- ⚡ Reliable integration via APIs, gRPC, and event streams
Without ECS, the Factory remains brittle, environment-dependent, and costly to scale across tenants.
🛠 Functional & Non-Functional Requirements¶
📋 Functional Requirements¶
The External Configuration Server (ECS) must deliver a comprehensive feature set to support the ConnectSoft Factory’s scale and principles.
| Area | Functional Requirement | Description / Notes |
|---|---|---|
| Configuration Management | CRUD for Config Objects | Create, update, delete, and retrieve configuration entities (TenantConfig, EditionConfig, ServiceConfig). |
| | Hierarchical Config Resolution | Resolve configuration in layers: Global → Edition → Tenant → Service. Supports overrides and fallbacks. |
| | Versioned Config | All configurations are versioned to allow rollbacks, diffs, and audit trails. |
| | Dynamic Updates | Config changes are pushed in real time via gRPC streams / Event Bus. |
| | Environment-Aware Config | Separate resolution paths for Dev, Test, Staging, Production. |
| Multi-Tenancy | Tenant Isolation | Each tenant's configuration is logically and physically isolated. |
| | Edition Overrides | Editions can apply overrides at the feature or property level. |
| Access & Security | Role-Based Access Control (RBAC) | Fine-grained policies: who can read, write, or override configs. |
| | Integration with Identity Providers | Supports OpenIddict and Azure AD for authentication and authorization. |
| | Secret Handling | Sensitive values delegated to Key Vault/Secrets Manager (not stored in ECS DB). |
| APIs & Interfaces | REST API | CRUD + query endpoints. |
| | gRPC Interface | High-performance communication for microservices. |
| | Event Streaming | Publish config changes as CloudEvents to Kafka/Service Bus/NATS. |
| | Admin UI / Studio Integration | Visualization and management of configs. |
| Observability | Full Audit Logging | Every read/write recorded with user/service identity and traceId. |
| | Metrics | Config read/write latency, cache hit/miss, per-tenant stats. |
| | Distributed Tracing | OpenTelemetry traces link config requests with service spans. |
📊 Non-Functional Requirements¶
| Category | Requirement | Target / Notes |
|---|---|---|
| Scalability | Horizontal scaling | ECS must handle thousands of config reads/sec across 3000+ services. |
| | Caching Layer | Configs cached in-memory + distributed cache (Redis) with invalidation on change. |
| Availability | High availability | ≥99.9% SLA, active-active across regions. |
| | Zero-downtime updates | Config server upgrades must not disrupt consumers. |
| Performance | Low-latency reads | <50 ms per config resolution under load. |
| | High throughput | Support ≥20K config resolutions/min per cluster. |
| Security | Data encryption | TLS 1.3 in transit, AES-256 at rest. |
| | Secret isolation | No secrets persisted; all resolved via Vault/Key Vault. |
| Reliability | Idempotency | Config change events deduplicated using semanticHash + traceId. |
| | Retry mechanisms | Client SDKs and server ensure at-least-once delivery. |
| Compliance & Audit | Full traceability | Retain config change history for ≥2 years. |
| | Policy enforcement | Built-in compliance checks (e.g., no empty values for critical configs). |
🧬 Diagram – Config Resolution Layers¶
flowchart TB
Global[Global Config] --> Edition[Edition Config]
Edition --> Tenant[Tenant Config]
Tenant --> Service[Service Config]
Service --> Final[Resolved Runtime Config]
Final -->|Delivered to| Microservice
Interpretation:
- Global defaults apply first.
- Editions override global.
- Tenants override editions.
- Services override tenants.
- Result = fully resolved, policy-compliant runtime config.
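A minimal C# sketch of this layered merge (illustrative types, not the ECS API): layers are applied lowest-precedence first, so a service-level value shadows tenant, edition, and global defaults key by key.

```csharp
using System.Collections.Generic;

public static class ConfigMerger
{
    // Lower-precedence layers first: Global -> Edition -> Tenant -> Service.
    // A null layer (e.g., no tenant override) is simply skipped.
    public static IReadOnlyDictionary<string, string> Resolve(
        params IReadOnlyDictionary<string, string>?[] layers)
    {
        var resolved = new Dictionary<string, string>();
        foreach (var layer in layers)
        {
            if (layer is null) continue;
            foreach (var pair in layer)
                resolved[pair.Key] = pair.Value; // closer scope overrides
        }
        return resolved;
    }
}

// Usage: var runtime = ConfigMerger.Resolve(global, edition, tenant, service);
```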
The External Configuration Server must support multi-level, versioned, observable, and secure configuration management at scale. Key requirements include:
- Functional: CRUD, versioning, overrides, real-time updates, multi-tenant enforcement.
- Non-Functional: High availability, low-latency reads, strong security, full audit trail.
These requirements form the acceptance criteria for the PRD, guiding architecture and implementation decisions.
🛠 Domain Model & DDD Aggregates¶
🧩 Domain-Driven Design Context¶
The ECS is not just a key-value store; it is a domain service where configuration is a first-class aggregate with policies, lifecycle, and traceability. We treat configuration as structured, multi-level domain objects that evolve over time.
🏗 Key Aggregates & Entities¶
| Aggregate / Entity | Description | Key Properties | Relationships |
|---|---|---|---|
| TenantConfig (Aggregate Root) | Represents the configuration context of a specific tenant. | TenantId, ConfigSet, Overrides, CreatedBy, UpdatedBy, Version | Contains multiple ServiceConfigs; inherits defaults from EditionConfig. |
| EditionConfig (Aggregate Root) | Captures configuration defaults for a specific product edition (e.g., Basic, Pro, Enterprise). | EditionId, ConfigSet, Overrides, PolicyRules, Version | Linked to many TenantConfigs; overrides GlobalConfig. |
| GlobalConfig (Aggregate Root) | Global system-wide defaults applied before edition or tenant overrides. | GlobalId, DefaultValues, Policies, Version | Parent to EditionConfig. |
| ServiceConfig (Entity) | Service-specific configuration for a tenant (e.g., BillingServiceConfig). | ServiceId, ConfigItems, Version | Child of TenantConfig. |
| ConfigItem (Value Object) | Atomic key-value pair with metadata. | Key, Value, Type, IsSecret, IsFeatureToggle | Immutable; part of ServiceConfig or higher config. |
| ConfigVersion (Entity) | Snapshot of configuration state at a point in time. | VersionId, Timestamp, ChangeLog, Hash | Belongs to any Config Aggregate for rollback. |
| PolicyRule (Entity) | Validation and enforcement rules (e.g., “must not be empty”, “allowed range”). | RuleId, Scope, Expression, Severity | Applied during config creation/updates. |
| AuditLog (Entity) | Records all reads/writes with full context. | LogId, Actor, Action, TraceId, Timestamp | Linked to TenantConfig/EditionConfig changes. |
📚 Relationships in ECS¶
erDiagram
GlobalConfig ||--o{ EditionConfig : defines
EditionConfig ||--o{ TenantConfig : appliesTo
TenantConfig ||--o{ ServiceConfig : contains
ServiceConfig ||--o{ ConfigItem : holds
TenantConfig ||--o{ ConfigVersion : snapshots
TenantConfig ||--o{ AuditLog : records
PolicyRule ||--o{ ConfigItem : validates
⚖️ Aggregate Responsibilities¶
- GlobalConfig – foundation of defaults across the ecosystem.
- EditionConfig – edition-specific overrides and policies.
- TenantConfig – tenant-specific configuration scope, isolation, and service configs.
- ServiceConfig – service-level overrides within a tenant.
- ConfigItem – immutable value objects ensuring integrity.
- ConfigVersion – ensures traceability, rollback, and deterministic replays.
- AuditLog – maintains compliance and accountability.
- PolicyRule – ensures configs remain valid and secure.
📘 Domain Events¶
ECS aggregates will emit domain events (later mapped to CloudEvents):
`ConfigCreated`, `ConfigUpdated`, `ConfigDeleted`, `ConfigVersioned`, `ConfigRollbackPerformed`, `PolicyViolationDetected`
These events feed event sourcing, observability, and downstream automation (QA, DevOps, Agents).
We’ve defined the domain model for ECS using DDD principles:
- Aggregates: GlobalConfig, EditionConfig, TenantConfig
- Supporting entities: ServiceConfig, ConfigVersion, PolicyRule, AuditLog
- Value objects: ConfigItem
- Domain events for traceability
This structure ensures clear boundaries, scalability, and auditability of configurations across 3000+ services.
🛠 High‑Level Architecture (HLA)¶
🎯 Architecture Goals & Constraints¶
Goals
- Low‑latency, highly available config reads for 3k+ services
- Deterministic resolution across Global → Edition → Tenant → Service
- Real‑time change propagation (streaming + events)
- Tenant/edition isolation with RBAC and full auditability
- Cloud‑native deployment (AKS), observability‑first, security‑first
Constraints
- No secrets at rest in ECS (resolve by reference to Key Vault)
- Clean Architecture + DDD boundaries
- Event‑driven integration (CloudEvents over MassTransit/Azure Service Bus)
- Idempotent writes and cache‑safe reads
🧱 Clean Architecture Layers (Ports & Adapters)¶
| Layer | Responsibilities | Examples (ECS) |
|---|---|---|
| Domain | Aggregates, policies, invariants, domain events | TenantConfig, EditionConfig, PolicyRule, ConfigVersion |
| Application | Use cases, orchestration, transactions | ResolveConfig, PublishChange, CreateSnapshot, Rollback |
| Interfaces (Adapters) | REST, gRPC, events, CLI/Admin UI adapter | Controllers, gRPC services, Event publisher/subscriber |
| Infrastructure | Persistence, cache, secrets, bus, telemetry | NHibernate/Azure SQL, Redis, Blob, Key Vault client, MassTransit |
🗺️ Component Topology (Logical)¶
graph TD
subgraph Clients
AppSDK["Client SDK (.NET/JS/Java)"]
Studio[Studio/Admin UI]
end
subgraph ECS Core
API["Config API (REST/gRPC)"]
AdminAPI[Admin API]
Resolver[Resolver Service]
Policy[Policy Engine]
Notifier[Change Notifier]
Streamer[Watch/Streamer]
Snapshotter[Snapshot & Archive Worker]
Projector[Audit Projector]
end
subgraph Data & Infra
SQL[(Azure SQL / PostgreSQL)]
Redis[(Redis Cache)]
Blob[(Blob Storage: Snapshots)]
KV[(Azure Key Vault)]
Bus[(Azure Service Bus / Kafka via MassTransit)]
OTEL[(OTel Exporters)]
end
AppSDK-->API
Studio-->AdminAPI
API-->Resolver
Resolver-->Policy
Resolver--hot read-->Redis
Resolver--strong read-->SQL
Resolver--secret refs-->KV
AdminAPI--writes-->SQL
AdminAPI--emit events-->Bus
Notifier--fanout-->Bus
Streamer--server push-->AppSDK
Snapshotter--store-->Blob
Projector--audit views-->SQL
API--telemetry-->OTEL
AdminAPI--telemetry-->OTEL
🔁 Core Execution Flows¶
1) Read/Resolve (hot path)¶
sequenceDiagram
participant S as Service (Client SDK)
participant API as Config API
participant R as Resolver
participant C as Redis Cache
participant DB as SQL Store
participant KV as Key Vault
S->>API: GET /v1/config/resolve?tenant&env&app&prefix
API->>R: Resolve(request, principal, scopes)
R->>C: TryGet(prefix, etag)
alt CacheHit
C-->>R: Resolved payload + etag
else Miss/Stale
R->>DB: Query hierarchical config (G→E→T→S)
R->>KV: Resolve secret references (if any)
R->>C: Set(prefix, resolved, etag, TTL)
end
R-->>API: Resolved config + etag
API-->>S: 200 OK (ETag, traceId)
Notes
- ETag for client‑side caching; If‑None‑Match supported.
- Secret values never stored in SQL/Redis — only resolved at read via Key Vault.
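On the client side, the conditional read can be as small as the following hedged sketch; the resolve route matches the sequence above, while the class and caching fields are illustrative assumptions.

```csharp
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public sealed class EcsReadClient
{
    private readonly HttpClient _http;
    private EntityTagHeaderValue? _etag; // last ETag returned for this prefix
    private string? _cachedPayload;      // last resolved config body

    public EcsReadClient(HttpClient http) => _http = http;

    // Conditional read: send If-None-Match; on 304 reuse the local copy.
    public async Task<string?> ResolveAsync(string tenant, string env, string app, string prefix)
    {
        var request = new HttpRequestMessage(HttpMethod.Get,
            $"/v1/config/resolve?tenant={tenant}&env={env}&app={app}&prefix={prefix}");
        if (_etag is not null)
            request.Headers.IfNoneMatch.Add(_etag);

        using var response = await _http.SendAsync(request);
        if (response.StatusCode == HttpStatusCode.NotModified)
            return _cachedPayload; // server confirmed our copy is current

        response.EnsureSuccessStatusCode();
        _etag = response.Headers.ETag;
        _cachedPayload = await response.Content.ReadAsStringAsync();
        return _cachedPayload;
    }
}
```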
2) Write/Publish (admin path)¶
sequenceDiagram
participant Admin as Admin UI/CLI
participant AAPI as Admin API
participant DB as SQL
participant Policy as Policy Engine
participant Bus as Event Bus
participant C as Redis
Admin->>AAPI: POST /v1/config/{scope} (If-Match: etag)
AAPI->>Policy: Validate (syntax, rules, scope, RBAC)
Policy-->>AAPI: OK/Violation
alt OK
AAPI->>DB: Upsert aggregate (new version)
AAPI->>Bus: Publish ConfigUpdated(cloudevent)
AAPI->>C: Invalidate affected prefixes
AAPI-->>Admin: 202 Accepted (versionId)
else Violation
AAPI-->>Admin: 400/409 with details
end
Notes
- Optimistic concurrency via ETag/Version.
- Idempotency key (semanticHash, traceId) to dedupe repeated submissions.
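One plausible way to derive the semanticHash half of that idempotency key is to hash a canonical, key-ordered rendering of the payload so identical submissions collide regardless of property order; the helper below is an assumption, not the ECS implementation.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public static class IdempotencyKey
{
    // Same keys/values in any property order -> same semanticHash.
    public static string SemanticHash(JsonElement payload)
    {
        var canonical = Canonicalize(payload);
        var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(bytes).ToLowerInvariant();
    }

    private static string Canonicalize(JsonElement e) => e.ValueKind switch
    {
        JsonValueKind.Object => "{" + string.Join(",",
            e.EnumerateObject()
             .OrderBy(p => p.Name, StringComparer.Ordinal)
             .Select(p => $"\"{p.Name}\":{Canonicalize(p.Value)}")) + "}",
        JsonValueKind.Array => "[" + string.Join(",",
            e.EnumerateArray().Select(Canonicalize)) + "]",
        _ => e.GetRawText() // strings, numbers, booleans, null
    };
}
```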
3) Watch/Streaming updates¶
- gRPC streaming or SSE/WebSocket.
- Server pushes diffs and new etag/version when relevant keys change.
- SDK applies delta and refreshes local cache; raises typed callbacks.
4) Snapshot & Rollback¶
- Snapshotter runs scheduled jobs or on‑demand to persist full resolved views to Blob.
- Rollback creates a new version from a prior snapshot (never mutates history).
- Projector maintains audit/read models for Studio timelines.
🧠 Tenancy, Editions & Scoping¶
- Scope model: Global → Edition → Tenant → Environment → Service → Tag(s)
- Deterministic precedence; last‑write wins at same level with version guards.
- All queries filtered by tenantId, edition, environment, and RBAC scopes.
- Policy Engine enforces forbidden prefixes and required keys per scope.
🚀 Performance & Caching Strategy¶
- Read‑optimized: Redis front‑cache for prefixes/namespaces; stale‑while‑revalidate.
- Write invalidation: precise key‑space eviction (prefix wildcards).
- Batched DB queries with computed overlay merge at the server (single roundtrip).
- Hot paths instrumented with P95 targets; background warmers for critical tenants.
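A compact sketch of the stale-while-revalidate read behavior, assuming one in-flight background refresh per key is sufficient; the cache type is illustrative, not the ECS implementation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Per-pod cache with stale-while-revalidate: expired entries are served
// immediately while a single background refresh repopulates them.
public sealed class SwrCache<TValue>
{
    private sealed record Entry(TValue Value, DateTimeOffset FreshUntil, Task? Refresh);

    private readonly ConcurrentDictionary<string, Entry> _entries = new();
    private readonly TimeSpan _ttl;

    public SwrCache(TimeSpan ttl) => _ttl = ttl;

    public async Task<TValue> GetAsync(string key, Func<string, Task<TValue>> load)
    {
        if (_entries.TryGetValue(key, out var entry))
        {
            if (DateTimeOffset.UtcNow < entry.FreshUntil)
                return entry.Value;                              // fresh hit

            if (entry.Refresh is null)                           // stale: start one refresh
                _entries[key] = entry with { Refresh = RefreshAsync(key, load) };
            return entry.Value;                                  // serve stale meanwhile
        }

        var value = await load(key);                             // cold miss: load inline
        _entries[key] = new Entry(value, DateTimeOffset.UtcNow + _ttl, null);
        return value;
    }

    // Error handling and single-flight races are simplified for the sketch.
    private async Task RefreshAsync(string key, Func<string, Task<TValue>> load)
    {
        var value = await load(key);
        _entries[key] = new Entry(value, DateTimeOffset.UtcNow + _ttl, null);
    }

    public void Invalidate(string key) => _entries.TryRemove(key, out _);
}
```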
🔐 Security Architecture¶
- AuthN: OIDC (OpenIddict/AAD).
- AuthZ: Tenant‑scoped RBAC (Admin/Operator/Reader/Auditor). Fine‑grained by key prefix and scope.
- Data: TLS 1.3, encryption at rest, allow‑list of output content types (JSON/YAML/INI).
- No secret persistence: store only references; resolve via Key Vault with managed identity.
- Signed CloudEvents to prevent tampering; event payload links to audit record.
📡 Observability & Audit¶
- OpenTelemetry traces on every request (`traceId`, `tenantId`, `actor`, `scope`, `etag`, `result`).
- Metrics: read latency, cache hit ratio, per‑tenant QPS, write throughput, policy violations, stream fan‑out lag.
- Structured logs with redaction and PII guards.
- Audit Projector provides timeline views: who changed what, when, why (reason/issue link).
🧩 Persistence Model¶
- Relational store (Azure SQL/PostgreSQL) for aggregates, versions, policies, audit projections.
- Blob for snapshots/archives and large documents.
- Redis for hot key‑spaces (read path).
- Event Bus for reactive consumers (cache invalidation, stream fan‑out, CI hooks).
🛠 Technology Choices (concrete)¶
- .NET 8 / ASP.NET Core
- NHibernate (relational persistence), FluentValidation for DTOs
- MassTransit + Azure Service Bus (events)
- Redis (StackExchange.Redis), Azure SQL/PG (primary store), Azure Blob, Key Vault
- gRPC for SDK channel; REST for admin and operational APIs
- OpenTelemetry + Serilog; Grafana/Prometheus/App Insights
🧯 Reliability Patterns¶
- Idempotent writes (semanticHash + traceId)
- Retry with jitter for transient DB/Bus ops; circuit breakers around KV/DB
- Backpressure on stream fan‑out; per‑tenant rate limits
- Dead‑letter topics for failed events; replay from snapshots
🧪 Failure Scenarios & Mitigations (examples)¶
| Scenario | Mitigation |
|---|---|
| Redis outage | Fallback to DB read (degraded latency), disable SWR, increase result TTL on recovery |
| Key Vault throttling | Cache positive resolutions with short TTL; exponential backoff; circuit to “secret‑unavailable” marker |
| Event Bus lag | Streamer polls on pull‑mode fallback; admin banner indicates lagged propagation |
| Hot‑spot tenant | Per‑tenant cache partitioning; prefix sharding; SDK exponential backoff |
☸️ Deployment Topology (AKS)¶
- Stateless API pods behind internal LB; HPA based on RPS/latency
- Streamer replicas scaled by subscription count
- Workers (Snapshotter/Projector) with KEDA triggers (queue depth/CRON)
- Zonal redundancy; rolling upgrades; PodDisruptionBudgets
- Helm charts with environment overlays; GitOps optional
📑 ADR Backlog (to be authored)¶
- ADR‑001: Hierarchical Resolution Strategy (server‑side overlay, determinism)
- ADR‑002: Secrets by Reference vs inline storage (KV integration)
- ADR‑003: Event Transport selection (ASB vs Kafka) and CloudEvents schema
- ADR‑004: Cache Topology (prefix caches, invalidation, SWR)
- ADR‑005: Streaming Protocol (gRPC vs SSE/WebSockets) and backpressure
- ADR‑006: Persistence (Azure SQL vs PostgreSQL) and NHibernate mappings
- ADR‑007: RBAC Model (tenant/environment/key‑prefix scopes)
- ADR‑008: Snapshot & Rollback semantics (append‑only versioning)
- ADR‑009: Observability Defaults (required spans, logs, metrics)
We defined the ECS high‑level architecture: clean layering, core components, execution flows, tenancy/security enforcement, observability, and cloud‑native deployment. This blueprint is ready for detailed APIs, schemas, and SDK contracts.
✅ ECS PRD: Functional & Non‑Functional Requirements (Ready for ADO Epics)¶
Scope: External Configuration Server (ECS) for multi‑tenant, edition‑aware, policy‑driven runtime configuration across ConnectSoft microservices (.NET, Clean Architecture, DDD, EDA).
Outcome: A complete PRD slice that we can decompose into Azure DevOps Epics/Features/Stories in the next cycle.
🧭 Product Scope & Boundaries¶
In‑scope
- Centralized configuration resolution and delivery for services (REST/gRPC/SDK).
- Multi‑tenant + multi‑edition overrides with policy enforcement (RBAC, ABAC).
- Environment layering (global → environment → edition → tenant → service → instance).
- Versioning, rollout, preview, audit, and rollback.
- Observability (metrics, logs, traces, audit trail) and event notifications.
- Integrations: Azure Service Bus (events), Azure Key Vault (secrets reference), Redis (edge cache).
Out‑of‑scope (v1)
- Secrets storage (managed by Key Vault; ECS stores references).
- Feature experimentation framework (flags supported; experiments later).
- UI Studio admin (initially minimal CRUD UI; advanced UX later).
🧩 Domain Model (DDD)¶
Aggregates
- `Tenant(TenantId, Name, Status, Editions[])`
- `Edition(EditionId, Name, PolicySet)`
- `ConfigBundle(BundleId, Scope, Keys[], Version, CreatedBy, CreatedAt)`
- `ConfigItem(Key, Value, Type, SchemaRef, Metadata)`
- `Policy(RBAC roles, ABAC conditions, constraints e.g., max TTL)`
- `ResolutionRequest(ServiceId, TenantId, EditionId, Environment, InstanceTags[])`
- `ResolutionResult(ResolvedMap, VersionGraph, SourceTrace[], ETag)`
Scopes & Precedence (highest wins)
- Instance (Pod/Slot)
- Service (Microservice)
- Tenant
- Edition
- Environment (dev/stage/prod)
- Global
Conflict resolution = closest scope wins, then latest effective version, then policy constraint.
🧠 Core Functional Requirements¶
1) Read/Resolve¶
- FR‑R1: Resolve configuration at runtime: `Resolve(serviceId, tenantId, editionId, env, tags[]) → map`
- FR‑R2: Conditional resolution with preview mode (no write/audit side‑effects).
- FR‑R3: Strong caching: ETag support; 304 semantics; per-scope TTL.
- FR‑R4: Watch/Subscribe for change events via gRPC streaming or SSE.
- FR‑R5: ResolveTrace: return provenance (which scopes/versions produced each key).
2) Write/Manage¶
- FR‑W1: CRUD for Bundles & Items per scope; batch upserts with schema validation.
- FR‑W2: Versioning on each change.
- FR‑W3: Rollback to prior version.
- FR‑W4: Policy aware writes (RBAC roles + ABAC e.g., “only Ops may change prod”).
- FR‑W5: Dry‑run validation (schema + policy + conflict preview).
3) Policy & Security¶
- FR‑P1: RBAC roles: `ECS.Admin`, `ECS.Editor`, `ECS.Auditor`, `ServiceAccount`.
- FR‑P2: ABAC conditions: environment, tenant, edition, label selectors.
- FR‑P3: Audit every read/write (who/what/when/from where, masked values).
- FR‑P4: Secrets as references: `kvref://{vault}/{secretName}@{version}`
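A hedged sketch of resolving such a reference at read time with the Azure Key Vault SDK; the parsing of the `kvref` scheme (vault name → vault URI, `name@version` path) is an assumption based on the format above.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

public static class KvRefResolver
{
    // Resolves "kvref://{vault}/{secretName}@{version}" at read time.
    // The ECS store only ever holds the reference string, never the value.
    public static async Task<string> ResolveAsync(string kvref)
    {
        var uri = new Uri(kvref); // scheme=kvref, host=vault, path=/secretName@version
        var vaultUri = new Uri($"https://{uri.Host}.vault.azure.net/");
        var nameAndVersion = uri.AbsolutePath.TrimStart('/').Split('@', 2);
        var name = nameAndVersion[0];
        var version = nameAndVersion.Length > 1 ? nameAndVersion[1] : null;

        var client = new SecretClient(vaultUri, new DefaultAzureCredential());
        KeyVaultSecret secret = await client.GetSecretAsync(name, version);
        return secret.Value;
    }
}
```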
4) Events & Integration¶
- FR‑E1: Emit CloudEvents: `ConfigChanged`, `PolicyChanged`, `BundlePublished`, `RollbackPerformed`.
- FR‑E2: Outbox pattern to Azure Service Bus; retries + idempotency keys.
- FR‑E3: SDK callback hooks for hot‑reload in services.
5) Observability¶
- FR‑O1: OTEL spans for resolve & write.
- FR‑O2: Metrics (p95 resolve latency, cache hit ratio, 4xx/5xx).
- FR‑O3: Structured logs with `traceId`, `tenantId`, `serviceId`, `scope`, `keyCount`.
🔒 Security & Compliance Requirements (NFR‑Security)¶
- NFR‑S1: OIDC (OpenIddict/Azure AD). Service‑to‑service via client credentials; scopes per operation.
- NFR‑S2: Data encryption at rest (DB) + TLS in transit.
- NFR‑S3: PII/secret redaction in logs & audit streams; schema flags: `sensitivity: pii|secret`.
- NFR‑S4: Multi‑tenant isolation in data partition & query filters; per‑tenant keys/ETags.
⚙️ Performance & Reliability (NFR‑Perf/Rel)¶
- NFR‑P1: p95 resolve ≤ 30 ms (hot path with Redis); cold ≤ 150 ms.
- NFR‑P2: Peak read QPS ≥ 20k (horizontally scalable API replicas).
- NFR‑R1: Availability SLO 99.95% for read APIs.
- NFR‑R2: Zero‑downtime publish; rolling upgrades; blue/green for storage migrations.
- NFR‑C1: Config size per resolution ≤ 512 KB (soft limit), item size ≤ 16 KB (hard).
🧪 Quality & Validation (NFR‑QA)¶
- NFR‑Q1: Contract tests for REST/gRPC (& SDK) with golden fixtures.
- NFR‑Q2: Chaos tests: cache node loss, bus outage, DB failover; system remains read‑available (stale‑ok).
- NFR‑Q3: Security tests for RBAC/ABAC bypass attempts; negative-path coverage for ≥ 95% of rules.
🌉 External Interfaces¶
REST (subset)¶
- `GET /v1/resolve?serviceId&tenantId&editionId&env&tags=...` → `200 {data, eTag, trace}`
- `GET /v1/watch?serviceId...` (SSE) → `event: ConfigChanged`
- `POST /v1/bundles/{scope}` (upsert; JSON Schema validation)
- `POST /v1/preview/resolve` (dry‑run)
- `POST /v1/rollback` (bundleId, targetVersion)
gRPC¶
- `Resolve()` unary; `Watch()` server stream with backoff hints.
Events (Azure Service Bus topics)¶
- `ecs.config.changed`
- `ecs.bundle.published`
- `ecs.policy.changed`
- `ecs.bundle.rolledback`
🧰 Client SDK (dotnet)¶
Package: ConnectSoft.Ecs.Client
Features:
- Typed options binding: `services.AddEcsConfig<TOptions>("namespace:keyPrefix")`
- Hot reload via `IOptionsMonitor` and gRPC watch
- Circuit‑breaker + per‑key caching with ETag
- Secrets resolver (`kvref://`) with Key Vault client
builder.Services.AddEcsClient(o =>
{
o.ServiceId = "billing-service";
o.Environment = "prod";
o.Edition = "enterprise";
o.TenantIdProvider = () => TenantContext.Current?.TenantId;
});
🏗️ High‑Level Architecture¶
flowchart LR
subgraph Clients
SvcA[Service A]:::svc -->|Resolve/Watch| API
SvcB[Service B]:::svc -->|Resolve/Watch| API
end
subgraph ECS
API[REST/gRPC API]:::core
Resolver[Resolver Engine]:::core
Policy[Policy Engine RBAC/ABAC]:::core
Cache[(Redis Edge Cache)]:::infra
Store[(Config DB)]:::infra
Outbox[Outbox]:::core
Bus[[Azure Service Bus]]:::infra
Audit[(Audit Store)]:::infra
KV[[Azure Key Vault]]:::infra
end
API --> Resolver --> Cache
Resolver <--> Store
Resolver --> Policy
Resolver --> KV
API --> Outbox --> Bus
API --> Audit
classDef core fill:#E0F2FE,stroke:#38BDF8,stroke-width:1.2px;
classDef infra fill:#F5F3FF,stroke:#8B5CF6,stroke-width:1.2px;
classDef svc fill:#ECFCCB,stroke:#84CC16,stroke-width:1.2px;
Storage options (pluggable via Clean Architecture):
- Primary: SQL (PostgreSQL/SQL Server) via NHibernate.
- Optional: Cosmos DB provider.
- Cache: Redis (clustered).
🔁 Resolution Algorithm (Deterministic)¶
1. Identify scope chain from request metadata.
2. Load latest effective versions for each scope (bundle snapshots).
3. Merge maps top‑down (global→env→edition→tenant→service→instance).
4. Apply policy constraints (deny/override/mask).
5. Expand secret references (Key Vault) if caller has scope (`allowInlineSecrets=false` by default).
6. Compute ETag (stable hash); return map + SourceTrace.
Edge cases:
- Missing keys → default value policy (deny/allow default).
- Conflicts → highest precedence wins; policy can block high precedence if violating constraints.
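FR‑R5's SourceTrace falls out of the same deterministic merge if each layer records its scope; in this illustrative C# sketch, the last entry in a key's trace is the winning scope.

```csharp
using System.Collections.Generic;

public sealed record ScopedLayer(string Scope, IReadOnlyDictionary<string, string> Items);

public static class TracedResolver
{
    // Merge lowest precedence first (Global ... Instance); record, per key,
    // every scope that supplied a value so ResolveTrace can show overrides.
    public static (Dictionary<string, string> Resolved,
                   Dictionary<string, List<string>> SourceTrace)
        Resolve(IEnumerable<ScopedLayer> layersLowToHigh)
    {
        var resolved = new Dictionary<string, string>();
        var trace = new Dictionary<string, List<string>>();
        foreach (var layer in layersLowToHigh)
        {
            foreach (var pair in layer.Items)
            {
                resolved[pair.Key] = pair.Value; // closest scope wins
                if (!trace.TryGetValue(pair.Key, out var sources))
                    trace[pair.Key] = sources = new List<string>();
                sources.Add(layer.Scope);        // last entry = winning scope
            }
        }
        return (resolved, trace);
    }
}
```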
📊 Observability Contract¶
Metrics
- `ecs_resolve_latency_ms` (p50/p95/p99)
- `ecs_cache_hit_ratio`
- `ecs_resolve_qps`
- `ecs_write_errors_total`
- `ecs_stream_clients_gauge`
Logs & Traces
- Enriched with `traceId`, `tenantId`, `serviceId`, `scope`, `bundleId`, `version`, redaction flags.
Audit
who, when, what, where (ip/user-agent), before/after diff (masked).
🚀 Delivery & Rollout¶
- Blue/Green API pods, Redis with keyspace notifications off (we push invalidate signals via Bus).
- Write path: transactional write → outbox → publish → cache invalidate (fan‑out key patterns).
- Hot Reload: streaming watchers receive `ConfigChanged` with ETag → services re‑resolve.
🧯 Failure Modes & Recovery¶
- Cache outage → fallback to DB (latency up; circuit policy).
- Bus outage → outbox retry; watchers retry with jitter.
- DB outage → serve last-known ETag from cache for `N` minutes (stale‑ok policy); emits degraded events.
📐 Acceptance Criteria (Representative)¶
- AC‑R‑01: Given a tenant + service + edition chain, when a config key exists in multiple scopes, then the service‑scoped value is returned and SourceTrace lists all overridden values.
- AC‑W‑02: A write with invalid schema returns `422` and does not create a new version; audit shows failure reason.
- AC‑P‑03: A user with `ECS.Editor` cannot write to `env=prod` without an `abac: env=prod` condition → `403`.
- AC‑E‑04: On successful publish, a `ConfigChanged` event is emitted and ≥95% of subscribed services observe the change within 3s (under nominal load).
- AC‑O‑05: p95 resolve latency ≤ 30 ms with warm cache under 5k RPS.
📦 Data Model (Simplified)¶
ConfigBundle:
  bundleId: guid
  scope:
    level: Global|Environment|Edition|Tenant|Service|Instance
    identifiers: { environment?, editionId?, tenantId?, serviceId?, instanceId? }
  version: int
  items: [ConfigItem]
  tags: [string]
  checksum: string
  createdBy: userId
  createdAt: datetime
  status: Draft|Published|Deprecated

ConfigItem:
  key: string
  type: string # string|int|bool|json|uri|kvref
  value: any
  schemaRef: uri?
  metadata: { sensitivity?: pii|secret, ttl?: int, description?: string }
🗺️ Azure DevOps Backlog Shape (Preview — detailed breakdown next cycle)¶
Epic A — Resolution & Delivery
- Feature A1: Resolve API + gRPC + ETag
- Feature A2: Watch/Subscribe streaming
- Feature A3: Redis caching + invalidation
Epic B — Authoring & Versioning
- Feature B1: Bundle/Item CRUD + batch upsert
- Feature B2: Versioning & rollback
- Feature B3: Preview/dry‑run + diff
Epic C — Policy & Security
- Feature C1: RBAC/ABAC engine
- Feature C2: Audit & redaction
- Feature C3: OIDC integration & scopes
Epic D — Events & SDK
- Feature D1: CloudEvents + Outbox to ASB
- Feature D2: .NET SDK (OptionsMonitor, KeyVault resolver)
- Feature D3: Sample integration in 2 reference services
Epic E — Observability & SLOs
- Feature E1: OTEL spans, metrics, logs
- Feature E2: Dashboards & SLO checks
- Feature E3: Chaos/resiliency tests
📌 Definitions of Done (per Feature)¶
- ✅ API/gRPC contract approved (OpenAPI/proto + breaking change check).
- ✅ Unit/integration/contract tests ≥ 85% line/branch in feature scope.
- ✅ OTEL + logs + metrics + audit events present & validated.
- ✅ Security checks (RBAC/ABAC/PII redaction) pass.
- ✅ Load tests meet SLOs (p95, error rate, cache hit).
- ✅ Docs: README, runbook, examples, SDK snippet.
🔗 Integration Points & APIs¶
🎯 Purpose¶
The External Configuration Server (ECS) must expose consistent, secure, and flexible APIs to integrate with all consumers in the ConnectSoft AI Factory ecosystem. These interfaces enable real-time retrieval, updates, and subscriptions to configuration values, ensuring both human and machine actors interact seamlessly.
📡 Integration Interfaces¶
| Interface Type | Purpose | Consumers |
|---|---|---|
| REST API | CRUD operations on tenant/edition configs, metadata management | Studio UI, DevOps tools, external systems |
| gRPC | High-performance, strongly typed API for runtime config resolution | Microservices, Agents (low-latency config fetch) |
| Event Bus | Publish/subscribe to configuration changes as CloudEvents v1.0 | Domain services, Architect Agents, QA Agents |
| SDKs/Libraries | Thin client libraries (C#, TypeScript, Python) for easy config injection | Factory microservices, test harnesses |
📘 API Contracts¶
REST¶
- `GET /config/{tenant}/{service}` → Retrieve active config
- `POST /config/{tenant}/{service}` → Create/update config
- `GET /history/{tenant}/{service}` → Fetch config audit trail
- `POST /rollback/{tenant}/{service}` → Rollback to prior version
gRPC¶
- `ResolveConfig(ResolveRequest) → ResolveResponse`
- `StreamConfigUpdates(SubscriptionRequest) → stream ConfigEvent`
Events (CloudEvents)¶
- Type: `factory.ecs.v1.config.updated`
- Attributes: `tenantId`, `editionId`, `serviceName`, `version`, `traceId`, `semanticHash`
- DataRef: points to signed configuration snapshot in blob store
🧩 SDK & Client Libraries¶
- C# SDK (`ECS.Client`)
  - `ConfigProvider.Resolve(service, tenant)`
  - `ConfigProvider.Subscribe(service, tenant, callback)`
- TypeScript SDK (for UI & frontend agents)
  - Hooks for live subscription updates
- Python SDK (for AI agents & ML services)
  - Easy access to tenant configs in training/inference pipelines
🏗 Diagram – ECS Integration Surfaces¶
flowchart LR
StudioUI -->|REST| ECS
DevOpsAgent -->|REST| ECS
MicroserviceA -->|gRPC| ECS
MicroserviceB -->|gRPC| ECS
ArchitectAgent -->|Events| ECS
QAAgent -->|Events| ECS
ECS -->|SDKs| ClientLibs[(C#/TS/Python SDKs)]
📘 Principles for Integration¶
- Idempotency: All API updates are idempotent (semanticHash-based).
- Traceability: Each API call returns a `traceId` for correlation.
- Security-First: All endpoints secured with OAuth2 scopes (e.g., `ecs.read`, `ecs.write`).
- Versioning: APIs versioned (`/v1/`, `vNext`) with strong backward compatibility.
- Polyglot Support: SDKs generated using OpenAPI/gRPC tooling for multiple languages.
➡️ With these integration points, ECS becomes an accessible backbone service, allowing any microservice, agent, or human operator to retrieve and react to configuration in real-time.
🏢 Multi-Tenancy & Edition Management¶
🎯 Purpose¶
The External Configuration Server (ECS) must support multi-tenancy and edition-aware configuration management as a first-class concern. This ensures that different customers (tenants) and their specific editions (pricing or functional tiers) can receive isolated, policy-driven configurations without conflict.
🌍 Multi-Tenancy Principles¶
- Isolation by Tenant → Configurations are stored, resolved, and audited per tenant boundary.
- Shared Infrastructure, Segregated Data → ECS runs as a shared service, but configuration data is tenant-scoped.
- RBAC-Scoped Access → Only authorized roles (e.g., Tenant Admin, Factory Operator) can mutate tenant-specific configs.
- Noisy Neighbor Protection → Rate limiting and quotas prevent one tenant from overloading ECS capacity.
🧩 Edition Awareness¶
- Base Business Model → Provides default configuration values (global, edition-agnostic).
- Edition Overrides → Editions (e.g., Free, Standard, Enterprise) override global values for functionality toggles, limits, or integrations.
- Tenant Overrides → Individual tenants may further override edition defaults, within defined policies.
- Policy Guardrails → Edition-level limits (e.g., “Enterprise supports unlimited API calls, Free has 10k/month”) are enforced automatically.
📘 Hierarchical Resolution Model¶
Config resolution follows a hierarchical override chain:
Example:
- Default: `MaxUsers = 100`
- Edition: Enterprise → `MaxUsers = 1000`
- Tenant ABC (Enterprise) → `MaxUsers = 1500` (if allowed by policy)
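A tiny sketch of the guardrail implied by "if allowed by policy": the tenant override is accepted only up to a hypothetical edition-defined ceiling.

```csharp
using System;

// Hypothetical guardrail: a tenant may raise a numeric limit above the
// edition default only up to a policy-defined ceiling.
public sealed record OverridePolicy(int MaxAllowed);

public static class OverrideGuard
{
    public static int ApplyTenantOverride(int editionValue, int? tenantValue, OverridePolicy policy)
    {
        if (tenantValue is null)
            return editionValue;                 // no override requested
        if (tenantValue.Value > policy.MaxAllowed)
            throw new InvalidOperationException(
                $"Tenant override {tenantValue} exceeds policy ceiling {policy.MaxAllowed}.");
        return tenantValue.Value;                // e.g. MaxUsers = 1500 for tenant ABC
    }
}
```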
🔐 Security & Governance¶
- Per-Tenant Encryption → Config values encrypted with tenant-specific keys in Azure Key Vault.
- Audit Scope → Every change linked to tenant, edition, actor, and `traceId`.
- Cross-Tenant Safety → ECS enforces absolute separation; no tenant can read another’s configuration.
🏗 Diagram – Multi-Tenant & Edition Config Flow¶
flowchart TD
Global[🌐 Global Config] --> Edition[📦 Edition Config]
Edition --> Tenant[🏢 Tenant Config]
Tenant --> Runtime[⚡ Runtime Override]
Runtime --> ECS[(External Config Server)]
ECS --> ServiceA[Service A]
ECS --> ServiceB[Service B]
📊 Example Use Case¶
| Tenant | Edition | Config Key | Value |
|---|---|---|---|
| Global | Default | FeatureX | OFF |
| All | Enterprise | FeatureX | ON |
| ABC | Enterprise | FeatureX | ON, custom rate 500/s |
| XYZ | Free | FeatureX | OFF |
📘 Principles for Multi-Tenant Management¶
- Edition-Aware Inheritance: Every tenant config inherits from its edition baseline.
- Policy-Driven Overrides: Tenants can override edition configs only within guardrails.
- Strict Isolation: No cross-tenant data visibility or leakage.
- Auditability: All tenant/edition config changes fully traceable.
- Dynamic Runtime Resolution: Overrides are resolved on every config fetch (no redeploys).
➡️ With this, ECS becomes SaaS-ready: one service, securely serving thousands of tenants with multiple editions, while preserving isolation, scalability, and flexibility.
💾 Persistence & Storage Design¶
🎯 Purpose¶
The External Configuration Server (ECS) requires a robust persistence layer to store, version, and retrieve configuration data across tenants, editions, and environments. This cycle defines how ECS persists configuration artifacts, ensures durability, and enables auditability at scale.
🏛 Core Principles¶
- Multi-Model Storage → Structured metadata in SQL; flexible config documents in NoSQL.
- Immutable Versioning → Every change creates a new version; rollback is always possible.
- Separation of Concerns → Config metadata (who/when/where) stored apart from config payloads (what).
- Tenant & Edition Isolation → Partitioning strategies prevent cross-tenant conflicts.
- Cloud-Native Durability → Built on Azure SQL + Cosmos DB + Key Vault with geo-redundancy.
🗂 Data Model¶
Entities¶
- ConfigItem
  - `ConfigId` (GUID)
  - `TenantId`
  - `EditionId`
  - `Environment` (Dev, QA, Prod)
  - `PayloadRef` (pointer to storage blob/NoSQL doc)
- ConfigVersion
  - `VersionId`
  - `ConfigId` (FK)
  - `Hash` (SHA-256 for integrity)
  - `CreatedAt`
  - `CreatedBy` (User/Service ID)
  - `ChangeReason`
- AuditLog
  - `EntryId`
  - `ConfigId`
  - `VersionId`
  - `Actor`
  - `Action` (Create, Update, Rollback)
  - `TraceId`
🗄 Storage Technologies¶
| Layer | Technology | Purpose |
|---|---|---|
| Metadata | Azure SQL | Strong consistency for IDs, references, indexes |
| Payload Storage | Azure Cosmos DB | Flexible JSON docs for config payloads |
| Secrets | Azure Key Vault | Encrypt sensitive values per-tenant |
| Backups/Archives | Azure Blob | Cold storage of snapshots & exports |
🔄 Versioning & History¶
- Every config update produces a new immutable version.
- Versions can be queried:
  - Latest only (default)
  - Specific version (for rollback/testing)
  - Range queries (audit & troubleshooting)
- ECS guarantees event-sourced persistence → “What changed, when, and by whom” is always reconstructable.
🗃 Partitioning & Indexing¶
- TenantId + EditionId = Primary partition key in Cosmos DB.
- Environment indexed for fast lookups.
- Hot configs cached in Redis for ultra-low-latency retrieval.
🔐 Security¶
- All payloads encrypted at rest (AES-256).
- Per-tenant encryption keys in Key Vault.
- Fine-grained RBAC on read/write at DB layer.
- Write operations only through ECS API (no direct DB access).
🏗 Diagram – Persistence Flow¶
flowchart TD
App[Client Service] --> ECS[(External Config Server)]
ECS --> SQL[(Azure SQL Metadata)]
ECS --> COSMOS[(Cosmos DB Payloads)]
ECS --> KeyVault[(Azure Key Vault)]
ECS --> Blob[(Azure Blob Snapshots)]
📊 Example¶
- Tenant: VetClinicCo
- Edition: Enterprise
- Env: Production
- Config: `MaxConcurrentJobs = 500`
Version history:
- v1 → 100 (default)
- v2 → 200 (edition override)
- v3 → 500 (tenant override, policy-approved)
All versions persist in Cosmos DB with metadata in SQL and secured values in Key Vault.
✅ Principles Recap¶
- Immutable Versioning – No destructive updates.
- Multi-Model Storage – SQL + NoSQL hybrid design.
- Per-Tenant Isolation – Partition + encryption per tenant.
- Audit-First – Every change logged with traceability.
- Cloud-Native Resilience – High availability + geo-redundant backup.
🔐 Security & Access Control¶
🎯 Goals¶
The External Configuration Server (ECS) must be secure by design, ensuring that only authenticated and authorized actors can access or modify configuration data. Security must address multi-tenancy, API surface hardening, and end-to-end trust across environments.
🧩 Security Requirements¶
| Requirement | Implementation Strategy |
|---|---|
| Authentication | OpenIddict / Azure AD integration for OAuth2 & OIDC. |
| Authorization (RBAC/ABAC) | Fine-grained role-based and attribute-based access per tenant, edition, and environment. |
| Tenant Isolation | Scoped tokens that prevent cross-tenant data access. Tenant ID enforced in every query. |
| Secure Communication | gRPC over mTLS + REST endpoints with TLS 1.3. |
| Secrets Handling | Integrate with Key Vault (Azure Key Vault or HashiCorp Vault) for secret material. |
| Policy-Driven Overrides | Apply policies that enforce configuration rules at runtime (e.g., edition cannot override tenant global). |
| Audit & Traceability | Every read/write secured with claims and logged with trace context. |
🔑 Authentication & Identity Flow¶
sequenceDiagram
participant Client as Microservice/Agent
participant ECS as External Config Server
participant IDP as Identity Provider (OpenIddict/Azure AD)
Client->>IDP: Request Token (Client Credentials / On-Behalf-Of)
IDP-->>Client: Access Token (JWT w/ claims)
Client->>ECS: Call API w/ Bearer Token
ECS->>IDP: Validate Token (signature, expiry, audience, tenant claims)
ECS-->>Client: Authorized Response (config values)
- Supported Grants: Client Credentials (service-to-service), Authorization Code (Studio UI), On-Behalf-Of (agent delegation).
- JWT Validation: Issuer, audience, expiry, tenant ID, roles.
🛡️ Authorization Model¶
Roles¶
- ConfigAdmin → Full control over tenant configs.
- ConfigEditor → Create/update configs within tenant scope.
- ConfigReader → Read-only access.
- SystemObserver → Access to audit/logs for compliance.
Enforcement¶
- RBAC enforced at API layer (per endpoint).
- ABAC used for policy enforcement (e.g., “Edition config overrides allowed only if flag enabled”).
🏢 Tenant Isolation¶
- Each request must carry a TenantID claim (in token).
- ECS enforces row-level isolation at persistence level (per-tenant schemas or partition keys).
- No cross-tenant visibility in APIs or events.
🔐 Secrets Handling¶
- ECS does not store secrets directly; instead, it stores references (URIs, Key Vault keys).
- Integration with:
  - Azure Key Vault for cloud deployments.
  - HashiCorp Vault for hybrid setups.
- ECS fetches secrets at runtime with least-privilege identity (Managed Identity in Azure).
🔏 Secure API Surface¶
- REST → secured via TLS 1.3 + OAuth2 Bearer tokens.
- gRPC → secured via mTLS (mutual certificate-based auth) + OAuth2.
- Rate limiting & throttling to prevent abuse.
- CORS restrictions for Studio UI.
⚖️ Policy-Driven Overrides¶
- Global → Edition → Tenant → Environment → Service hierarchy.
- Policies enforce:
  - No tenant override of system-critical keys.
  - Edition policies can restrict tenant-level overrides.
  - Environments can enforce stricter security keys (prod > dev).
📊 Audit & Compliance¶
- Every config read/write logged with:
  - Actor identity (user/service).
  - Tenant & edition.
  - Trace context (OpenTelemetry).
- Immutable audit logs → stored in append-only storage (e.g., Azure Blob immutability policies).
- Compliance: SOC2, GDPR (data access restrictions), ISO 27001 alignment.
✅ ECS gains end-to-end trust, fine-grained authorization, and policy-driven safeguards—making it resilient against misuse while enabling flexible tenant/edition operations.
📈 Observability & Monitoring¶
🎯 Goals¶
Give the External Configuration Server (ECS) complete, actionable visibility over read/write paths and propagation so teams can detect, debug, and prevent issues fast.
- Metrics: SLOs/SLA tracking for resolve latency, availability, cache hit ratio, event lag.
- Tracing: End-to-end spans across client → ECS → cache/DB → Key Vault → bus.
- Logging: Structured, privacy-safe logs with correlation.
- Audit Trails: Immutable evidence of who changed what, when, where, and why.
- Dashboards & Alerts: Grafana/App Insights views and guardrail alerts wired to on-call.
🔭 Signal Model (four pillars)¶
| Pillar | Purpose | Where |
|---|---|---|
| Metrics | SLOs, trends, capacity | Prometheus/App Insights Metrics |
| Traces | Causality & latency breakdown | OpenTelemetry → OTLP exporter |
| Logs | Forensic details & exceptions | Serilog → ELK/App Insights |
| Audit | Compliance-grade change evidence | Append-only store + query view |
🧩 Telemetry Architecture¶
flowchart LR
Client[(Service/SDK)] --OTEL ctx--> API[Config API]
API --> Resolver
API --> AdminAPI
Resolver --> Redis[(Redis)]
Resolver --> SQL[(SQL)]
Resolver --> KV[(Key Vault)]
AdminAPI --> Bus[[Event Bus]]
API & AdminAPI --> OTEL[OpenTelemetry Exporter]
OTEL --> Prom[Prometheus]
OTEL --> AI[App Insights]
Logs[Serilog/JSON] --> ELK[(ELK/App Insights Logs)]
AdminAPI --> Audit[(Immutable Audit Store + Projections)]
📊 Metrics (names, labels, targets)¶
Service-level
- `ecs_resolve_requests_total{tenant,env,service}`
- `ecs_resolve_latency_ms{quantile=50|95|99, tenant, env}` → p95 ≤ 30 ms (hot), ≤ 150 ms (cold)
- `ecs_resolve_errors_total{code}`; `rate(…[5m])` < 0.5%
- `ecs_cache_hit_ratio{tenant}` → ≥ 0.85
- `ecs_event_fanout_lag_ms{topic}` → p95 < 2000 ms
- `ecs_stream_clients_gauge{node}` (active watchers)
- `ecs_write_requests_total{op=create|update|rollback}`
- `ecs_write_policy_violations_total{policyId, severity}`
Infra
- `redis_ops_total{cmd}`, `redis_ttl_seconds_bucket`
- `sql_query_latency_ms{query}`
- `kv_request_latency_ms{op=getSecret}` (sampled)
- `bus_publish_latency_ms{topic}`
SLO rollups
- `slo_availability_ratio` (read API) → ≥ 99.95%
- `slo_freshness_ratio` (watchers receive change < 3s) → ≥ 95%
Store as Prometheus counters/histograms; mirror key KPIs into App Insights for product owners.
🧵 Tracing (OpenTelemetry)¶
- Trace root: `ResolveConfig` (read) / `WriteConfig` (write)
- Spans:
  - `cache.get(prefix)` / `cache.set(prefix)`
  - `sql.query(hierarchy)` / `sql.upsert(bundle)`
  - `kv.resolve(secretRef)` (redacted attributes)
  - `bus.publish(ConfigChanged)`
- Trace context: `trace_id`, `span_id`, `tenantId`, `env`, `serviceId`, `editionId`, `etag`, `version`
- Baggage (lightweight): `request_class=hot|cold`, `cache_hit=true|false`
Sampling
- Default 10% for hot read path; 100% for errors/slow spans (`p95+`); 100% for writes.
🧾 Logging (structured & safe)¶
- Format: JSON, Serilog.
- Required fields: `ts`, `level`, `message`, `traceId`, `spanId`, `tenantId`, `env`, `serviceId`, `actor`, `operation`, `status`, `durationMs`
- Redaction: never log secret values; mask values with `"<redacted>"`; log key paths and hashes only.
- Error taxonomy: `ValidationError` (422), `PolicyViolation` (403), `ConcurrencyConflict` (409), `UpstreamTimeout` (502), `InternalFailure` (500).
Example (write path)
{
"ts":"2025-08-21T18:25:43Z",
"level":"Information",
"message":"Config bundle published",
"traceId":"3f1f…",
"tenantId":"t-abc",
"env":"prod",
"operation":"PublishConfig",
"bundleId":"b-77b2",
"version":42,
"keysTouched":128,
"policyViolations":0,
"etag":"W/\"d41d8c\""
}
🧑⚖️ Audit Trails (immutable & queryable)¶
- What is audited: All writes (create/update/delete/rollback), policy decisions, and admin reads of sensitive scopes.
- Record: `auditId`, `actor`, `actorType` (user|service), `action`, `scope` (global|edition|tenant|service|instance), `beforeHash`, `afterHash`, `reason`, `ip`, `userAgent`, `traceId`, `ts`.
- Storage:
  - Append-only (e.g., Blob with immutability policy / WORM).
  - Projection into SQL table(s) for timelines and Studio queries.
- Retention: 24 months (configurable by tenant plan).
- Export: Signed CSV/JSON with evidence chain (hash of file published to audit ledger topic).
📺 Dashboards (Grafana/App Insights)¶
Operations (NOC)
- Read latency (p50/p95/p99) by env & tenant
- Error rate by route
- Cache hit ratio
- Event fan-out lag
- Stream clients trend
SRE
- DB query time heatmap; Redis saturation; Key Vault throttling; Bus publish latency
- Top tenants by QPS; hotspot keys/prefixes
- SLA/SLO burn-down (error budget)
Product/Compliance
- Changes by tenant/edition over time
- Policy violations & severities
- Audit export status
🚨 Alerts (examples)¶
| Condition | Expression (PromQL-ish) | Action |
|---|---|---|
| p95 resolve latency > 60ms for 10m | `hist_quantile(0.95, rate(ecs_resolve_latency_ms_bucket[10m])) > 60` | Page SRE (P2) |
| Error rate > 1% for 5m | `rate(ecs_resolve_errors_total[5m]) / rate(ecs_resolve_requests_total[5m]) > 0.01` | Page SRE (P2) |
| Cache hit < 70% for 15m | `avg_over_time(ecs_cache_hit_ratio[15m]) < 0.7` | Ticket + Slack (P3) |
| Event lag p95 > 5s for 5m | `hist_quantile(0.95, rate(ecs_event_fanout_lag_ms_bucket[5m])) > 5000` | Page on-call (P2) |
| Audit write failures > 0 | `increase(ecs_audit_write_errors_total[5m]) > 0` | Page SRE (P1) |
🧪 Observability DoD (per feature)¶
- OTEL spans present with correct attributes and parentage.
- Metrics emitted with required labels (`tenantId`, `env`, `serviceId`).
- Logs redact PII/secret content; include correlation IDs.
- Audit entries created for all write operations with before/after hashes.
- Dashboards panels updated; alert rules validated in staging chaos runs.
🛠 Runbook Snippets¶
Trace a slow read
- Locate alert → pick `traceId`.
- In traces, check spans: `cache.get` (miss?) → `sql.query` (slow?) → `kv.resolve` (throttled?).
- If cache misses spike, check for an invalidation storm from a recent publish.
- Mitigate: raise TTL temporarily, throttle publishers, enable SWR.
Missing change in consumers
- Check `ecs_event_fanout_lag_ms` & outbox backlog.
- Verify subscriber count; look for DLQ on the topic.
- If lag high: scale Notifier/Streamer; apply backpressure config.
🔒 Privacy & Compliance¶
- PII/Secrets: never stored in metrics/traces/logs; only hashes and key paths permitted.
- Data residency: metrics/logs stored regionally per tenant policy where required.
- Access: dashboards segmented by role; audit exports require elevated scope and MFA.
✅ Outcome¶
ECS now has operational telemetry, forensic visibility, and compliance-grade auditability. With dashboards and alerts, issues are caught early and traced precisely—supporting our SLOs and tenant commitments.
⚡ Eventing & Change Propagation¶
How ECS publishes configuration changes and guarantees safe, observable fan‑out across the factory.
🎯 Goals¶
- Make every configuration change a first‑class event (CloudEvents) with tenant/edition scope.
- Provide low‑latency fan‑out to thousands of services via Azure Service Bus (default) and Kafka (optional).
- Guarantee at‑least‑once delivery with consumer idempotency, retries, DLQ, and replay.
- Support versioned configurations (optimistic concurrency, ETags, semantic versions, snapshots, and deltas).
- Preserve traceability: `traceId`, `changeSetId`, `configVersion` on every hop via OpenTelemetry.
🧩 Event Model¶
Event Envelope (CloudEvents 1.0)¶
{
"specversion": "1.0",
"type": "com.connectsoft.ecs.ConfigurationChanged",
"source": "/ecs/tenants/{tenantId}/namespaces/{ns}/keys/{configKey}",
"id": "evt_{changeSetId}",
"time": "2025-08-21T11:25:43.123Z",
"datacontenttype": "application/json",
"subject": "tenant:{tenantId}|edition:{edition}|env:{environment}",
"traceparent": "00-{traceId}-{spanId}-01",
"dataschema": "https://schemas.connectsoft.dev/ecs/v1/events/configuration-changed.json",
"data": {
"tenantId": "t-001",
"edition": "enterprise",
"environment": "prod",
"namespace": "payments",
"configKey": "payment.retryPolicy",
"changeSetId": "cset_7f2a7c",
"configVersion": "2025.08.21+00023",
"operation": "Upsert", // Upsert | Delete | Patch
"payloadType": "Delta", // Snapshot | Delta
"etag": "W/\"b9-1e9b\"",
"checksum": "sha256:3be2e...",
"previousVersion": "2025.08.21+00022",
"changedBy": "user:alice@connectsoft.ai",
"reason": "Increase retries for PSP flakiness",
"ttlSeconds": 600
}
}
Why CloudEvents? Uniform schema for REST webhooks, Service Bus topics, Kafka topics, and gRPC streaming — minimizing adapters.
🛰️ Transports & Channels¶
| Transport | ECS Role | Default Topology | Notes |
|---|---|---|---|
| Azure Service Bus | Primary | Topic `ecs.config.changed.v1` → subscriptions per consumer group (service) | Delivery & retry semantics built‑in; rules for tenant/edition filters |
| Kafka | Optional | Topic `ecs.config.changed.v1` with tenant/edition partitions | High‑throughput; consumer offset storage |
| gRPC Server Streaming | Optional | `WatchConfig()` per (tenantId, namespace, selector) | For in‑cluster low‑latency watchers |
| Webhooks | Optional | Signed HTTP POST with CloudEvents envelope | For third‑party SaaS integrations |
Topic Partitioning
- Service Bus: use subscriptions with correlation filters on `tenantId`, `edition`, `namespace`, `configKey`.
- Kafka: partition by `hash(tenantId|namespace)` for locality; key = `tenantId|namespace|configKey`.
🔄 Publisher Flow (ECS)¶
sequenceDiagram
participant Client as Studio/API Client
participant ECS as ECS API
participant OUTBOX as ECS Outbox
participant BUS as Service Bus / Kafka
participant OTel as OTEL Exporter
Client->>ECS: PUT /configs/{tenant}/{ns}/{key} (If-Match: etag)
ECS->>ECS: Validate, authorize, persist (version + etag)
ECS->>OUTBOX: Append Outbox record (changeSetId, payload)
ECS->>OTel: Span: ecs.config.write
OUTBOX-->>BUS: Publish CloudEvent
BUS-->>Consumers: Deliver event to subscribers (at-least-once)
- Outbox pattern ensures atomic persistence + event publish (transactional outbox + background dispatcher).
- Idempotent publish via `(changeSetId, etag)` de‑duplication at the dispatcher.
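A bare-bones sketch of that dispatcher loop; the `IOutboxStore` and `IBusPublisher` abstractions are assumptions, and real retry, batching, and circuit-breaking are omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Outbox record persisted in the same transaction as the config version.
public sealed record OutboxRecord(Guid Id, string ChangeSetId, string ETag, string CloudEventJson);

public interface IOutboxStore
{
    Task<IReadOnlyList<OutboxRecord>> PendingAsync(int batchSize, CancellationToken ct);
    Task<bool> AlreadyDispatchedAsync(string changeSetId, string etag, CancellationToken ct);
    Task MarkDispatchedAsync(Guid id, CancellationToken ct);
}

public interface IBusPublisher
{
    Task PublishAsync(string cloudEventJson, CancellationToken ct);
}

public sealed class OutboxDispatcher
{
    private readonly IOutboxStore _store;
    private readonly IBusPublisher _bus;

    public OutboxDispatcher(IOutboxStore store, IBusPublisher bus) =>
        (_store, _bus) = (store, bus);

    public async Task RunAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            foreach (var record in await _store.PendingAsync(100, ct))
            {
                // De-dup on (changeSetId, etag): a crash between publish and
                // mark-dispatched must not produce a second distinct event.
                if (await _store.AlreadyDispatchedAsync(record.ChangeSetId, record.ETag, ct))
                {
                    await _store.MarkDispatchedAsync(record.Id, ct);
                    continue;
                }
                await _bus.PublishAsync(record.CloudEventJson, ct);
                await _store.MarkDispatchedAsync(record.Id, ct);
            }
            await Task.Delay(TimeSpan.FromMilliseconds(250), ct);
        }
    }
}
```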
🧠 Consumer Contract (SDK)¶
All ECS client SDKs (.NET first) expose a uniform consumption model:
public interface IEcsChangeSubscriber
{
Task OnConfigurationChangedAsync(ConfigurationChangeEvent evt, CancellationToken ct);
}
// Usage:
await ecsSubscriber.SubscribeAsync(new EcsSubscription
{
TenantId = "t-001",
Namespace = "payments",
Selector = "payment.*", // glob / regex
StartingVersion = "2025.08.21+00020"
});
Consumer Responsibilities (enforced by SDK defaults; a minimal idempotent-apply sketch follows this list):
- Idempotency: persist `lastAppliedVersion` per `(tenantId, namespace, configKey)`; skip if `evt.configVersion <= lastApplied`.
- Transactional Apply: update local cache + notify live options atomically; on failure, retry with exponential backoff.
- Rehydration: on startup, call `GET /snapshot?...since=lastAppliedVersion`, then process backlog from the bus.
- Backpressure: apply bounded queues; expose `ecs_consumer_backlog_size`.
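The idempotent-apply rule above, as a minimal sketch with illustrative store and event types; versions compare ordinally, which assumes the zero-padded monotonic format shown elsewhere (e.g., `2025.08.21+00023`).

```csharp
using System.Threading;
using System.Threading.Tasks;

public interface IVersionStore
{
    Task<string?> GetLastAppliedAsync(string key, CancellationToken ct);
    Task SetLastAppliedAsync(string key, string version, CancellationToken ct);
}

public sealed record ConfigurationChangeEvent(
    string TenantId, string Namespace, string ConfigKey, string ConfigVersion);

public sealed class IdempotentApplier
{
    private readonly IVersionStore _versions; // persists lastAppliedVersion per key

    public IdempotentApplier(IVersionStore versions) => _versions = versions;

    public async Task<bool> TryApplyAsync(ConfigurationChangeEvent evt, CancellationToken ct)
    {
        var key = $"{evt.TenantId}|{evt.Namespace}|{evt.ConfigKey}";
        var last = await _versions.GetLastAppliedAsync(key, ct);
        // Ordinal compare works for zero-padded monotonic versions.
        if (last is not null && string.CompareOrdinal(evt.ConfigVersion, last) <= 0)
            return false; // duplicate or out-of-order delivery: skip, no side effects

        await ApplyToLocalCacheAsync(evt, ct);
        await _versions.SetLastAppliedAsync(key, evt.ConfigVersion, ct);
        return true;
    }

    private Task ApplyToLocalCacheAsync(ConfigurationChangeEvent evt, CancellationToken ct)
        => Task.CompletedTask; // cache update + options rebind would go here
}
```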
🧷 Delivery Semantics & Reliability¶
At‑Least‑Once Delivery¶
- Service Bus: max delivery attempts (e.g., 10), dead‑letter after exhaustion.
- Kafka: enable.auto.commit=false, commit offsets after successful apply.
Idempotency Patterns¶
| Layer | Mechanism |
|---|---|
| Publisher | Outbox de‑dup on (changeSetId, etag) |
| Transport | Deterministic key for partitioning |
| Consumer | lastAppliedVersion store + checksum verification |
Retries¶
- Publisher: dispatcher retries with jitter, circuit‑breakers to BUS.
- Consumer: SDK retries transient errors; poison events → DLQ with dead‑letter reason (`checksum_mismatch`, `authorization_revoked`, `schema_incompatible`).
Replay¶
- Replay API: `GET /events/stream?tenantId=...&fromVersion=...&toVersion=...` (signed, time‑boxed).
- Use cases: disaster recovery, blue/green cutover, warm caches.
🧱 Versioning Strategy¶
| Concept | Purpose | Implementation |
|---|---|---|
| ETag / If‑Match | Optimistic concurrency | W/"b9-1e9b" round‑trips on write |
| Monotonic Version | Ordering | yyyy.MM.dd+build (e.g., 2025.08.21+00023) |
| Semantic Version (optional) | Compatibility signals | 1.4.0 attached as configSemVer |
| Snapshots | Full state at point‑in‑time | GET /snapshot?tenantId&namespace |
| Deltas | Network efficiency | Event payloadType: Delta with patch ops (RFC6902 JSON Patch) |
| Schema Evolution | Safe change | Avro/JSON Schema; dataschema URI versioned; compat checks at publish |
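For `payloadType: Delta`, applying the RFC 6902 operations to a locally cached document can be done with `Microsoft.AspNetCore.JsonPatch`; the sketch below is one option, not necessarily the ECS SDK's mechanism.

```csharp
using System.Dynamic;
using Microsoft.AspNetCore.JsonPatch;
using Newtonsoft.Json;

public static class DeltaApplier
{
    // Deserialize the cached config as a dynamic object, then apply the
    // RFC 6902 operations carried in the ConfigurationChanged event.
    public static string ApplyDelta(string currentConfigJson, string jsonPatch)
    {
        var target = JsonConvert.DeserializeObject<ExpandoObject>(currentConfigJson)!;
        var patch = JsonConvert.DeserializeObject<JsonPatchDocument>(jsonPatch)!;
        patch.ApplyTo(target);
        return JsonConvert.SerializeObject(target);
    }
}
```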
Server Rules
- Breaking change detected? Require a `force=true&ack=...` header + elevated RBAC role + audit record.
- Automatically emit compat warning events when consumers with an old `consumerSchemaVersion` are detected (telemetry handshake).
🧳 Change Types¶
- `ConfigurationChanged` (delta/snapshot)
- `ConfigurationDeleted`
- `NamespacePolicyChanged` (RBAC/edition rules)
- `SecretRotated` (metadata only; values are out‑of‑band via Vault references)
- `BulkConfigurationChanged` (batch changes carry a list of keys and a single `changeSetId`)
🔐 Security & Integrity (Event Path)¶
- mTLS for gRPC; AMQP over TLS for Service Bus; SASL/SSL for Kafka.
- Claims in event metadata: `sub`, `roles`, `policyHash`.
- Signature: optional JWS detached signature of `data` for tamper detection (consumers validate against ECS JWKS).
- PII/Sensitive: events never include secret values — only Key Vault references & hashes.
🧠 In‑Process Hot Reload (Options Pattern)¶
- The ECS SDK plugs into the .NET `IOptionsMonitor<T>`, mapping config keys to strongly‑typed options.
- When an event arrives, the SDK rebinds options and triggers `OnChange` callbacks with debounce to prevent thrashing.
services.Configure<RetryPolicyOptions>(builder => builder
.BindFromEcs("payments", "payment.retryPolicy"));
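On the consuming side, reacting to a rebind goes through the standard `IOptionsMonitor<T>.OnChange` callback; a small sketch, where the options type and property are hypothetical:

```csharp
using System;
using Microsoft.Extensions.Options;

public sealed class RetryPolicyOptions
{
    public int MaxRetries { get; set; }
}

public sealed class RetryPolicyWatcher
{
    // Fires after the ECS SDK rebinds options from a ConfigurationChanged
    // event; burst debouncing happens upstream in the SDK.
    public RetryPolicyWatcher(IOptionsMonitor<RetryPolicyOptions> monitor)
    {
        monitor.OnChange(options =>
            Console.WriteLine($"Retry policy reloaded: maxRetries={options.MaxRetries}"));
    }
}
```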
📊 Observability for Eventing¶
- Spans: `ecs.publish`, `ecs.dispatch`, `ecs.consume`, `ecs.apply`.
- Metrics:
  - `ecs_events_published_total`, `ecs_events_consumed_total`
  - `ecs_consumer_lag_seconds` (per service/tenant)
  - `ecs_replay_requests_total`
  - `ecs_dropped_events_total` (with reason)
- Logs: structured with `changeSetId`, `configVersion`, `tenantId`, `namespace`, `configKey`.
🗺️ Routing & Filtering¶
Subscription Filters (Service Bus):
- SQL filters on `user.properties.tenantId`, `edition`, `namespace`, `configKey`.
- Rule examples:
  - `tenantId = 't-001' AND namespace = 'payments'`
  - `edition IN ('enterprise','pro') AND configKey LIKE 'feature.%'`
Kafka Consumers:
- Use predicate filter inside consumer callback (cheap check) before apply.
🧰 Failure Scenarios & Handling¶
| Scenario | Behavior |
|---|---|
| Consumer offline | On startup → snapshot + replay from last offset/version |
| Schema mismatch | Send to DLQ with schema_incompatible; emit CompatibilityAlert |
| Long‑running apply | SDK offloads apply to worker; ack only after commit; backpressure increases |
| Tenant revocation | ECS publishes AccessRevoked; SDK drops subscriptions & clears cache |
🧪 Test Matrix (Factory‑grade)¶
- Contract tests: CloudEvents schema validation; signature verification.
- Resilience tests: injected duplicates/out‑of‑order delivery; ensured idempotent apply.
- Load tests: 10k events/min, 3k consumers; measure consumer lag & mean apply latency.
- Chaos: bus outages, partial partitions; verify snapshot + replay convergence.
🧱 Reference Diagrams¶
Fan‑Out Topology¶
flowchart LR
ECS[ECS Outbox Dispatcher] -->|CloudEvents| ASB[(Service Bus Topic)]
ECS -->|CloudEvents| KAFKA[(Kafka Topic)]
ASB --> A1[Service A - payments]
ASB --> A2[Service B - billing]
KAFKA --> A3[Service C - gateway]
A1 --> Cache1[(Local Config Cache)]
A2 --> Cache2[(Local Config Cache)]
A3 --> Cache3[(Local Config Cache)]
Replay & Convergence¶
sequenceDiagram
participant Svc as Service Consumer
participant ECS as ECS API
participant Bus as Bus Topic
Svc->>ECS: GET /snapshot?since=2025.08.21+00020
ECS-->>Svc: Snapshot (Version=...22)
Svc->>Bus: Resume subscription (offset/version=...22)
Bus-->>Svc: Events (...23, ...24)
Svc->>Svc: Apply in order, idempotent
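Convergence relies on idempotent, in-order apply. A minimal sketch (names illustrative, single-threaded apply loop assumed) that tracks a high‑water mark per `(tenant, namespace)` and drops duplicates or stale versions:

using System;
using System.Collections.Generic;

// Replayed or duplicated events are ignored once their configVersion is at
// or below the recorded high-water mark, so replay converges deterministically.
public sealed class IdempotentApplier
{
    private readonly Dictionary<(string Tenant, string Ns), long> _applied = new();

    public bool TryApply(string tenant, string ns, long configVersion, Action apply)
    {
        var key = (tenant, ns);
        var last = _applied.TryGetValue(key, out var v) ? v : -1L;
        if (configVersion <= last)
            return false;              // duplicate or out-of-date: no side effects

        apply();                        // commit the change to the local cache/options
        _applied[key] = configVersion;  // advance the high-water mark
        return true;
    }
}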
📦 Deliverables¶
- ECS Outbox Dispatcher (.NET Worker) with ASB/Kafka providers.
- CloudEvents Contracts + JSON Schema v1 (`configuration-changed.json` etc.).
- .NET SDK: `SubscribeAsync`, `IOptionsMonitor` binding, idempotent store, snapshot + replay.
- gRPC Watch Service: `WatchConfig(WatchRequest) -> stream ConfigurationChangeEvent`.
- Ops Dashboards: consumer lag, publish rates, DLQ drilldowns.
- Conformance Tests: duplication, ordering, replay convergence.
🧱 Backlog → Azure DevOps (Epics/Features/Tasks)¶
Epic: ECS Eventing Backbone (ASB/Kafka)
- Feature: Outbox storage + dispatcher
  - Task: Outbox table & transactional write (NHibernate)
  - Task: Dispatcher with de‑dup by `(changeSetId, etag)`
  - Task: Retry policy & DLQ integration
- Feature: CloudEvents contracts & validators
  - Task: JSON Schema & contract tests
  - Task: JWS signing & JWKS rotation
- Feature: Service Bus topology as code
  - Task: IaC for topic, subscriptions, filters
  - Task: SLOs & alarms (publish/consume error rate)
Epic: ECS Consumer SDK (.NET)
- Feature: Subscription API + Options binding
  - Task: `IEcsChangeSubscriber` + host extensions
  - Task: Idempotency store (pluggable: memory/redis/sql)
  - Task: Snapshot & replay client
- Feature: Observability hooks
  - Task: OTEL spans & metrics
  - Task: Structured logs with `changeSetId`
Epic: gRPC Watch & Webhooks
- Feature: `WatchConfig` streaming service
  - Task: mTLS, auth interceptors, backpressure
- Feature: Webhook sender
  - Task: HMAC/JWS signing, retry with exponential backoff
Epic: Testing & Chaos
- Feature: Contract & resilience suite
  - Task: Duplicate/out‑of‑order injection tests
  - Task: Replay convergence E2E
  - Task: Load test harness & baseline SLOs
✅ Acceptance Criteria¶
- A config write results in a CloudEvent published to the bus within < 250 ms p50.
- Consumers that are offline converge using snapshot + replay with zero drift (checksums match).
- At‑least‑once guaranteed; duplicates handled without side effects (idempotent apply proven in tests).
- Versioning: monotonic `configVersion` increases are enforced; an ETag is required on updates.
- Telemetry shows publish/consume/lag metrics per tenant and service; DLQ entries carry actionable reasons.
🔭 Notes & Next Steps¶
- Align policy‑driven overrides with event filters (tenant/edition rules → subscription rules).
- Add a Config Rollback Event (`ConfigurationRolledBack`) with automatic replay to a target version.
- Prepare a multi‑region event replication plan (ASB geo‑disaster recovery / Kafka MirrorMaker).
ECS becomes a reactive, version‑aware, and verifiably reliable configuration backbone for the entire ConnectSoft ecosystem.
🚀 Caching & Performance¶
🎯 Goals¶
Design a caching and performance strategy that delivers sub‑30ms p95 resolve latency at factory scale, while preserving consistency, tenancy isolation, and deterministic resolution.
- In‑memory + distributed cache (Redis) with precise invalidation
- Snapshot lifecycle for fast cold‑start and replay
- Scalable read path (high QPS) and controlled write path (safe propagation)
🧱 Cache Architecture (multi‑tier)¶
| Tier | Scope | What it stores | TTL / Consistency | Purpose |
|---|---|---|---|---|
| L0 – Process cache | Per API pod | Resolved blobs by `(tenantId, env, namespace, selector)` + ETag | Short TTL (e.g., 3–10s) + SWR | Nanosecond access for hot prefixes |
| L1 – Redis | Cluster‑wide | Resolved blobs + prefix indexes; also recent snapshots | TTL 30–120s (per key) + explicit invalidation | Cross‑pod sharing, cuts DB pressure |
| L2 – DB | SQL/Cosmos | Authoritative bundles/versions | Strong read | Source of truth for cache misses |
Key prefixes
ecs:v1:resolved:{tenant}:{env}:{ns}:{selector} -> {etag, version, json}
ecs:v1:index:{tenant}:{env}:{ns} -> list of keys / etags
ecs:v1:snapshot:{tenant}:{env}:{ns}:{ver} -> frozen state (blob id/ref)
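Composing these keys deterministically matters for hit rate. A small, illustrative helper (class and method names are assumptions) that also applies selector compaction by trimming, lower‑casing, and sorting tags:

using System;
using System.Collections.Generic;
using System.Linq;

static class CacheKeys
{
    // Normalized selectors ensure two callers asking for the same logical
    // config land on the same Redis key.
    public static string NormalizeSelector(IEnumerable<string> tags) =>
        string.Join(",", tags.Select(t => t.Trim().ToLowerInvariant())
                             .OrderBy(t => t, StringComparer.Ordinal));

    public static string Resolved(string tenant, string env, string ns, IEnumerable<string> tags) =>
        $"ecs:v1:resolved:{tenant}:{env}:{ns}:{NormalizeSelector(tags)}";

    public static string Index(string tenant, string env, string ns) =>
        $"ecs:v1:index:{tenant}:{env}:{ns}";
}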
🔁 Read Path (hot/cold)¶
sequenceDiagram
participant S as Service SDK
participant API as ECS API
participant L0 as L0 Cache (memory)
participant R as Redis (L1)
participant DB as Store (SQL/Cosmos)
S->>API: Resolve(tenant, env, ns, selector, If-None-Match)
API->>L0: TryGet(etag)
alt Hit
L0-->>API: Resolved + etag
else Miss
API->>R: GET ecs:v1:resolved:...
alt Hit
R-->>API: Resolved + etag
API->>L0: Put
else Miss
API->>DB: Query bundles (G→E→T→S→tags)
API->>R: Set resolved + index
API->>L0: Put
end
end
API-->>S: 200/304 with etag
Techniques
- ETag + 304 to avoid payload transfer (a client sketch follows this list)
- SWR (stale‑while‑revalidate): serve slightly stale value while refreshing in background
- Selector compaction: normalize selectors (e.g., sorted tags) to maximize hit rate
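A hedged client-side sketch of the ETag + 304 technique; the `/resolve` route, query parameters, and class name are assumptions consistent with the sequence above:

using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public sealed class EcsResolveClient
{
    private readonly HttpClient _http;
    private string? _etag;
    private string? _cachedJson;

    public EcsResolveClient(HttpClient http) => _http = http;

    public async Task<string?> ResolveAsync(string tenant, string env, string ns, CancellationToken ct)
    {
        var request = new HttpRequestMessage(HttpMethod.Get,
            $"/resolve?tenantId={tenant}&env={env}&ns={ns}");
        if (_etag is not null)
            request.Headers.TryAddWithoutValidation("If-None-Match", _etag);

        using var response = await _http.SendAsync(request, ct);
        if (response.StatusCode == HttpStatusCode.NotModified)
            return _cachedJson; // 304: keep the locally cached resolved blob

        response.EnsureSuccessStatusCode();
        _etag = response.Headers.ETag?.Tag;
        _cachedJson = await response.Content.ReadAsStringAsync(ct);
        return _cachedJson;
    }
}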
🧨 Invalidation & Coherency¶
On publish:
- Write transaction commits new version
- Outbox → Event Bus emits `ConfigurationChanged`
- Cache invalidator:
  - Evicts `ecs:v1:resolved:*` keys whose prefix intersects the changed key/namespace
  - Bumps the namespace index to force L0 recalculation on the next read
Precision rules
- Maintain key→prefix map (stored in Redis) to target minimal eviction set
- Protect against stampedes with singleflight locks per key (a minimal sketch follows)
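A minimal per-key singleflight sketch (illustrative): concurrent callers for the same cache key share one in-flight resolve instead of stampeding the database.

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class SingleFlight<T>
{
    private readonly ConcurrentDictionary<string, Lazy<Task<T>>> _inFlight = new();

    public async Task<T> RunAsync(string key, Func<Task<T>> resolve)
    {
        // All concurrent callers for the same key observe the same Lazy, so
        // resolve runs exactly once per flight.
        var lazy = _inFlight.GetOrAdd(key,
            _ => new Lazy<Task<T>>(resolve, LazyThreadSafetyMode.ExecutionAndPublication));
        try
        {
            return await lazy.Value;
        }
        finally
        {
            _inFlight.TryRemove(key, out _); // allow a fresh resolve next time
        }
    }
}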
🧊 Snapshot Lifecycle¶
| Stage | Detail |
|---|---|
| Create | On schedule (e.g., hourly) or on demand; compute resolved state for (tenant, env, ns); store in Blob; index pointer in Redis ecs:v1:snapshot:* |
| Use | Cold start or replay → fetch latest snapshot pointer, hydrate L1/L0 quickly |
| Rotate | Keep last N (e.g., 72 hourly) per scope; delete older (configurable per edition) |
| Validate | Checksums of snapshot vs live resolution to detect drift |
| Promote | During incidents, mark a snapshot as fallback; read path returns snapshot when DB/Bus degraded |
API (read):
- `GET /snapshot?tenantId&env&ns[&version]` → blob ref + checksum
- The SDK can request a snapshot on boot, then subscribe to events
⚖️ Scaling Reads vs Writes¶
Reads (dominant):
- Scale API pods horizontally (HPA on RPS/latency)
- Redis clustering (hash slot spreading on `tenant|env|ns`)
- Hot‑tenant sharding: per‑tenant Redis db/index when necessary
- Compression (LZ4) for large resolved blobs to reduce network transfer
Writes (controlled):
- Gate through Admin API with If‑Match and idempotency keys
- Serialize heavy publish waves per namespace (per‑prefix semaphore)
- Outbox throughput scaling with workers; batch invalidations
🧪 Performance Targets¶
- p95 resolve latency: ≤ 30 ms (hot cache), ≤ 150 ms (cold)
- Cache hit ratio: ≥ 85% overall; ≥ 95% for top 20 namespaces
- Publish → visible (watchers): ≤ 3 s p95
- Redis ops: p50 < 2 ms; zero timeouts up to the 99.9th percentile
- DB read QPS reduction: > 90% vs no‑cache baseline
🧰 Protection & Resilience¶
- Thundering herd guard: per‑key singleflight + jittered backoff
- Adaptive TTLs: increase TTL when bus lag detected to protect DB
- Read‑degrade mode: serve last known snapshot on DB outage window
- Rate limiting: per‑tenant read QPS caps; publish rate caps
- Circuit breakers: around Redis/DB; fallback chain L0→snapshot
👩🔧 Tuning Playbook¶
- Low hit ratio?
  - Check selector cardinality; introduce prefix bucketing or coarser namespaces
  - Enable response compression for large maps
- High Redis CPU?
  - Increase sharding; switch to key hashing on `tenant|ns`
  - Raise the L0 TTL modestly (5→10s) for the hottest endpoints
- Stampedes after publish?
  - Stagger invalidation; use incremental delta apply for L0 refresh
  - Limit concurrent cold resolvers with a token bucket
🗺️ Diagram — Cache & Snapshot Topology¶
flowchart LR
API[Config API]:::svc -- L0 get/set --> L0[(Process Cache)]
API -- get/set --> R[(Redis L1 Cluster)]
API -- miss --> DB[(SQL/Cosmos)]
Snap[Snapshotter]:::wrk -- resolved->blob --> Blob[(Snapshots)]
Snap -- indexes --> R
Pub[Publisher]:::wrk -- events --> Bus[(Event Bus)]
Inv[Invalidator]:::wrk -- targeted evict --> R
R -- hydrate --> L0
classDef svc fill:#E0F2FE,stroke:#38BDF8;
classDef wrk fill:#FFE4E6,stroke:#FB7185;
✅ Acceptance Criteria¶
- AC‑1: Writes invalidate only affected prefixes; unrelated namespaces retain ≥ 95% hit ratio.
- AC‑2: Under 5k RPS sustained, p95 read latency ≤ 30 ms (hot), error rate < 0.5%.
- AC‑3: During DB outage drills (5 min), API serves from snapshots with no 5xx spikes, and emits degraded mode metric.
- AC‑4: Snapshot/restore produces byte‑identical resolved maps to live resolution for the same version.
- AC‑5: Cache stampede tests show singleflight success (no >3x concurrent DB hits for same key).
🔧 Backlog → Azure DevOps¶
Epic: L0/L1 Cache & Invalidation
- Feature: L0 in‑proc cache + SWR
- Feature: Redis schema & prefix strategy
- Feature: Precision invalidation worker
- Tasks: key maps, singleflight, compression, TTL policy
Epic: Snapshot Lifecycle
- Feature: Snapshotter worker & checksums
- Feature: Fallback mode & API
- Tasks: Blob storage model, retention policy, drift checker
Epic: Performance & Scale
- Load test harness (Locust/K6)
- Cache hit & latency dashboards
- Auto‑tuning hooks (adaptive TTL, backoff)
📌 Notes¶
- Consider optional edge cache (sidecar or local Redis) for ultra‑low latency per node.
- For very large tenants, introduce namespace partitioning and hierarchical snapshots (per sub‑namespace).
ECS’s read path becomes fast, predictable, and resilient, and the write path safely fans out with precise cache coherence.
🛡️ Resiliency & Fault Tolerance¶
Goal: ensure ECS continues to serve safe, correct, and timely configuration under partial failures, regional incidents, and dependency degradation — without violating tenant isolation, security policies, or observability guarantees.
🎯 Objectives¶
- Survive transient and regional faults with graceful degradation (read-mostly mode).
- Protect dependent services via bounded retries, circuit breakers, bulkheads.
- Fail fast on unsafe paths; serve stale-but-safe configs when allowed by policy.
- Prove resilience with chaos drills, SLAs/SLOs, and automated failover runbooks.
🧨 Failure Model (ECS)¶
| Surface | Typical Faults | Primary Mitigations |
|---|---|---|
| Read path (GET config) | Redis miss, origin DB latency, network partitions | Local snapshot, Redis cluster, hedged reads, timeouts |
| Write path (PUT/PATCH) | DB leader loss, quorum fail, version conflict | Optimistic concurrency, idempotent writes, queue-backed commit |
| Eventing (change notifications) | Service Bus/Kafka outage, consumer lag | Outbox + retry, backfill from ledger, idempotent subscribers |
| AuthN/Z | IdP outage, token validation latency | Token cache, STS fallback keys, mTLS pinning |
| Secrets | Key Vault throttling | Per-tenant secret cache, exponential backoff, jitter |
| Region | AZ/region failure | Active–active reads, active–standby writes, DNS/AFD failover |
🧱 Resilience Architecture¶
flowchart LR
Client((Service)) -->|gRPC/REST| Edge[ECS API Gateway]
Edge --> CB{Circuit\nBreaker}
CB --> L1[(In-Proc Snapshot Cache)]
L1 -->|miss| L2[(Redis Cluster)]
L2 -->|miss| Origin[(Cosmos DB / Postgres Primary)]
Origin --> Ledger[(Change Ledger / Event Outbox)]
Ledger --> Bus[(Service Bus / Kafka)]
Bus --> Subscribers[[Microservices/Agents]]
subgraph Regional Pair
Origin---OriginReplica[(Geo-Replicated Read)]
Redis[(Redis)]---RedisReplica[(Geo-Replica)]
end
- Reads: L1 (in-proc) → L2 (Redis) → Origin (Cosmos/PG). Serve stale snapshot within TTL if origin slow.
- Writes: Single-writer per partition (tenant/namespace). Outbox pattern persists change, publishes event.
- Change propagation: At-least-once from outbox → bus; consumers are idempotent (version vector, ETag).
🗂️ Fallback & Degradation Policies¶
| Scenario | ECS Behavior | Notes |
|---|---|---|
| Origin slow (>p95 threshold) | Stale-OK read from Redis or local snapshot if `policy.allowStale=true` and TTL valid | Emit `degraded_mode=true` metric, add `X-Config-Stale-Age` |
| Redis unavailable | Skip L2, fall back to local snapshot; increase hedged read to origin | Shorter timeouts to avoid threadpool exhaustion |
| Bus outage | Writes commit to origin + outbox table; background publisher replays when bus recovers | Consumers dedupe by (tenantId, key, version) |
| Tenant-scoped incident | Trip per-tenant breaker; serve last good tenant snapshot; block writes for tenant | Prevents blast radius |
| Global auth outage | Honor cached token validations (bounded TTL); keep mTLS; reduce JWK rotations temporarily | Never bypass RBAC |
⚙️ .NET Resilience Profile (Polly)¶
// Resilience profile for calls to the ECS origin: a hard 300 ms timeout,
// three retries with exponential backoff + jitter, and a circuit breaker
// that opens for 30 s after five consecutive handled failures.
services.AddHttpClient("EcsOrigin")
    .AddPolicyHandler(Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromMilliseconds(300)))
    .AddPolicyHandler(Policy<HttpResponseMessage>.Handle<Exception>()
        .WaitAndRetryAsync(3, retry => TimeSpan.FromMilliseconds(50 * Math.Pow(2, retry)) + Jitter()))
    .AddPolicyHandler(Policy<HttpResponseMessage>.Handle<Exception>()
        .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)));

// Random jitter (0–40 ms) decorrelates retry storms across replicas.
static TimeSpan Jitter() => TimeSpan.FromMilliseconds(Random.Shared.Next(0, 40));
- Bulkheads per endpoint to cap concurrent origin calls (sketch below).
- Hedged requests (optional): fire a second read to replica after p95 latency.
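A minimal bulkhead sketch using Polly's bulkhead policy; the limits are illustrative, not tuned values. It can be attached per named client via `.AddPolicyHandler(bulkhead)` alongside the handlers above.

using System.Net.Http;
using Polly;

// Caps concurrency for one endpoint: at most 16 in-flight origin calls,
// with up to 32 queued before additional callers are rejected fast.
var bulkhead = Policy.BulkheadAsync<HttpResponseMessage>(
    maxParallelization: 16,
    maxQueuingActions: 32);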
🧾 Config Snapshot Lifecycle¶
- Acquire: On a successful read from origin, build `ConfigSnapshot { tenantId, ns, version, etag, data, capturedAt }` (typed sketch below).
- Store: L1 (memory) + L2 (Redis, key: `cfg:{tenant}:{ns}:{version}`) with TTL & size guard.
- Serve: Prefer the latest `version`; if origin fails, use the highest valid snapshot within the policy TTL.
- Invalidate: On change event → purge L1, update L2; add `etag` to prevent stale overwrite.
- Audit: Record snapshot usage (fresh vs stale) with `traceId`, `tenantId`.
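A hedged sketch of the snapshot shape named in the Acquire step; property types are assumptions, and `Data` could equally be a raw JSON blob.

using System;
using System.Collections.Generic;

// Immutable snapshot of one (tenant, namespace) at a given version.
public sealed record ConfigSnapshot(
    string TenantId,
    string Ns,
    long Version,
    string Etag,
    IReadOnlyDictionary<string, string> Data,
    DateTimeOffset CapturedAt);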
🔁 Idempotency & Versioning¶
- Writes: Require `If-Match: <ETag>` or `version`. On conflict → `409` with a pointer to the latest version (client sketch below).
- Events: Include `(aggregateId, version)`; consumers ignore events already applied.
- Replay: On recovery, the publisher replays the outbox by `createdAt` and `notified=false`.
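A hedged client sketch of the optimistic-concurrency write using `If-Match`, surfacing `409` conflicts to the caller; the `/config/{key}` route and class name are assumptions.

using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class EcsWriter
{
    public static async Task<bool> TryUpdateAsync(
        HttpClient http, string key, string etag, string json, CancellationToken ct)
    {
        var request = new HttpRequestMessage(HttpMethod.Put, $"/config/{key}")
        {
            Content = new StringContent(json, Encoding.UTF8, "application/json")
        };
        request.Headers.TryAddWithoutValidation("If-Match", etag);

        using var response = await http.SendAsync(request, ct);
        if (response.StatusCode == HttpStatusCode.Conflict)
            return false; // 409: re-read the latest version + ETag, then retry with backoff

        response.EnsureSuccessStatusCode();
        return true;
    }
}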
🧪 Chaos Testing Scenarios¶
| Area | Fault Injection | Expected Outcome |
|---|---|---|
| Cache | Kill Redis primary; network partition | Stale reads from L1; perf dip within SLO; no 5xx spikes |
| Origin | 500/timeout storm; leader failover | Stale-OK reads; write breaker trips → read-only mode banners |
| Bus | Drop topic; throttle | Outbox backlog grows; no lost events; catch-up within RTO |
| Auth | JWK endpoint down | Cached keys honored; no auth bypass |
| Region | Simulate regional fail | Traffic manager to secondary reads; write drain & promote in ≤ RTO |
Schedule GameDays monthly; record hypotheses, metrics, and remediations.
📈 SLOs & SLA Envelope (Proposed)¶
| Dimension | Target | Notes |
|---|---|---|
| Availability (reads) | 99.99% monthly | With stale-OK serving |
| Availability (writes) | 99.9% monthly | May block during conflict/region failover |
| p95 read latency | ≤ 20ms (cache hit), ≤ 120ms (origin) | Per-tenant |
| Event propagation | ≤ 3s p95 | From commit to first consumer delivery |
| RPO | ≤ 30s | Config ledger + cross-region replication |
| RTO (regional) | ≤ 15 min | Promote secondary; re-point writers |
| Error budget (reads) | 4m 22s / month | Tracked in SRE dashboard |
Contracts: SLA doc states credit policy if monthly availability below target; per-tenant SLOs are observable (dashboards & reports).
🌍 Failover Architecture¶
- Reads: Active–Active (multi-region replicas for origin + Redis replica); DNS/AFD/Envoy locality.
- Writes: Active–Standby per partition (tenant/namespace). Single-writer enforced by lease (Cosmos) or advisory lock (PG).
- Data:
  - Cosmos DB: multi-region write (optional) with conflict resolver = highest version.
  - PostgreSQL: logical replication; promote with pg_auto_failover; ensure write fences during switchover.
- Secrets: Key Vault geo-redundant, soft-delete + purge protection; client caches secrets with TTL.
sequenceDiagram
participant Client
participant ECS
participant Redis_Primary
participant DB_Primary
participant DB_Secondary
Client->>ECS: GET /config
ECS->>Redis_Primary: TryGet
alt miss/timeout
ECS->>DB_Primary: Read
DB_Primary--xECS: timeout
ECS->>DB_Secondary: Hedged Read
ECS-->>Client: 200 (stale-ok), X-Config-Stale-Age
else hit
ECS-->>Client: 200 (fresh)
end
🧩 Read/Write Mode Matrix¶
| Mode | Reads | Writes | Trigger | Recovery Signal |
|---|---|---|---|---|
| Normal | Fresh | Allowed | Healthy deps | N/A |
| Degraded | Stale-OK | Allowed | Origin latency > threshold | Latency normalizes |
| Read-Only | Stale-OK | Blocked | Origin unavailable; version conflicts | DB healthy + catch-up complete |
| Failover | Fresh via secondary | Allowed after promote | Region incident | Health gates + leader elected |
🔔 Telemetry & Alerts (Resilience Signals)¶
- Counters: `config_stale_served_total{tenantId}`, `outbox_backlog_size`, `breaker_open_total{scope}`.
- Gauges: `snapshot_age_seconds`, `publish_lag_seconds`.
- SLO: burn-rate alerts at 2%/1h and 5%/6h on read availability.
- Events: `DegradedModeEntered`, `ReadOnlyModeEntered`, `FailoverStarted`, `FailoverCompleted`.
🧰 Ops Runbooks (abridged)¶
- Degraded Mode: Identify hotspot tenants → increase per-tenant TTL; ensure Redis is healthy; verify breaker state.
- Write Conflicts: Inspect conflict keys; return `409` with the latest ETag; advise client retry with backoff.
- Bus Backlog: Scale publishers/partitions; verify outbox replay; validate consumer idempotency.
- Regional Failover: Freeze writers; promote the secondary; run consistency checks; unfreeze by partition.
🔐 Safety & Compliance¶
- Stale reads respect edition/tenant policy overlays; never cross-tenant.
- No secret material in snapshots; secrets are references resolved at call time with cache TTL.
- All degraded/read-only decisions are audited (`who`, `why`, `since`).
✅ Deliverables¶
- Resilience ADRs: cache-first reads, outbox + at-least-once, active–standby writes.
- .NET resilience profile (Polly) and configuration schema (typed options sketch below):
  `resilience: { allowStale: true, staleTtlSec: 120, readTimeoutMs: 300, retries: 3, breaker: { failCount: 5, breakSec: 30 } }`
- Chaos plan & scripts (fault maps), SLO dashboards, runbooks.
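A hedged sketch of strongly typed options mirroring that schema; the class and property names are assumptions, intended to be bound from a `resilience` configuration section.

// Mirrors the resilience schema above for use with the Options pattern.
public sealed class ResilienceOptions
{
    public bool AllowStale { get; set; } = true;
    public int StaleTtlSec { get; set; } = 120;
    public int ReadTimeoutMs { get; set; } = 300;
    public int Retries { get; set; } = 3;
    public BreakerOptions Breaker { get; set; } = new();

    public sealed class BreakerOptions
    {
        public int FailCount { get; set; } = 5;
        public int BreakSec { get; set; } = 30;
    }
}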
🔜 Epics / Azure DevOps (seed)¶
- ECS-RES-01: Implement snapshot cache (L1/L2) with TTL & audit.
- ECS-RES-02: Outbox publisher + idempotent consumer SDK.
- ECS-RES-03: Circuit breaker/bulkhead/hedged reads middleware.
- ECS-RES-04: Degraded/Read-only mode controller + health endpoints.
- ECS-RES-05: Chaos suite & monthly GameDay pipeline.
- ECS-RES-06: Multi-region failover automation + runbooks.
- ECS-RES-07: SLO burn-rate alerts & resilience dashboard.
With these guards, ECS remains predictable under pressure, protecting downstream services while providing observability and control to ops and agents.
🔄 Deployment, CI/CD & DevOps Enablement¶
🎯 Goals¶
Provide a repeatable, secure, observable path to deliver ECS across dev → test → staging → prod, with immutable packaging, automated rollouts/rollbacks, and zero‑downtime upgrades.
📦 Packaging (Artifacts & IaC)¶
| Artifact Type | Tooling | Purpose | Notes |
|---|---|---|---|
| Container Image | Dockerfile → ACR | Runtime unit for API/Workers | Signed, SBOM attached, Trivy‑scanned |
| K8s Charts | Helm | ECS components (API, Streamer, Workers) | Values overlays per env/tenant |
| Azure Infra | Bicep / Terraform / Pulumi | AKS, ACR, Key Vault, SQL/PG, Redis, Service Bus | Policy‑as‑code (Azure Policy/OPA) |
| Migrations | EF Core / Flyway | DB schema & data migrations | Forward‑only + rollback plan |
| Release Bundle | .tar.gz | Helm chart + values + migration scripts + release notes | SemVer tag (e.g., `ecs-1.6.3`) |
Helm values overlays
/deploy/helm/values-dev.yaml
/deploy/helm/values-test.yaml
/deploy/helm/values-staging.yaml
/deploy/helm/values-prod.yaml
🔁 Dev/Test/Prod Rollout¶
flowchart TD
A[Commit: src + IaC] --> B[CI: Build & Scan]
B --> C[Push: ACR + Chart Repo]
C --> D[CD: Dev Deploy]
D --> E[Smoke + Contract Tests]
E --> F[Test/Staging Deploy]
F --> G[Perf & Chaos Gates]
G --> H[Manual Approval]
H --> I[Prod: Blue/Green or Canary]
I --> J[Post‑deploy Verification + Auto Rollback if failing]
J --> K[Tag + Audit + Release Notes]
Rollout strategies
- Blue/Green for API pods (Envoy/AGIC switch on health).
- Canary (e.g., 10% → 50% → 100%) guarded by SLO checks (p95 latency, error rate, consumer lag).
- KEDA scales workers by outbox depth / schedule (snapshotter).
🧬 Versioning & Migration Automation¶
Semantic Versioning
- `MAJOR.MINOR.PATCH` for ECS; backward‑compatible APIs within MINOR.
- Configuration schema versions tracked in the repo (JSON Schema URIs).
DB Migrations
- EF Core/Flyway run as pre‑install Helm hooks:
  - `hook: pre-install, pre-upgrade` → `migration-job`
  - Idempotent; write a migration ledger to SQL.
- On failure: abort upgrade, auto‑rollback to previous chart, publish incident event.
Config Schema Evolution
- Contracts validated at publish time; compatibility checks (a breaking change requires `force=true` + elevated RBAC).
- Data backfills via a worker job (post‑upgrade hook), observable via `ecs_backfill_pending`.
Zero‑Downtime Policy
- API pods roll with `maxUnavailable=0`.
- Sticky read cache retained; consumers use ETag + watchers to avoid reload storms.
🧪 CI/CD Pipelines (Azure DevOps YAML — excerpt)¶
stages:
- stage: CI
  jobs:
  - job: build
    steps:
    - task: Docker@2
      inputs: { command: buildAndPush, repository: ecs/api, tags: $(Build.BuildNumber) }
    - task: HelmInstaller@1
    - script: helm lint deploy/helm/ecs
    - task: TrivyScan@1
    - task: PublishBuildArtifacts@1

- stage: CD_Dev
  dependsOn: CI
  jobs:
  - deployment: dev
    environment: ecs-dev
    strategy:
      runOnce:
        deploy:
          steps:
          - script: helm upgrade --install ecs deploy/helm/ecs -f deploy/helm/values-dev.yaml --set image.tag=$(Build.BuildNumber)
          - script: ./ops/smoke.sh https://ecs-dev.internal

- stage: CD_Prod
  dependsOn: CD_Staging
  jobs:
  - deployment: prod
    # Manual approval is configured as a check on the ecs-prod environment.
    environment: ecs-prod
    strategy:
      canary:
        increments: [10, 50, 100]
        deploy:
          steps:
          - script: helm upgrade --install ecs ...
          - script: ./ops/verify-slo.sh --latency-p95 30 --errors 0.5
        on:
          failure:
            steps:
            - script: ./ops/rollback.sh
🔒 Security & Policy Gates (shift‑left)¶
- Image & chart scanning (Trivy/Grype) — block on HIGH/CRITICAL.
- IaC checks (Checkov/OPA) — deny insecure networking, public KV, missing TLS.
- Secrets: all values sourced from Key Vault references; no secrets in values files.
- Sign & verify: Cosign images; Helm chart provenance (`.prov`).
📊 Release Observability¶
- Pipeline emits deployment events with `traceId`, version, and change set.
- Golden signals checked during canary: `ecs_resolve_latency_ms` p95, 5xx rate, `ecs_event_fanout_lag_ms`.
- Auto‑rollback if thresholds are breached over a 5–10 minute window.
- Release notes generated from commits + PR labels (features/fixes/breaking).
🗂️ Repo & Environment Layout (suggested)¶
/src/ecs-api
/src/ecs-workers
/sdk/dotnet
/deploy/helm/ecs
/deploy/bicep|tf|pulumi
/ops/scripts (smoke, verify-slo, rollback, snapshot-restore)
/migrations (db, schema)
/docs/hld
✅ Acceptance Criteria¶
- AC‑1: One‑button pipeline promotes dev → test → staging → prod with signed artifacts and policy gates.
- AC‑2: Zero‑downtime upgrade verified (no 5xx spikes > 0.5% and p95 ≤ targets during rollout).
- AC‑3: Failed canary auto‑rolls back to last healthy release; audit entries recorded.
- AC‑4: Migrations are idempotent, logged, and observable; rollback plan documented and tested.
- AC‑5: All secrets consumed via Key Vault references; pipeline blocks on any hardcoded secret detection.
🔧 Backlog → Azure DevOps¶
Epic: Packaging & IaC
- Feature: Helm chart + env overlays
- Feature: Bicep/Terraform/Pulumi modules for AKS, Redis, SQL/PG, ASB, KV
- Task: Chart provenance + image signing
Epic: Pipelines & Gates
- Feature: CI build/scan/sign
- Feature: CD with canary/blue‑green + auto‑rollback
- Feature: SLO verifiers & deployment events
Epic: Migrations & Schema Evolution
- Feature: Migration runner (hooks) + ledger
- Feature: Config schema compat checker
- Feature: Backfill job + observability
Epic: Ops Tooling
- Feature: Smoke/verify/rollback scripts
- Feature: Release notes generator
- Feature: Disaster‑recovery playbook (snapshot restore)