📋 ConnectSoft SaaS Factory — High-Level Design (HLD)¶

The ConnectSoft SaaS Factory is a generic, template-driven platform for creating SaaS solutions quickly, securely, and consistently.

Unlike traditional projects where each product is designed from scratch, the factory provides paved roads: pre-built architectural templates, infrastructure packs, and automation workflows that allow product teams to focus on business value instead of reinventing technical foundations.

Core pillars:

Cloud-Native & Event-Driven: Designed for scalability and resilience.
Security-First & Compliance-by-Default: Guardrails embedded in every template.
Multi-Tenant & Edition-Aware: Monetization and isolation strategies built in.
AI-First Orchestration: Semantic Kernel and agentic workflows accelerate delivery.
Observability Everywhere: Traces, logs, metrics included out of the box.

Target Benefits:

Cut time-to-market for new SaaS offerings from months to days.
Guarantee consistency across solutions, reducing operational cost.
Ensure regulatory compliance from the first deployment.
Enable innovation by freeing teams from repetitive platform work.

Introduction¶

Context¶

Organizations repeatedly face the same engineering challenges when building SaaS solutions: identity, tenancy, observability, compliance, billing, and extensibility. These components are not differentiators, yet they consume the majority of engineering time.

The ConnectSoft SaaS Factory addresses this by codifying best practices into reusable assets — templates, IaC modules, service packs, and architectural blueprints — that teams can compose into product recipes.

Audience¶

This HLD is intended for:

Enterprise Architects: to understand platform scope and evolution.
Engineering Leads & Developers: to apply templates and patterns.
Product & Business Stakeholders: to validate alignment with goals.
Operations & Security Teams: to verify compliance and guardrails.

Scope of Document¶

What this is: A high-level design of a SaaS Factory that produces SaaS solutions.
What this is not: A low-level implementation guide or a one-off SaaS product design.

Vision, Outcomes & Scope¶

Purpose¶

The ConnectSoft SaaS Factory exists to provide a generic, reusable platform for building and operating SaaS solutions. Instead of treating each SaaS product as a bespoke project, the factory offers a set of templates, architecture patterns, and automation packs that allow teams to bootstrap new SaaS offerings with speed, quality, and compliance already embedded.

This factory approach ensures that:

Time-to-market is drastically reduced through pre-built blueprints.
Consistency across solutions is maintained, lowering operational overhead.
Security and compliance are guaranteed by default, not bolted on later.
Observability is always enabled, allowing proactive operations.
Multi-tenancy and edition models are first-class, making monetization and scaling straightforward.

Problem Space¶

Modern SaaS builders face common challenges:

Repetition of patterns (auth, tenancy, billing, observability) across every product.
Time lost in rebuilding baseline infrastructure and compliance controls.
Inconsistent quality when different teams make different architectural choices.
High barrier to scaling due to lack of standardized multi-tenancy, resilience, and observability.
Slow compliance adoption (GDPR, HIPAA, SOC2) when not embedded from the start.

The SaaS Factory addresses these by offering paved roads: highly opinionated but extensible blueprints.

Target Value Map¶

Dimension	Value Proposition	Example Metric
Speed	Reduce new SaaS product setup from months to days.	Time from product idea → first running environment.
Consistency	Unified architecture, code templates, IaC.	% of services following factory patterns.
Compliance	Security & privacy controls built-in.	% of audits passed without major findings.
Scalability	Ready-to-scale multi-tenant core.	Number of tenants supported per cluster/node.
Innovation	AI-first orchestration and automation.	% of backlog items delivered by AI-generated scaffolds.

High-Level OKRs¶

Objective 1: Reduce time-to-market for SaaS products.

KR1: 80% of new products bootstrapped with factory templates within 1 day.
KR2: First production tenant live within 2 weeks of project kickoff.

Objective 2: Ensure security, compliance, and observability by default.

KR1: 100% of generated services include OTel, logging, metrics.
KR2: 100% of generated services use workload identity + secret-less design.
KR3: ≥ 95% compliance checklist coverage at go-live.

Objective 3: Provide consistent developer experience.

KR1: 90% of developers report positive DX (survey).
KR2: Mean time to onboard a new engineer ≤ 2 days.

Explicit Non-Goals¶

Not a one-off platform: This is a factory, not a custom-built product for a single SaaS.
Not vendor-locked: Although Azure is primary, abstractions exist for other clouds.
Not monolithic: No “mega-service”; all blueprints enforce microservices and modular design.
Not compliance-by-documentation: Compliance is enforced technically (policy gates, guardrails), not left to paperwork.
Not a silver bullet: The factory does not replace product-specific business logic; teams still own domain innovation.

Guardrails¶

Security & observability cannot be disabled.
Multi-tenancy and editioning are always enforced.
ADRs required for deviations from factory templates.
Paved road approach: 80% covered by factory defaults, 20% extensible through overrides.

Personas, Tenants & Editions¶

Personas¶

The platform supports multiple personas across different layers of the SaaS lifecycle. These are not tied to one product but apply generically across all SaaS solutions generated by the factory:

Tenant Administrator: Responsible for onboarding, user management, and tenant-level configuration. Uses self-service tooling to manage editions, roles, and features.
End User: Consumes product functionality within the context of their tenant. Interacts only with the features and entitlements assigned by edition and policy.
Billing & Finance Operator: Oversees subscriptions, invoices, and payments across tenants. Works with dynamic billing models and edition-based monetization.
Support Agent: Provides operational and customer support. Needs tenant-specific observability and diagnostic views across editions.
Platform Operator (SRE/DevOps): Maintains infrastructure, enforces global guardrails, and ensures SLA adherence across multiple SaaS solutions. Focuses on tenant isolation, scaling, and compliance.
Product Owner / Engineering Teams: Define new products, editions, and feature packs inside the system dynamically. Responsible for extending templates to create differentiated SaaS offerings.

Tenant Archetypes¶

Different tenant profiles are supported dynamically, with no need to hardcode rules in the factory:

Trial / Evaluation Tenant: Activated on sign-up, bound to a Free edition, strict quotas. Ideal for fast adoption and funnel growth.
Growth Tenant (SMB / Mid-Market): May purchase Standard or custom editions. Needs integration with external identity, moderate quotas, and cost-effective plans.
Enterprise Tenant: Assigned to Enterprise edition or a custom product pack. Requires SSO, SCIM provisioning, audit exports, custom compliance, and region-specific data residency.

Editions¶

The factory does not limit editions to a fixed set; editions are defined dynamically in the target product configuration:

Free Edition (default template): Minimal feature set, reduced quotas, short SLA. Used as a baseline template for trials.
Standard Edition (default template): Full baseline capability pack, medium quotas, typical SLA.
Enterprise Edition (default template): Advanced feature pack, premium quotas, enterprise SLA.
Custom Editions: Created dynamically by product owners in the system, composed of features, quotas, and entitlements. Example: Healthcare Pack, FinTech Pack.

All editions are described as metadata objects stored in the platform and enforced by policy engines at runtime.

Tenant × Edition × Capability Matrix¶

Because editions are dynamic, the following matrix represents the default templates provided by the factory. Product owners may extend or override it.

Capability	Free	Standard	Enterprise	Custom (Dynamic)
Multi-Tenancy Isolation	✓	✓	✓	✓
Identity (OIDC/SSO)	–	✓	✓ (SCIM)	Configurable
API Rate Limit (req/min)	60	600	3000	Configurable
Feature Flags	Basic	Standard	Advanced	Configurable
Observability Dashboards	Shared	Tenant	Advanced	Configurable
Data Residency Choice	–	–	✓	Configurable
SLA / Support	–	8×5	24×7	Configurable
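
As a minimal sketch of how the default matrix could be expressed as metadata, assume a hypothetical `Edition` record and a hypothetical `compose_custom_edition` helper (neither name comes from the factory itself). The limits are the defaults from the table above; the field names are illustrative only.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Edition:
    """Hypothetical edition metadata object; field names are illustrative."""
    name: str
    rate_limit_rpm: int            # API rate limit (requests/minute)
    sso: bool                      # Identity (OIDC/SSO)
    scim: bool                     # SCIM provisioning
    data_residency_choice: bool
    support: Optional[str]         # None means no SLA/support tier


# Default templates, mirroring the matrix above.
DEFAULT_EDITIONS = {
    "free": Edition("Free", 60, sso=False, scim=False,
                    data_residency_choice=False, support=None),
    "standard": Edition("Standard", 600, sso=True, scim=False,
                        data_residency_choice=False, support="8x5"),
    "enterprise": Edition("Enterprise", 3000, sso=True, scim=True,
                          data_residency_choice=True, support="24x7"),
}


def compose_custom_edition(base: str, name: str, **overrides) -> Edition:
    """Derive a custom edition from a default template plus overrides."""
    return replace(DEFAULT_EDITIONS[base], name=name, **overrides)


# Example: a custom pack that starts from Standard and adds residency choice.
healthcare = compose_custom_edition(
    "standard", "Healthcare Pack", data_residency_choice=True
)
```

Because the records are frozen, a custom edition is always a new object derived from a template, so the defaults themselves are never mutated by a product override.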

Top Use Cases¶

Dynamic Product Creation: Product owner defines a new product, editions, and entitlements via admin console or API. The factory applies baseline templates and stores metadata.
Tenant Onboarding: Tenant signs up and is provisioned with the default Free edition. Admin can later upgrade or assign a custom edition without re-deployment.
Edition Upgrade / Downgrade: Tenant transitions dynamically between editions (e.g., Free → Standard, Standard → Enterprise) without migration downtime.
Custom Edition Rollout: A healthcare SaaS product defines a Healthcare Pack with HIPAA compliance, custom storage policies, and audit logging. Tenants can subscribe to this edition dynamically.
Tenant-Specific Support: Support agent views tenant entitlements, edition policies, and quotas to resolve feature-availability questions quickly.

Strategic Domains & Context Map¶

Purpose¶

Define the bounded contexts that compose the generic SaaS platform, clarify their responsibilities and collaboration style, and set the principles for clean integration across domains. The goal is to maximize autonomy, enable evolution without ripple effects, and keep multi-tenancy, security, and observability as first-class concerns within each boundary.

Bounded Contexts (overview)¶

Context	Core Responsibility	Primary Interfaces	Persistence (default)	Tenancy Notes
SaaS Core Metadata	System-of-record for Products, Editions, Features, Entitlements, Quotas, Pack composition	REST (admin), Events	Azure SQL	Enforces product/edition metadata; emits entitlement updates
Identity	OIDC provider (OpenIddict/AAD), Users, Roles, Scopes, Service Principals	OIDC/OAuth2, REST, Events	Azure SQL	Token carries `tenantId`, `edition`, `entitlements` claims
Tenant Management	Tenant lifecycle (provision, activate, suspend, delete), Region/Residency, Tenant config bootstrap	REST, Events	Azure SQL	Owns `tenantId`; author of cross-context tenant events
Billing	Plans, Subscriptions, Invoicing, Usage Rating; Payment provider integrations (via ACL)	REST/Webhooks, Events	Azure SQL	Per-tenant subscription state and entitlements sync
Configuration (Config & Feature Flags)	Dynamic flags, settings, edition overrides, kill-switches	REST, Events	Azure SQL + Redis	Per-tenant/edition flag resolution; policy guardrails
Usage & Metering	API call meters, storage/compute consumption, quota enforcement	REST (read), Events (write)	Azure SQL (agg) + cold storage	Emits usage events to Billing and Observability
Notifications	Email/SMS/Webhook dispatch, templates, tenant branding	REST, Events	MongoDB (templates) + Queue	Tenant-scoped channels; signed webhooks
Audit & Compliance	Immutable audit trail, access logs, policy decisions, exports	REST (query), Events (append)	Azure SQL (append-only)	Cross-tenant isolation; long retention & eDiscovery
AI Orchestration	Agentic flows (Semantic Kernel), scaffolding, docs/tests generation, safe tool use	REST, Jobs, Events	Blob/Queue for artifacts	Runs under least-privileged scopes; auditable tools

Notes:

• Storage choices are defaults; contexts can swap with portable equivalents.

• Each context publishes canonical events with required headers (traceId, tenantId, edition, schemaVersion).
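
One way to picture the required headers is an envelope like the following. This is a sketch under assumptions: the envelope layout and the `build_event` and `missing_headers` helpers are hypothetical; only the four header names come from the note above.

```python
import uuid
from datetime import datetime, timezone

# Header names taken from the note above.
REQUIRED_HEADERS = ("traceId", "tenantId", "edition", "schemaVersion")


def build_event(event_type: str, payload: dict, *, trace_id: str,
                tenant_id: str, edition: str, schema_version: int) -> dict:
    """Wrap a domain fact in an envelope carrying the required headers."""
    return {
        "id": str(uuid.uuid4()),  # assumed: unique message id for deduplication
        "type": event_type,
        "occurredAt": datetime.now(timezone.utc).isoformat(),
        "headers": {
            "traceId": trace_id,
            "tenantId": tenant_id,
            "edition": edition,
            "schemaVersion": schema_version,
        },
        "payload": payload,
    }


def missing_headers(event: dict) -> list:
    """Return the required headers that are absent or empty."""
    headers = event.get("headers", {})
    return [h for h in REQUIRED_HEADERS if headers.get(h) in (None, "")]
```

A consumer could call `missing_headers` before processing and reject (or dead-letter) any event for which the result is non-empty.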

Integration Styles & Collaboration¶

Event-first collaboration: Domain facts published as immutable events; subscribers build local read models or kick off workflows.
REST for command/query where necessary: CRUD/admin operations and strongly consistent reads remain RESTful within the boundary.
Webhooks for external extensibility: Billing, Notifications, and Audit expose signed webhooks for third-party systems.
Anti-Corruption Layers (ACLs): External payment, messaging, or identity providers are wrapped behind ACLs to protect internal models.
Outbox/Inbox patterns: All cross-context messages flow through outbox (producer) and inbox (consumer) for idempotency and reliability.
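
The outbox/inbox idea can be sketched with an in-memory SQLite database. This is an illustration, not the platform's implementation: the table names and the `activate_tenant` and `handle_once` helpers are hypothetical. The producer writes its state change and the outgoing event in one transaction; the consumer records each message id before handling it, so a redelivery becomes a no-op.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tenants (tenant_id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox  (message_id TEXT PRIMARY KEY, type TEXT, body TEXT,
                          sent INTEGER DEFAULT 0);
    CREATE TABLE inbox   (message_id TEXT PRIMARY KEY);
""")


def activate_tenant(tenant_id: str, message_id: str) -> None:
    """Producer side: change state and enqueue the event in ONE transaction."""
    with db:  # commits on success, rolls back if either insert fails
        db.execute("INSERT INTO tenants VALUES (?, 'active')", (tenant_id,))
        db.execute(
            "INSERT INTO outbox (message_id, type, body) VALUES (?, ?, ?)",
            (message_id, "tenant.created", json.dumps({"tenantId": tenant_id})),
        )


def handle_once(message_id: str, handler) -> bool:
    """Consumer side: run the handler only if this message id is new."""
    try:
        with db:
            # The primary key rejects a message id that was already recorded.
            db.execute("INSERT INTO inbox VALUES (?)", (message_id,))
            handler()
    except sqlite3.IntegrityError:
        return False  # duplicate delivery; already processed
    return True
```

A separate relay (not shown) would read unsent outbox rows, publish them to the bus, and mark them as sent; if the handler raises, the inbox row is rolled back and the message can be retried.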

Context Relationships (narrative)¶

SaaS Core Metadata is upstream for Config, Billing, Usage, and Identity by defining products, editions, and entitlements.
Tenant Management is the authoritative source of tenantId lifecycle and residency; downstream systems subscribe to tenant.created/updated/deleted events.
Billing depends on Usage for rated consumption and on SaaS Core Metadata for plan/entitlement definitions; pushes subscription state changes to Identity (claims) and Config (feature gates).
Config consumes product/edition definitions and applies edition overrides and tenant-scoped flags, influencing feature availability platform-wide.
Audit subscribes to events from every context and exposes immutable, query-optimized views for operators and compliance exports.
AI Orchestration is lateral: it reads contracts and policies, generates scaffolds/PRs, and never mutates domain state without going through the owning context’s APIs.
Notifications consumes events across contexts to deliver comms and expose signed webhooks to tenant systems.

Context Map (Mermaid)¶

flowchart LR
  subgraph Upstream
    SCM[SaaS Core Metadata]
    TEN[Tenant Management]
  end

  IDP[Identity]:::core
  BILL[Billing]:::core
  CONF[Config & Flags]:::core
  USE[Usage & Metering]:::core
  AUD[Audit & Compliance]:::core
  NOTIF[Notifications]:::edge
  AIO[AI Orchestration]:::edge

  TEN -->|tenant.created/updated| IDP
  TEN --> BILL
  TEN --> CONF
  TEN --> USE
  TEN --> AUD
  TEN --> NOTIF

  SCM -->|product/edition/entitlement| BILL
  SCM --> CONF
  SCM --> IDP

  USE -->|usage.recorded| BILL
  BILL -->|subscription.activated/changed| IDP
  BILL -->|entitlements.updated| CONF

  %% Lateral
  AIO -. reads contracts & policies .-> SCM
  AIO -. opens PRs/tests .-> CONF

  %% Observability/Audit taps
  IDP --> AUD
  BILL --> AUD
  CONF --> AUD
  USE --> AUD
  NOTIF --> AUD

  classDef core fill:#0b6,stroke:#094,stroke-width:1,color:#fff;
  classDef edge fill:#357,stroke:#234,stroke-width:1,color:#fff;


Anti-Corruption Layer Patterns¶

Payment Providers (e.g., Stripe, Adyen): Billing implements a Payment ACL translating provider-specific objects (invoices, payment intents) into internal subscription and charge events. Retries and signature validation live in the ACL.
External Identity Providers (e.g., Azure AD, Okta): Identity provides a Federation ACL for SSO and SCIM. External attributes are mapped to internal roles/scopes/claims with normalization and validation.
Email/SMS Gateways: Notifications uses a Messaging ACL to normalize templates, rate limits, and delivery receipts across providers.

Tenancy & Security Considerations¶

Tenant authority: Tenant Management is the sole issuer of tenantId and region/residency attributes; all contexts must validate inbound tenantId against their read model.
Edition enforcement: Identity tokens include edition and derived entitlements; services must authorize using policy checks, not UI hints.
Isolation: Data access is constrained by repository guards and (where supported) RLS/filters keyed by tenantId.
mTLS & Workload Identity: Cross-context calls occur over mTLS; services authenticate using workload identities, not secrets.
Least privilege: Each context exposes minimal operations; event consumers run with scoped permissions.
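
A repository guard of the kind described under Isolation might look like the sketch below. It uses an in-memory list in place of a real store, and the `TenantScopedRepository` class and `TenantMismatchError` exception are hypothetical names chosen for illustration.

```python
class TenantMismatchError(Exception):
    """Raised when a call names a tenant the local read model does not know."""


class TenantScopedRepository:
    """In-memory stand-in for a repository guard keyed by tenantId."""

    def __init__(self, known_tenants):
        self._known = set(known_tenants)  # local read model of valid tenant ids
        self._rows = []                   # each row is a dict with a 'tenantId'

    def _require_known(self, tenant_id):
        if tenant_id not in self._known:
            raise TenantMismatchError(f"unknown tenant: {tenant_id!r}")

    def add(self, tenant_id, row):
        self._require_known(tenant_id)
        # The guard stamps the tenant id itself; callers cannot override it.
        self._rows.append({**row, "tenantId": tenant_id})

    def find_all(self, tenant_id):
        self._require_known(tenant_id)
        # Every read is filtered by tenant, so rows never leak across tenants.
        return [r for r in self._rows if r["tenantId"] == tenant_id]
```

In a relational store the same filter would typically be pushed into the query or enforced with row-level security, as the Isolation bullet notes.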

Observability & SLO Notes¶

Traceability: Every cross-context call/event carries traceId, tenantId, and edition. Spans are named ctx.operation (e.g., billing.rate-usage).
Golden signals per context:
- Identity: token issuance latency p95, failure rate
- Billing: invoice generation p95, reconciliation lag
- Config: flag evaluation latency p95, cache hit%
- Usage: event ingest lag p95, quota enforcement accuracy
- Audit: write throughput, query latency p95
Error budgets: Defined per context with shared budgets for critical flows (e.g., onboarding path TEN→CONF→IDP).

Evolution Principles¶

Independent deployability: Contexts release on their own cadence; contracts evolve via additive changes and versioned topics/paths.
Schema evolution: Event versioning via type.suffix.vN; consumers must tolerate unknown fields.
Backwards compatibility: Deprecations follow a documented sunset window; ACLs absorb third-party churn.
Testing in isolation: Contract and consumer-driven tests validate behavior without requiring full-platform spin-up.
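
The schema-evolution rule can be illustrated with a tolerant reader. The sketch assumes that `type.suffix.vN` means an event name followed by a `.vN` version segment (for example `tenant.created.v2`); the `TenantCreated` shape and both helper functions are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed layout: "<event name>.v<integer>", e.g. "tenant.created.v2".
_TYPE_RE = re.compile(r"^(?P<name>[a-z0-9.\-]+)\.v(?P<version>\d+)$")


def parse_event_type(event_type: str):
    """Split a versioned event type into (name, version)."""
    match = _TYPE_RE.match(event_type)
    if match is None:
        raise ValueError(f"not a versioned event type: {event_type!r}")
    return match["name"], int(match["version"])


@dataclass
class TenantCreated:
    tenant_id: str
    region: str = "unknown"  # additive field; the default keeps older events readable


def read_tenant_created(payload: dict) -> TenantCreated:
    """Tolerant reader: pick the fields this consumer knows, ignore the rest."""
    return TenantCreated(
        tenant_id=payload["tenantId"],
        region=payload.get("region", "unknown"),
    )
```

Because the reader never iterates over the whole payload, a producer can add fields without breaking this consumer, which is what makes additive changes safe.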

Solution Architect Notes¶

Start with SaaS Core Metadata and Tenant Management as foundation; everything else composes around their events.
Treat Billing and Config as policy enforcers driven by metadata and subscriptions; resist hardcoding feature switches inside services.
Keep AI Orchestration stateless with auditable tool calls; it should propose and scaffold, not silently mutate domain state.
Maintain strict Audit boundaries: append-only, immutable identifiers, and explicit export paths with data minimization.

Non-Functional Requirements & SLOs¶

Purpose¶

The SaaS Factory must deliver predictable, reliable, and secure foundations for any generated SaaS product. Non-functional requirements (NFRs) set the quality bar across dimensions like performance, availability, scalability, security, privacy, compliance, and portability. Service Level Indicators (SLIs) and Service Level Objectives (SLOs) ensure that these qualities are measured and enforced consistently.

Performance¶

API Latency:
- p95 read latency ≤ 200 ms
- p95 write latency ≤ 350 ms
Throughput:
- Minimum 1k requests/minute per tenant on pooled editions
- Scalable to 50k requests/minute on Enterprise

Performance requirements apply uniformly across generated services, with quotas and scaling managed by edition policies.

Availability¶

Platform Uptime: 99.9% monthly baseline (Enterprise tiers may raise this to 99.95%)
Critical APIs (Auth, Tenant Onboarding, Billing): ≥ 99.95%
Scheduled Maintenance: Maximum of 4 hours per month, communicated via status channels

Error budgets are defined per tier; for example, Free tenants may tolerate higher downtime than Enterprise tenants.

Scalability¶

Multi-Tenant Growth: Must scale to 10k+ tenants per cluster with pooled database model.
Elastic Scaling: Auto-scale services based on CPU/memory utilization and queue lag.
Edition-Aware Quotas: Rate limits, storage, and compute quotas enforced dynamically per edition.

Scalability is achieved through horizontal scaling (stateless microservices) and vertical scaling for data workloads.

Security & Privacy¶

Zero Trust Enforcement: All inter-service traffic authenticated with workload identities + mTLS.
Data Protection:
- Encryption at rest (AES-256)
- Encryption in transit (TLS 1.3)
Secrets Management: All secrets stored in Key Vault; secret-less by default with managed identities.
Privacy by Design: PII minimized, data masking in logs, erasure supported via APIs.
Compliance Baselines: GDPR, SOC2, HIPAA (configurable by product/edition).

Observability¶

Tracing: 100% of requests/events tagged with traceId, tenantId, edition.
Metrics: Golden signals for latency, errors, saturation, traffic.
Logging: Structured, JSON, PII redacted by default.
Dashboards: Predefined for each context; customizable by tenant/edition.

Observability cannot be disabled. It is a core guardrail across all generated services.
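
A structured, redacting log formatter could be sketched as follows. The list of PII keys and the `fields` attribute convention are assumptions made for this example; a real service would take both from policy.

```python
import json
import logging

# Assumption: an illustrative set of keys treated as PII.
PII_KEYS = {"email", "phone", "name"}


def redact(fields: dict) -> dict:
    """Replace the values of known PII keys (top level only)."""
    return {
        key: ("[REDACTED]" if key.lower() in PII_KEYS else value)
        for key, value in fields.items()
    }


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, with PII redacted by default."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Structured fields are passed via logging's `extra` argument.
            **redact(getattr(record, "fields", {})),
        }
        return json.dumps(entry, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("factory.sketch")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

A call such as `logger.info("user invited", extra={"fields": {"tenantId": "t-1", "email": "a@example.com"}})` would then emit the tenant id but not the address. Nested structures are not inspected in this sketch.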

Compliance & Portability¶

Compliance-by-Default: All services scaffolded with standard controls.
Auditability: Immutable audit logs retained for 7 years (configurable).
Portability: Azure-first, but abstractions for SQL → PostgreSQL, Service Bus → Kafka, Key Vault → Vault.

Generated SaaS solutions can be redeployed across clouds without rewriting core services.

SLI/SLO Baseline Table¶

Dimension	SLI	SLO (Baseline)	Notes / Error Budget
Performance	p95 read/write latency	≤ 200 ms / 350 ms	Budget: 5% requests may exceed
Availability	Platform uptime	≥ 99.9%	Free edition: 99.5%
Scalability	Max tenants per cluster	≥ 10,000	Quotas per edition
Security	% secrets managed via KV/MSI	100%	No exceptions allowed
Privacy	% requests with PII redacted in logs	100%	Guardrail; cannot be disabled
Observability	% requests traced w/ traceId,tenantId	100%	Mandatory invariant
Compliance	Audit record retention	7 years	Configurable override possible

Error Budgets¶

Availability: If uptime falls below 99.9%, feature velocity is slowed until SLO compliance restored.
Performance: 5% budget for latency breaches; if exceeded, scaling or optimization prioritized.
Security/Privacy: No error budget — violations trigger incident severity 1.
Observability: No error budget — missing telemetry is considered a defect.
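
To make the availability budget concrete, the arithmetic below converts an uptime SLO into allowed downtime. It assumes a 30-day month; the function name is hypothetical.

```python
from decimal import Decimal

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(slo_percent: str) -> Decimal:
    """Monthly downtime budget implied by an uptime SLO such as '99.9'."""
    unavailable_fraction = (Decimal(100) - Decimal(slo_percent)) / Decimal(100)
    return unavailable_fraction * MINUTES_PER_30_DAY_MONTH


for slo in ("99.5", "99.9", "99.95"):
    print(slo, allowed_downtime_minutes(slo))
```

Under that assumption, 99.9% allows 43.2 minutes of downtime per month, 99.95% allows 21.6 minutes, and the Free-edition figure of 99.5% allows 216 minutes. A 4-hour maintenance window is 240 minutes, which suggests that scheduled maintenance is meant to be excluded from the uptime SLI.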

Solution Architect Notes¶

Apply edition-specific SLO overlays (Enterprise gets stricter SLOs than Free/Standard).
Use synthetic checks for critical flows (tenant onboarding, login, billing).
Embed NFRs into CI/CD as automated quality gates.
Reassess SLOs quarterly and align with customer contracts.

Platform Reference Architecture¶

Purpose¶

Define a reusable runtime blueprint for the generic SaaS platform across environments. The architecture emphasizes edge security, multi-tenant isolation, event-driven collaboration, and observability by default, while remaining portable between AKS and Azure Container Apps (ACA).

Overview¶

The platform is partitioned into clear planes and tiers:

Edge Plane: Front Door/WAF → API Gateway (YARP) → UI/Portal
Control Plane: Identity Provider, Policy/Config, CI/CD, IaC
Data & Messaging Plane: Azure SQL (primary), optional MongoDB, Redis, Storage, Service Bus
Workload Plane: Core microservices per bounded context (Identity, Tenant, Billing, Config, Usage, Notifications, Audit, AI Orchestration)
Observability Plane: OpenTelemetry collectors, Logs, Metrics, Traces, Dashboards/Alerts
Jobs Plane: Hangfire/KEDA workers, scheduled tasks, DLQ replayers

C4 Container (Mermaid)¶

C4Container
  title Generic SaaS Platform - Container View

  Person(tenantAdmin, "Tenant Administrator")
  Person(endUser, "End User")
  Person(operator, "Platform Operator / SRE")

  Container(portal, "Web UI / Tenant Portal", "SPA", "OIDC client")
  Container(adminConsole, "Admin Console", "SPA", "OIDC client")
  Container(gateway, "API Gateway", "YARP", "AuthN/Z, routing, rate limits, transforms")
  Container(idp, "Identity Provider", "OpenIddict/Azure AD", "Tokens, RBAC/ABAC")
  Container(tenantSvc, "Tenant Service", ".NET", "Tenant lifecycle, residency")
  Container(billingSvc, "Billing Service", ".NET", "Plans, subscriptions, invoices")
  Container(configSvc, "Config & Feature Flags", ".NET", "Flags, edition overrides")
  Container(usageSvc, "Usage & Metering", ".NET", "Meters, quotas, aggregates")
  Container(notifySvc, "Notifications", ".NET", "Email/SMS/Webhooks via ACL")
  Container(auditSvc, "Audit & Compliance", ".NET", "Append-only audit log")
  Container(metadataSvc, "SaaS Core Metadata", ".NET", "Products, editions, entitlements")
  Container(aiSvc, "AI Orchestration", ".NET + SK", "Agentic scaffolding, tests/docs")
  ContainerQueue(bus, "Service Bus", "Topics/Queues", "Async events, DLQ")
  ContainerDb(sql, "Azure SQL", "Relational", "Primary state (RLS/tenant guards)")
  ContainerDb(mongo, "MongoDB (opt)", "Document", "Templates & payloads (notifications)")
  ContainerDb(redis, "Redis", "Cache", "Flag evaluation, token/claim caches")
  ContainerDb(storage, "Storage", "Blob", "Artifacts, exports")
  Container(obs, "Observability Stack", "OTel/Prometheus/Grafana/Logs", "Traces, metrics, logs")
  Container(jobs, "Jobs Runtime", "Hangfire/KEDA", "Schedulers, DLQ replay")

  Rel(tenantAdmin, portal, "Browser, OIDC")
  Rel(endUser, portal, "Browser, OIDC")
  Rel(operator, adminConsole, "Browser, OIDC")

  Rel(portal, gateway, "HTTPS")
  Rel(adminConsole, gateway, "HTTPS")
  Rel(gateway, idp, "OIDC flows, token validation")

  Rel(gateway, tenantSvc, "mTLS, JWT")
  Rel(gateway, billingSvc, "mTLS, JWT")
  Rel(gateway, configSvc, "mTLS, JWT")
  Rel(gateway, usageSvc, "mTLS, JWT")
  Rel(gateway, notifySvc, "mTLS, JWT")
  Rel(gateway, auditSvc, "mTLS, JWT")
  Rel(gateway, metadataSvc, "mTLS, JWT")
  Rel(gateway, aiSvc, "mTLS, JWT")

  Rel(tenantSvc, bus, "publish/subscribe")
  Rel(billingSvc, bus, "publish/subscribe")
  Rel(configSvc, bus, "publish/subscribe")
  Rel(usageSvc, bus, "publish/subscribe")
  Rel(notifySvc, bus, "consume events")
  Rel(auditSvc, bus, "consume events")
  Rel(aiSvc, bus, "publish/subscribe")

  Rel(tenantSvc, sql, "ADO.NET")
  Rel(billingSvc, sql, "ADO.NET")
  Rel(configSvc, sql, "ADO.NET")
  Rel(configSvc, redis, "Cache")
  Rel(usageSvc, sql, "ADO.NET")
  Rel(notifySvc, mongo, "Drivers")
  Rel(auditSvc, sql, "Append-only")
  Rel(aiSvc, storage, "Artifacts")

  Rel(gateway, obs, "OTLP (traces/metrics/logs)")
  Rel(tenantSvc, obs, "OTLP")
  Rel(billingSvc, obs, "OTLP")
  Rel(configSvc, obs, "OTLP")
  Rel(usageSvc, obs, "OTLP")
  Rel(notifySvc, obs, "OTLP")
  Rel(auditSvc, obs, "OTLP")
  Rel(metadataSvc, obs, "OTLP")
  Rel(aiSvc, obs, "OTLP")
  Rel(jobs, obs, "OTLP")
  Rel(jobs, bus, "Job queues, DLQ replay")


Runtime Topologies¶

AKS (Kubernetes)

Best for large multi-tenant scale, advanced networking, service mesh, and fine-grained autoscaling.
Supports mTLS via service mesh, network policies, and workload identities.
Suited for Enterprise and products requiring custom sidecars or high-throughput messaging.

ACA (Azure Container Apps)

Simpler operational footprint, KEDA-native autoscaling on HTTP/queue metrics.
Excellent for SMB/Standard offerings, jobs, workers, and bursty workloads.
Can coexist with AKS as a jobs plane (DLQ replayers, batch processors).

Both variants keep the contract and container boundaries identical; choice is an operational concern decided per environment or product.

Network & Trust Boundaries¶

Public Edge: Front Door/WAF terminates TLS; only the Gateway is internet-exposed.
Private App Network: All services run in private subnets with deny-by-default policies.
Data Network: Databases and storage are accessible via Private Link; no public endpoints.
Observability & CI/CD: Access via managed identities and least privilege; audit logs immutable.

Trust boundaries are drawn at the Gateway edge and between Workload Plane ↔ Data Plane. Every call across a boundary uses mTLS and JWT validation (for user/service identity).

Identity & Secrets¶

User/Client Identity: OIDC tokens issued by OpenIddict/Azure AD, including tenantId, edition, entitlements.
Workload Identity: Managed identity for services; avoid secret injection; Key Vault for any remaining secrets.
Policy Enforcement: Scope/role checks at the gateway and policy filters in services (RBAC/ABAC).
Key Rotation: Automated; baseline rotation interval ≤ 90 days.

Data & Storage Layer¶

Primary: Azure SQL with tenant guards (RLS or repository filters), strict schema ownership per context.
Optional: MongoDB for notification templates/payloads; Redis for low-latency flag evaluation and token claims.
Artifacts: Blob storage for exports, audit bundles, and AI scaffolding outputs.
Retention: Defaults per context (e.g., Audit 7y, Usage raw 90d + aggregates).

Messaging Layer¶

Azure Service Bus: Topics/queues for domain events, sagas, and DLQs.
MassTransit: Implements outbox/inbox, retry with jitter, and saga coordination.
Contracts: Canonical events with versioning; idempotency keys; signed webhook exports at the edge.
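
Retry with jitter is easiest to see in a small, library-free sketch. The code below is not MassTransit's API (MassTransit is a .NET library); it illustrates the general technique with exponential backoff and "full jitter", and the function names and default values are assumptions.

```python
import random
import time


def backoff_delays(retries, base=0.2, cap=10.0, rng=random.random):
    """Yield one 'full jitter' delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(retries):
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retry(operation, retries=4, sleep=time.sleep, **backoff):
    """Run operation; on failure wait a jittered delay and try again."""
    delays = backoff_delays(retries, **backoff)
    while True:
        try:
            return operation()
        except Exception:  # a real consumer would retry transient errors only
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted; the caller can dead-letter the message
            sleep(delay)
```

Randomizing each delay spreads retries from many consumers over time, so a recovering dependency is not hit by synchronized bursts.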

Observability Plane¶

OpenTelemetry Everywhere: Traces, logs, metrics with required attributes (traceId, tenantId, edition).
Dashboards: Per context and per tenant/edition views; error budgets and SLO tracking.
Log Hygiene: Structured JSON, PII redaction by default, correlation with audit records.

Jobs & Scheduling¶

Recurring/Scheduled: Hangfire or ACA jobs with UTC cron; idempotent job keys.
Event-Driven: KEDA scales workers off queue length and lag (DLQ replayers, compactions).
Observability: Job success/failure metrics, run durations, and retries are first-class signals.

High Availability & Scaling¶

Stateless Services: Horizontal Pod Autoscaler (AKS) or KEDA (ACA).
Stateful Stores: Active geo-replication (SQL), zone-redundant storage, and backup/restore runbooks.
Multi-Tenancy Scaling: Edition-aware quotas; per-tenant throttling at the gateway and policy-based limits in services.
Blue/Green & Canary: Gateway routes plus deployment strategies to minimize risk.

Failure Modes & Recovery (selected)¶

Gateway Degradation: Fail closed for auth; serve static maintenance page via Front Door.
Bus Backlog: KEDA autoscale; overflow to DLQ; DLQ replay jobs with circuit breakers.
DB Hot Partition: Trigger tenant sharding or schema-per-tenant promotion per policy.
IdP Outage: Use cached tokens within acceptable TTL; degrade gracefully for non-critical flows.

Solution Architect Notes¶

Start deployments with ACA for simplicity; promote to AKS where fine-grained control or mesh features are required.
Keep the Gateway thin; push business decisions into services and policy layers.
Enforce mTLS + workload identity ubiquitously; secrets are exceptions, not norms.
Make observability non-negotiable: traces/logs/metrics must ship before exposing any public endpoint.

Edge & API Gateway¶

Purpose¶

Establish a secure, policy-driven ingress that terminates public traffic, authenticates at the edge, resolves tenants, enforces edition-aware quotas, and steers requests to the correct backend services. The gateway is a custom .NET Core solution built on YARP (reverse proxy) so we can embed ConnectSoft’s tenancy/security/observability invariants and progressive-delivery controls (canary, blue/green).

Ingress Topology & Trust Boundaries¶

Public Edge: Azure Front Door + WAF (TLS 1.3, DDoS protection) → Gateway (internet-facing).
Private App Network: Gateway to services over mTLS within a private VNet.
Identity Boundary: Gateway is the primary policy enforcement point; it validates OIDC tokens and workload identities, and stamps downstream calls with normalized headers and claims.
Observability Boundary: Gateway attaches mandatory correlation and tenancy headers (e.g., x-trace-id, x-tenant-id, x-edition) and emits OTel spans.

sequenceDiagram
  participant C as Client
  participant F as Front Door/WAF
  participant G as API Gateway (.NET+YARP)
  participant I as Identity (OpenIddict/AAD)
  participant S as Backend Service

  C->>F: HTTPS request
  F->>G: Forward with TLS
  G->>I: Validate token / challenge if needed
  I-->>G: JWT (tenantId, edition, scopes)
  G->>G: Tenant resolution (host/header/token), rate-limit, authZ
  G->>S: mTLS + normalized headers (traceId, tenantId, edition)
  S-->>G: Response
  G-->>C: Response (transforms, caching hints)


Request Lifecycle (Edge Policies)¶

Transport & TLS: Front Door terminates public TLS; Gateway re-terminates internally and enforces HSTS and strict cipher suites.
Authentication: OIDC bearer validation at the gateway; anonymous routes explicitly allow-listed (e.g., /public/*, webhook callbacks).
Tenant Resolution: Sources are checked in order of precedence: header (x-tenant-id) → hostname subdomain → token claim. Requests with ambiguous or missing tenancy are rejected.
Authorization: RBAC/ABAC decision at edge when possible (scope/role/edition checks); fine-grained decisions may be delegated with policy headers.
Quota & Rate Limiting: Edition-aware token-bucket limits with leaky-bucket smoothing; per-tenant counters.
Routing & Versioning: Path, header, or media-type versioning (e.g., /v1/..., Accept: application/vnd.connectsoft.v2+json).
Transforms: Add/remove/normalize headers, response shape harmonization for legacy clients.
Progressive Delivery: Weighted routing for canary and blue/green; circuit-breakers and outlier detection.
Observability: OTel spans with traceId, spanId, tenantId, edition, routeId, backendId; structured logs with PII redaction.

Authentication & Authorization at the Edge¶

Tokens: OIDC JWT with mandatory claims: sub, tenantId, edition, scp (scopes), roles[], entitlements{}.
Service-to-Service: Gateway accepts mTLS and/or signed JWT from trusted callers (jobs, webhooks) and issues downstream identities via signed headers plus mTLS to services.
Anonymous Access: Explicit allow-list (e.g., health, well-known, webhook receiver). All else requires valid token.
Policy Evaluation: Coarse-grained at edge (block early), fine-grained within services (resource-level ABAC). Deny-by-default.

Edition-Aware Rate Limits & Quotas (defaults)¶

Edition	Global RPM (per tenant)	Burst	Concurrency	Notes
Free	60	120	10	Trial-friendly; strict retries
Standard	600	1200	50	Typical workloads
Enterprise	3000	6000	200	Prioritized queues & support
Custom	Configurable	–	–	Set via edition metadata

Enforced at edge; mirrored by backend safeguards to prevent “quota bypass.”

Routing, Versioning & Canary¶

Routes: Path-based (/api/tenants/*), tag-based (x-product, x-context), and method-aware (PCI-safe rules for billing).
Versioning: Path or content-negotiation; gateway validates supported versions and forwards x-api-version downstream.
Canary / Blue-Green: Weighted clusters (e.g., 90/10) and header-based targeting for internal testers (x-canary: true). Automatic rollback on SLO breach (latency/error-rate thresholds).
Shadow Traffic (optional): Duplicate a fraction of reads to a shadow backend for safe testing without client impact.
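
The header-based and percentage-based steering described above can be sketched as follows. This is a minimal, language-neutral illustration in Python rather than the gateway's actual implementation; `pick_destination`, the `WEIGHTS` map, and the choice to hash the tenant id for stickiness are assumptions made for the example.

```python
import hashlib

# Hypothetical weights mirroring the 90/10 example; values are percentages.
WEIGHTS = {"stable": 90, "canary": 10}


def pick_destination(headers: dict, tenant_id: str, weights: dict = WEIGHTS) -> str:
    """Choose a destination for one request.

    Internal testers opt in with `x-canary: true`; everyone else is bucketed
    deterministically by tenant so a tenant stays on one side of the rollout.
    """
    if headers.get("x-canary", "").lower() == "true":
        return "canary"
    # Map the tenant id to a stable bucket in [0, 100).
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    # Walk the cumulative weights until the bucket falls inside a range.
    cumulative = 0
    for name, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return name
    return "stable"  # fallback if the weights sum to less than 100
```

Under this sketch, an automatic rollback amounts to publishing `{"stable": 100, "canary": 0}` as the new weights; header-targeted testers would still reach the canary unless that path is disabled separately.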

Request/Response Transformations (examples)¶

Inbound: Strip hop-by-hop headers; enforce x-tenant-id; normalize Accept and Content-Type; inject x-correlation-id when missing.
Outbound: Map backend error envelopes to a factory-standard error schema; attach cache hints for GETs; remove internal headers.

Resilience at the Edge¶

Timeouts: Sensible route-level timeouts (e.g., 2s internal, 5s external).
Retries: Idempotent methods only (GET/HEAD/OPTIONS) with exponential backoff + jitter.
Circuit Breakers: Per-destination outlier detection; automatic ejection and gradual recovery.
Backpressure: 429 with Retry-After and tenant-specific hints; shed load for Free first.
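
As an illustration of the retry rule above, the sketch below computes exponential backoff delays with full jitter and gates retries on method and status. The base delay, cap, attempt limit, and the set of retryable status codes are assumptions for the example, not values defined by this design.

```python
import random

IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS"}


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 2.0, rng=random.random):
    """Return one 'full jitter' delay (in seconds) per retry attempt.

    The ceiling doubles on each attempt (base * 2**n), is capped at `cap`,
    and the actual delay is the ceiling scaled by a random factor in [0, 1).
    """
    return [rng() * min(cap, base * (2 ** n)) for n in range(attempts)]


def should_retry(method: str, status: int, attempt: int, max_attempts: int = 3) -> bool:
    """Retry only idempotent methods, and only on transient upstream errors."""
    transient = status in (502, 503, 504)
    return method.upper() in IDEMPOTENT_METHODS and transient and attempt < max_attempts
```

Jitter spreads retries from many clients over time, which helps avoid synchronized retry storms against a backend that is already saturated.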

Observability (Gateway Signals)¶

Traces: gateway.request, gateway.route.resolve, gateway.authn, gateway.authz, gateway.ratelimit, gateway.proxy.
Metrics: Requests/sec by route/tenant/edition, p95/p99 latency, upstream error rate, ejected destinations, rate-limit hits, auth failures.
Logs: Redacted request/response summaries with route, tenant, edition, traceId; WAF correlation.

Configuration Templates (illustrative)¶

YARP Routes & Clusters (weighted canary + transforms)

Notes: the {TraceId} and {TenantId} values are placeholders. YARP's configuration-based "Set" transform writes a literal value, so per-request values need a code-based transform. YARP also ships no weighted load-balancing policy; the CanaryWeights metadata is assumed to be read by a custom policy.

{
  "ReverseProxy": {
    "Routes": {
      "tenant-api": {
        "ClusterId": "tenant-svc",
        "Match": { "Path": "/api/tenants/{**catch-all}" },
        "AuthorizationPolicy": "RequireAuthenticatedUser",
        "RateLimiterPolicy": "EditionAwarePolicy",
        "CorsPolicy": "Default",
        "Transforms": [
          { "RequestHeader": "X-Correlation-Id", "Set": "{TraceId}" },
          { "RequestHeader": "X-Tenant-Id", "Set": "{TenantId}" },
          { "RequestHeaderOriginalHost": "true" }
        ]
      }
    },
    "Clusters": {
      "tenant-svc": {
        "Destinations": {
          "stable": { "Address": "http://tenant-svc-v1/" },
          "canary": { "Address": "http://tenant-svc-v2/" }
        },
        "LoadBalancingPolicy": "PowerOfTwoChoices",
        "SessionAffinity": { "Enabled": false },
        "HealthCheck": { "Passive": { "Enabled": true, "Policy": "TransportFailureRate" } },
        "Metadata": { "CanaryWeights": "stable=90;canary=10" }
      }
    }
  }
}

.NET Rate Limiting (edition-aware)

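using System.Threading.RateLimiting;
using Microsoft.AspNetCore.RateLimiting;

builder.Services.AddRateLimiter(options =>
{
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;
    options.AddPolicy("EditionAwarePolicy", context =>
    {
        var edition = context.User?.FindFirst("edition")?.Value ?? "Free";
        // Prefer the validated tenantId claim; a client-supplied header alone
        // would let a caller switch buckets and bypass its quota.
        var tenantId = context.User?.FindFirst("tenantId")?.Value
            ?? context.Request.Headers["x-tenant-id"].ToString();
        // (tokens per period, replenishment period, bucket size) per the defaults table.
        var (permit, replen, burst) = edition switch
        {
            "Enterprise" => (3000, TimeSpan.FromMinutes(1), 6000),
            "Standard"   => (600,  TimeSpan.FromMinutes(1), 1200),
            _            => (60,   TimeSpan.FromMinutes(1), 120),
        };
        return RateLimitPartition.GetTokenBucketLimiter(
            partitionKey: $"{edition}:{tenantId}",
            factory: _ => new TokenBucketRateLimiterOptions
            {
                TokenLimit = burst,            // maximum burst size
                TokensPerPeriod = permit,      // added once per ReplenishmentPeriod
                ReplenishmentPeriod = replen,
                AutoReplenishment = true,
                QueueLimit = 0                 // reject immediately instead of queueing
            });
    });
});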

Program.cs (YARP + OIDC + OTel + mTLS enforcement)

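var builder = WebApplication.CreateBuilder(args);

builder.Services.AddAuthentication("Bearer")
    .AddJwtBearer("Bearer", o =>
    {
        o.Authority = builder.Configuration["IdP:Authority"];
        // Validate the audience so tokens minted for other APIs are rejected.
        // "IdP:Audience" is an illustrative configuration key.
        o.Audience = builder.Configuration["IdP:Audience"];
        o.MapInboundClaims = false;
    });
// Policy name referenced by the YARP route's AuthorizationPolicy.
builder.Services.AddAuthorization(o =>
    o.AddPolicy("RequireAuthenticatedUser", p => p.RequireAuthenticatedUser()));

builder.Services.AddReverseProxy().LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"));
builder.Services.AddOpenTelemetry().WithTracing(t => t.AddAspNetCoreInstrumentation().AddHttpClientInstrumentation());

var app = builder.Build();

app.UseAuthentication();
app.Use(async (ctx, next) =>
{
    // Enforce mTLS from Front Door private link or internal LB.
    // Kestrel must be configured to request client certificates for this to be populated.
    var cert = ctx.Connection.ClientCertificate;
    if (cert is null || !cert.Verify())
    {
        ctx.Response.StatusCode = StatusCodes.Status403Forbidden;
        return;
    }
    await next();
});
app.UseAuthorization();   // required for the route-level AuthorizationPolicy
app.UseRateLimiter();     // after authentication so edition claims are available
app.MapReverseProxy();
app.Run();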

Failure Modes & Playbooks (selected)¶

Token Validation Failures: Return 401 with a WWW-Authenticate header; verify IdP health; enable cached signing keys with TTL; fail closed for sensitive routes.
Backend Saturation: Trigger circuit breaker; reduce canary weight to 0; raise 429 with Retry-After.
Route Drift / Misconfig: Config lint & contract tests in CI; runtime config reload guarded by feature flag; instant rollback to last-known-good.
Tenant Ambiguity: Reject with 400 + problem details; provide diagnostic trace; require explicit x-tenant-id or correct host.

Solution Architect Notes¶

Keep the gateway thin but policy-rich: authN/Z, tenancy, quotas, and traffic shaping; no business logic.
Prefer header-based canary steering for internal testing and percentage-based for public rollouts.
Make edition-aware rate limiting visible to tenants via headers and usage endpoints.
Treat the gateway as a security product: frequent pen tests, strict dependency hygiene, and SBOM/signing in CI.

Identity, Authentication & Authorization¶

Purpose¶

Provide a unified, multi-tenant identity plane for users and services. Standardize OAuth2/OIDC flows, token and claim design, RBAC/ABAC policy enforcement, and workload identity for service-to-service calls. Support both a custom .NET (OpenIddict) Identity Provider and external IdPs (Azure AD/Okta) behind a federation boundary.

Trust Boundaries & High-Level Flows¶

flowchart LR
  U[User/Client App] -->|OIDC| G[Edge Gateway]
  G --> IdP["Identity Provider (OpenIddict/AAD)"]
  G --> Svc[Backend Services]
  IdP -->|"JWT (tenantId, edition, scopes, roles)"| G
  G -->|mTLS + normalized identity headers| Svc
  Svc -->|"policy check (RBAC/ABAC)"| Svc

  subgraph Identity Plane
    IdP
  end
  subgraph Workload Plane
    Svc
  end
  classDef boundary stroke-width:2,stroke:#999


Boundaries

Public boundary: Clients ↔ Gateway (OIDC/OAuth2); WAF + TLS 1.3.
Control boundary: Gateway ↔ Services (mTLS + JWT, workload identity).
Identity boundary: Gateway trusts IdP token-signing keys; services trust gateway-issued identity context and validate tokens again internally.

Identity Provider Options¶

Primary (factory-default): Custom .NET IdP using OpenIddict

Supports Authorization Code + PKCE, Client Credentials, Device Code (optional), and Refresh Tokens.
Multi-tenant claim issuance; edition/entitlement enrichment from SaaS Core Metadata.
Local user store (ASP.NET Identity) plus federation with external IdPs.
SCIM 2.0 (Enterprise) for automated user provisioning and deprovisioning.

Federated (enterprise option): Azure AD / Okta

External IdP via OIDC federation; a federation anti-corruption layer (ACL) maps external attributes to internal roles/scopes.
Supports SSO, conditional access, MFA, and B2B invites.

Token Model (JWT)¶

Standard claims

sub, iss, aud, exp, iat, nbf

Multi-tenant & edition claims

tenantId: authoritative tenant identifier (issued by Tenant Management)
edition: plan identifier (e.g., Free, Standard, Enterprise, or custom)
entitlements: bag of feature flags/limits at issuance time (digest, not the source of truth)

Authorization claims

scp (scopes): API permissions (coarse-grained)
roles: high-level roles (e.g., tenant_admin, support_agent, billing_admin)
abac: optional attribute set for policy engines (e.g., {"region":"EU","dataClass":"PII"})

Service identity

Client Credentials flow issues tokens for service principals with appId, aud, and minimal scopes.
Downstream calls are authenticated with mTLS and validated JWT, with identity context propagated via headers (x-tenant-id, x-actor, x-roles, x-scope).

Lifetimes (defaults)

Access token: 15 minutes
Refresh token: 24 hours (rotating)
Client credentials token: 10 minutes
Key rotation: ≤ 90 days (automated), JWKS exposed
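
To make the token model concrete, the following sketch checks a decoded claim set against the mandatory claims and lifetimes listed above. It is illustrative only: signature verification is assumed to have happened already, and `check_claims`, the `MANDATORY_CLAIMS` tuple, and the 120-second leeway constant are names and values chosen for the example.

```python
import time
from typing import List, Optional

# Standard claims plus the multi-tenant and authorization claims named in this section.
MANDATORY_CLAIMS = ("sub", "iss", "aud", "exp", "tenantId", "edition", "scp",
                    "roles", "entitlements")
LEEWAY_SECONDS = 120  # assumption: a clock-skew allowance of two minutes


def check_claims(claims: dict, expected_aud: str, now: Optional[float] = None) -> List[str]:
    """Return a list of problems; an empty list means the claims are acceptable.

    The token signature must be verified before this function is called.
    """
    now = time.time() if now is None else now
    problems = [f"missing claim: {c}" for c in MANDATORY_CLAIMS if c not in claims]
    if problems:
        return problems
    aud = claims["aud"]
    audiences = aud if isinstance(aud, list) else [aud]
    if expected_aud not in audiences:
        problems.append("audience mismatch")
    if now > claims["exp"] + LEEWAY_SECONDS:
        problems.append("token expired")
    if "nbf" in claims and now < claims["nbf"] - LEEWAY_SECONDS:
        problems.append("token not yet valid")
    return problems
```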

RBAC / ABAC Authorization¶

RBAC (role-based)

Roles assigned per tenant; example roles: tenant_admin, member, billing_admin, support_agent, operator.
Services enforce role gates for administrative operations.

ABAC (attribute-based)

Policies evaluate attributes from token + request context (tenant, edition, resource owner, region, data class).
Example: “Users with role support_agent may read logs only when tenant.support_access=true and dataClass != PII.”
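
The example policy above could be expressed as a small deny-by-default check, sketched here in Python. The function name and attribute dictionaries are hypothetical, and treating a missing `dataClass` as a denial is an assumption made to stay consistent with deny-by-default.

```python
def may_read_logs(roles: list, tenant_attrs: dict, resource_attrs: dict) -> bool:
    """ABAC check: support agents may read non-PII logs when the tenant allows it."""
    if "support_agent" not in roles:
        return False
    # The tenant must have explicitly enabled support access.
    if tenant_attrs.get("support_access") is not True:
        return False
    # Unknown data classification is treated as a denial (assumption).
    data_class = resource_attrs.get("dataClass")
    return data_class is not None and data_class != "PII"
```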

Hybrid

Coarse authorization at the gateway (scopes/roles).
Fine-grained authorization inside services (ABAC over resource attributes).

Scope Catalog (illustrative)¶

Scope	Audience	Description	Typical Roles
`tenant.read`	Tenant API	Read tenant profile & settings	`tenant_admin`, `support_agent`
`tenant.manage`	Tenant API	Create/Update/Delete tenant resources	`tenant_admin`
`billing.read`	Billing API	Read subscriptions/invoices	`billing_admin`, `tenant_admin`
`billing.manage`	Billing API	Modify plans, payment methods	`billing_admin`
`config.read`	Config API	Read flags and settings	`member`, `tenant_admin`
`config.manage`	Config API	Create/Update flags, overrides	`tenant_admin`
`usage.read`	Usage API	Read metering/quota	`tenant_admin`, `support_agent`
`audit.read`	Audit API	Query audit logs	`tenant_admin`, `operator`
`notify.send`	Notify API	Send messages (templatized)	`tenant_admin`
`ai.orchestrate`	AI API	Invoke agentic flows/tools	`tenant_admin`, `engineer`

Scopes are additive; deprecations follow a sunset policy. Services validate both scope and role where applicable.

Sign-in flow (Authorization Code + PKCE):

sequenceDiagram
  participant B as Browser/App
  participant G as Gateway
  participant I as IdP (OpenIddict/AAD)
  participant M as Metadata (Products/Entitlements)

  B->>G: /authorize
  G->>I: OIDC Auth Code + PKCE
  I-->>B: auth code
  B->>G: code exchange
  G->>I: token request
  I->>M: enrich claims (tenant, edition, entitlements)
  M-->>I: entitlement snapshot
  I-->>G: id_token + access_token + refresh_token
  G-->>B: session established (SPA stores tokens securely)


Notes

Enrichment pulls current edition/entitlements at issuance; services must still consult Config for real-time flag evaluation.
PKCE & MFA recommended for all first-party SPAs and public clients.

Tenant Resolution & Federation¶

Resolution precedence: x-tenant-id header → subdomain → path (/t/{tenantId}/..., where supported) → token claim. The gateway rejects ambiguous requests.
Federation: Enterprise tenants may authenticate via external IdPs; federation ACL translates external groups/claims into internal roles/scopes.
SCIM (Enterprise): Automates user/role provisioning; deprovision triggers session revocation.

Workload Identity (service-to-service)¶

Managed Identity (AKS/ACA) binds workloads to identities; outbound calls signed at transport (mTLS) and application (JWT).
No static secrets in services; Key Vault for exceptional credentials (e.g., third-party webhooks).
Downstream identity propagation: services forward correlation and minimal identity context; avoid token forwarding unless necessary.

Security & Privacy Controls¶

Zero Trust: deny-by-default, least privilege, explicit allow-lists for anonymous routes.
mTLS: gateway↔service and service↔service; certificate pinning where feasible.
PEP/PDP separation: gateway acts as Policy Enforcement Point; services host Policy Decision logic for resource-level checks.
PII safety: never write raw PII to logs; redaction at sinks; audit every elevation (admin actions).
Consent & Terms: first-class records per tenant; tracked in Audit.

Observability Signals¶

Auth signals: token issuance latency, failed validations, JWKS fetch errors.
Access signals: authz denials by route/scope/role, edition-policy mismatches.
Federation: IdP health, SCIM drift (orphaned accounts), SSO error rates.
Secrets/Certs: rotation age, expiring keys/certs, failed rotations (SEV-1).

Failure Modes & Mitigations¶

IdP outage: cached signing keys and grace tokens for short read-only windows; degrade non-critical flows.
Clock skew: NTP enforcement; leeway on nbf/exp validation (≤ 2 minutes).
Stale entitlements: tokens carry snapshots; Config is source of truth for runtime decisions; short access-token lifetimes reduce drift.
Compromised refresh token: rotate on every use; maintain reuse detection; revoke sessions on suspicion.
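
The refresh-token mitigation (rotate on every use, detect reuse, revoke on suspicion) can be sketched as below. `RefreshTokenStore` is a hypothetical in-memory stand-in written for illustration; a real IdP would persist this state and would typically store token hashes rather than raw tokens.

```python
import secrets


class RefreshTokenStore:
    """Tracks refresh tokens per session 'family' and detects replays."""

    def __init__(self):
        self._active = {}      # token -> family id (only the newest token per family)
        self._used = {}        # token -> family id (tokens that were already rotated)
        self._revoked = set()  # family ids revoked after suspected theft

    def issue(self, family_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._active[token] = family_id
        return token

    def rotate(self, token: str) -> str:
        """Exchange a refresh token for a new one; revoke the family on reuse."""
        if token in self._used:
            # A rotated token came back: assume compromise and revoke the session.
            family = self._used[token]
            self._revoked.add(family)
            self._active = {t: f for t, f in self._active.items() if f != family}
            raise PermissionError("refresh token reuse detected; session revoked")
        family = self._active.pop(token, None)
        if family is None or family in self._revoked:
            raise PermissionError("unknown or revoked refresh token")
        self._used[token] = family
        return self.issue(family)
```

After a reuse is detected, even the most recently issued token for that session stops working, which forces both the legitimate user and the attacker to re-authenticate.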

Solution Architect Notes¶

Prefer OpenIddict for first-party control and rapid feature iteration; use federation to honor enterprise SSO requirements without coupling domain models to external IdPs.
Keep tokens small and short-lived; push dynamic decisions to Config and policy engines.
Enforce workload identity + mTLS ubiquitously; treat any secret-based fallbacks as temporary waivers with expiry.
Model authorization outside the UI; all decisions must be verifiable at API boundaries and auditable.

Multi-Tenancy Strategy¶

Purpose¶

Define how the platform identifies, isolates, and governs tenants across the stack. This section standardizes tenant resolution, isolation levels (pooled/schema/database), configuration/flags enforcement, and onboarding & migration flows so products can scale safely from trials to large enterprises without redesign.

Tenancy Model Overview¶

Tenant as first-class identity: every request, job, event, and data row is associated with exactly one tenantId (or an allowed system actor).
Edition-aware policies: quotas, features, and SLO overlays are resolved at runtime per tenant.
Security & observability invariants: tenant context is mandatory at ingress, persisted with data, and present on all telemetry.

Tenant Resolution¶

Resolution precedence (strict):

1. Header: x-tenant-id (authoritative in service-to-service calls)
2. Host: subdomain.example.com → tenantId mapping
3. Path: /t/{tenantId}/... (supported for specific APIs)
4. Token: JWT claim tenantId (validated but not preferred for multi-tenant APIs)

If the gateway detects ambiguity or mismatch (e.g., header vs host disagree), the request is rejected with a 400 including problem details and a correlation ID.
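
A minimal sketch of this precedence and mismatch rule follows. It is written in Python for illustration only; `resolve_tenant`, `TenantResolutionError`, and the `host_map` lookup are hypothetical names, and the tenant status and region checks that the gateway performs afterwards are omitted.

```python
import re


class TenantResolutionError(Exception):
    """Would map to a 400 problem-details response at the gateway."""


PATH_PATTERN = re.compile(r"^/t/([^/]+)/")


def resolve_tenant(headers: dict, host: str, path: str, token_claims: dict,
                   host_map: dict) -> str:
    """Resolve the tenant using header, host, path, token precedence.

    Every source that is present must agree; a disagreement is rejected
    instead of being silently settled by the higher-precedence source.
    """
    subdomain = host.split(".")[0] if host.count(".") >= 2 else None
    match = PATH_PATTERN.match(path)
    candidates = [
        ("header", headers.get("x-tenant-id")),
        ("host", host_map.get(subdomain) if subdomain else None),
        ("path", match.group(1) if match else None),
        ("token", token_claims.get("tenantId")),
    ]
    present = [(source, value) for source, value in candidates if value]
    if not present:
        raise TenantResolutionError("tenant_ambiguous")  # no tenant could be resolved
    if len({value for _, value in present}) > 1:
        raise TenantResolutionError("tenant_ambiguous")  # sources disagree
    return present[0][1]
```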

Resolver flow (edge):

flowchart LR
  A[Request Arrives] --> B{Has x-tenant-id?}
  B -- Yes --> C[Validate & Normalize Id]
  B -- No --> D{Subdomain present?}
  D -- Yes --> E[Lookup mapping -> tenantId]
  D -- No --> F{"Path /t/{id}?"}
  F -- Yes --> C
  F -- No --> G{Token has tenantId?}
  G -- Yes --> C
  G -- No --> X[Reject 400: tenant_ambiguous]
  C --> H{Tenant active & region allowed?}
  H -- Yes --> I[Attach tenant to context, continue]
  H -- No --> X


Isolation Levels¶

Isolation	Description	When to Use	Data Guarding	Strengths	Trade-offs
Pooled	Shared schema + tables, `tenantId` column on all rows	Trials, SMB, moderate scale	Repo guards + RLS (Row-Level Security)	Highest density, lowest cost	Hot-tenant contention; noisy neighbor risk
Schema-per-Tenant	Dedicated schema per tenant in same DB	Mid-market, heavier customizations	Schema scoping + connection factory	Easier per-tenant backup/restore; reduced contention	Higher catalog bloat; ops overhead
Database-per-Tenant	Dedicated database/server per tenant	Enterprise, regulatory isolation	Network isolation + DB-level IAM	Strongest isolation; independent lifecycle	Highest cost; cross-tenant reporting complexity

Promotion path: pooled → schema → database, triggered by tenant size, SLO breach risk, or regulatory needs. Promotions are online using CDC-based sync and dual-writes during cutover (see Migration).

Tenancy Enforcement (defense in depth)¶

Layer	Enforcement Mechanism	Mandatory Checks
Gateway	Resolver → inject `x-tenant-id`; deny ambiguous; edition-aware rate limit	Token validation, tenant status (active/suspended), region allow-list
Service API	Policy filters & guards	Require tenant in context; cross-tenant IDs rejected
Domain Logic	Tenant-scoped commands/queries	Invariants include `tenantId`; never accept client-provided cross-tenant references
Repository/DAL	RLS or tenant filters; parameterized queries	`WHERE tenant_id = @tenantId` always; no string-concatenated SQL
Messaging	Envelope headers (`tenantId`, `traceId`, `edition`); scoped consumers	Consumers reject missing/foreign tenant headers; per-tenant DLQ segregation
Cache	Tenant-scoped keys	`cache:{tenantId}:{key}`; no shared mutable data
Storage/Blob	Tenant prefix & ACLs	`tenants/{tenantId}/...`; private containers; tenant KMS policies (optional)
Observability	Required attributes on spans/logs/metrics	`tenantId`, `edition`, `traceId` present; queries default to tenant scope
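
Two rows of the table, the repository filter and the tenant-scoped cache key, are sketched below using Python's built-in sqlite3 module. The `invoices` table and the `InvoiceRepository` class are hypothetical; the point is only that the tenant filter is bound at construction and cannot be omitted by a caller.

```python
import sqlite3


def cache_key(tenant_id: str, key: str) -> str:
    """Build a tenant-scoped cache key of the form cache:{tenantId}:{key}."""
    if not tenant_id:
        raise ValueError("tenant context is required")
    return f"cache:{tenant_id}:{key}"


class InvoiceRepository:
    """Repository whose every query is bound to a single tenant."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str):
        if not tenant_id:
            raise ValueError("tenant context is required")
        self._conn = conn
        self._tenant_id = tenant_id

    def get(self, invoice_id: str):
        # Parameterized query; the tenant filter is always applied.
        return self._conn.execute(
            "SELECT id, amount FROM invoices WHERE tenant_id = ? AND id = ?",
            (self._tenant_id, invoice_id),
        ).fetchone()  # None when the invoice belongs to another tenant
```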

Configuration, Flags & Entitlements¶

Resolution order: platform defaults → product defaults → edition pack → tenant overrides → (optional) user context.
Flag evaluation: low-latency via cache (Redis) with consistent hashing; cache entries are tenant-scoped and short-lived.
Entitlements in tokens: treated as snapshots; definitive decision uses Config at request time for drift-free enforcement.
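
The resolution order can be illustrated as a layered merge in which each more specific layer overrides the one before it. This is a sketch with hypothetical setting names, not the Config service's API.

```python
from collections import ChainMap


def resolve_settings(platform: dict, product: dict, edition: dict,
                     tenant: dict, user: dict = None) -> dict:
    """Merge configuration layers; later (more specific) layers win."""
    layers = [platform, product, edition, tenant, user or {}]
    # ChainMap searches left to right, so the most specific layer goes first.
    return dict(ChainMap(*reversed(layers)))
```

For example, a platform default of `maxUsers = 5` overridden by an edition pack value of `50` resolves to `50` unless the tenant or user layer sets it again.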

Data Residency & Regional Routing¶

Residency attribute on tenant (e.g., EU-WEST, US-EAST) selected at onboarding or via enterprise contract.
Routing at gateway directs requests to the region’s workload plane; cross-region access is denied unless an explicit policy allows it.
Data stores are regionally isolated with Private Link; cross-region replication follows DR policy (RPO/RTO).

Onboarding & Lifecycle¶

States: requested → provisioning → active → suspended → deleted

sequenceDiagram
  participant U as Tenant Admin
  participant GW as Gateway
  participant TEN as Tenant Mgmt
  participant META as SaaS Core Metadata
  participant CONF as Config/Flags
  participant BILL as Billing
  participant IDP as Identity

  U->>GW: Sign up / create tenant
  GW->>TEN: create_tenant(request)
  TEN->>META: seed_product_edition(entitlements)
  TEN->>CONF: seed_default_flags(tenantId, edition)
  TEN->>BILL: create_subscription(plan)
  BILL-->>TEN: subscription.pending
  TEN->>IDP: provision_realm(tenant claims, roles)
  TEN-->>GW: provisioning_complete
  Note over TEN: State = active
  GW-->>U: Activation success + admin invite


Suspend/Resume/Delete

Suspend → revoke sessions; freeze subscription; block writes (read-only mode optional).
Delete → two-phase: soft-delete with grace → hard-delete (after retention/erasure workflows).
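
The lifecycle can be modeled as a small state machine. The transition table below is an assumption derived from the state list and the Suspend/Resume/Delete notes: resume is taken to mean suspended → active, and deletion is assumed to be allowed from either active or suspended.

```python
# Allowed tenant lifecycle transitions (assumed; see the lead-in).
ALLOWED_TRANSITIONS = {
    "requested": {"provisioning"},
    "provisioning": {"active"},
    "active": {"suspended", "deleted"},
    "suspended": {"active", "deleted"},  # resume or delete
    "deleted": set(),                    # terminal state
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal tenant transition: {current} -> {target}")
    return target
```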

Migration & Promotion Flows¶

Use cases

Hot tenant promotion from pooled → schema → database.
Region move for residency or latency.
Edition-driven data shape change (e.g., enabling advanced features).

Approach (pooled → schema/db):

1. Prepare: Create target schema/DB; provision IAM and RLS.
2. Sync: Enable CDC or change feed; backfill historical data; start dual-writes.
3. Cutover: Drain inflight ops; flip tenant connection mapping at resolver; verify read/write health.
4. Finalize: Disable dual-writes; decommission old partition after retention window.

Zero-downtime guardrails

Idempotent writes; natural keys stable across partitions.
All services obtain tenant-specific connection info via Tenant Directory cache (with TTL and fast invalidation).
The feature flag tenant.migration.read_only puts the tenant into read-only mode to protect critical sections during cutover.

Observability & SLOs (tenancy-centric)¶

Signal	Target	Notes
Tenant onboarding p95	≤ 60s	create → active
Resolver failure rate	< 0.01%	ambiguous/missing tenant
Cross-tenant access violations	0	treated as SEV-1
Promotion cutover duration	≤ 60s	dual-write window bounded
Flag evaluation latency p95	≤ 5 ms	local/Redis-backed

Dashboards include per-tenant views for latency, error rates, quota consumption, and migration progress.

Security & Privacy Notes¶

Tenant authority lives in Tenant Management; all other contexts validate inbound tenantId against their read model.
No cross-tenant joins in read models unless explicitly marked “multi-tenant analytics” and routed through safe aggregation pipelines.
Erasure support: tenant-owned PII deletions orchestrated via workflow; audit remains immutable with tokenized references.

Failure Modes & Mitigations¶

Ambiguous resolution: 400 with diagnostics; require explicit header; emit audit event.
Noisy neighbor: edition-aware throttling at edge; per-tenant queue partitioning; promote isolation level.
Stale connection mapping: short TTL + cache bust on migration; fallback to directory lookup.
Cross-tenant leak bug: automated tenant-fence tests in CI; runtime guard that verifies returned rows belong to request tenant (sample-based).
Region outage: failover only for tenants whose contracts permit cross-region DR; others remain isolated per residency policy.

Solution Architect Notes¶

Start all tenants pooled; automate promotion paths and keep them routine—not exceptional.
Prefer RLS where supported; otherwise enforce repository guards and property-based testing to prove scoping.
Keep the Tenant Directory authoritative for connection/partition info; never embed static routing in code.
Treat tenant context as non-optional telemetry—it’s the first dimension for debugging, scaling, and support.

Event-Driven Backbone & Contracts¶

Purpose¶

Establish a canonical, versioned event backbone that connects bounded contexts with loose coupling and reliable delivery. Standardize the event envelope, headers, topics/queues, outbox/inbox patterns, idempotency, and DLQ handling, so teams can ship independently while maintaining a stable integration surface.

Principles¶

Event-first collaboration: Services publish domain facts; consumers react and build local read models.
Canonical envelope: All events carry the same required headers; payloads are versioned and backwards-compatible.
At-least-once + idempotency: Producers use outbox; consumers use inbox + idempotency keys.
Tenant isolation: Events are tenant-scoped by default; cross-tenant payloads are prohibited unless flagged as aggregate analytics.
Observable by design: Every event includes telemetry context and is trace-linked to causative actions.

Canonical Envelope (CloudEvents-aligned)¶

Headers (required)

type — semantic event name with version suffix, e.g., tenant.created.v1
id — globally unique event id (ULID/GUID)
source — service/bounded-context name, e.g., tenant-svc
specversion — 1.0
time — RFC3339 timestamp
traceId — W3C traceparent correlation id
tenantId — authoritative tenant identity
edition — edition at the time of emission (snapshot)
schemaVersion — semantic version of the data payload (e.g., 1.0.0)
partitionKey — default tenantId (for ordering at consumer/queue level)
key — idempotency key for the business entity / sequence (e.g., subscription id)

Payload (data)

Domain-specific fields (no PII unless absolutely required; prefer references and lookups).
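
An envelope validator of the kind a CI linter or consumer might run is sketched below. The regular expressions and error strings are illustrative assumptions; only the list of required headers comes from this section.

```python
import re

REQUIRED_HEADERS = ("type", "id", "source", "specversion", "time", "traceId",
                    "tenantId", "edition", "schemaVersion", "partitionKey", "key")
# Dotted lowercase name with at least two segments, then a major version suffix.
TYPE_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+\.v[1-9][0-9]*$")
SEMVER_PATTERN = re.compile(r"^\d+\.\d+\.\d+$")


def validate_envelope(event: dict, expected_tenant: str = None) -> list:
    """Return a list of validation errors; an empty list means the envelope is valid."""
    errors = [f"missing header: {h}" for h in REQUIRED_HEADERS if not event.get(h)]
    if errors:
        return errors
    if event["specversion"] != "1.0":
        errors.append("unsupported specversion")
    if not TYPE_PATTERN.match(event["type"]):
        errors.append("type must be a dotted name ending in a version suffix")
    if not SEMVER_PATTERN.match(event["schemaVersion"]):
        errors.append("schemaVersion must be MAJOR.MINOR.PATCH")
    if expected_tenant is not None and event["tenantId"] != expected_tenant:
        errors.append("tenant mismatch")
    if "data" not in event:
        errors.append("missing data payload")
    return errors
```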

Event Bus Topology¶

flowchart LR
  subgraph Producers
    TEN[Tenant Svc] -->|Outbox| EB(Service Bus Topics)
    BILL[Billing Svc] -->|Outbox| EB
    CONF[Config Svc] -->|Outbox| EB
    USE[Usage Svc] -->|Outbox| EB
    IDP[Identity Svc] -->|Outbox| EB
    NOTIF[Notifications] -->|Outbox| EB
    AUD[Audit Svc] -->|Outbox| EB
  end

  EB -->|Subscriptions| TEN_SUB[Tenancy Subscriptions]
  EB --> BILL_SUB[Billing Subscriptions]
  EB --> CONF_SUB[Config Subscriptions]
  EB --> USE_SUB[Usage Subscriptions]
  EB --> AUD_SUB[Audit Archive]
  EB --> NOTIF_SUB[Delivery Workers]

  classDef svc fill:#0b6,stroke:#094,color:#fff;
  classDef bus fill:#234,stroke:#123,color:#fff;
  classDef sub fill:#357,stroke:#234,color:#fff;


Conventions

Topic-per-domain (e.g., tenant-events, billing-events, config-events, usage-events, identity-events, notifications-events, audit-events).
Subscription-per-consumer with optional filters (SQL filters on type, tenantId, edition).
DLQ per subscription; DLQ contents are immutable and auditable.

Versioning Strategy¶

Event type includes a major payload version (.v1, .v2).
Additive changes (new optional fields) do not bump the major version; they increment only schemaVersion (e.g., 1.0.0 → 1.1.0), and consumers must be tolerant readers.
Breaking changes create a new type (tenant.created.v2). Old and new may coexist during migration.
Deprecation window announced in contracts; observability verifies consumer adoption.
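
A tolerant reader for one event type might look like the sketch below: it accepts only the major version it understands and ignores payload fields it does not know. The `TenantCreatedV1` dataclass mirrors the sample payload fields; the helper names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class TenantCreatedV1:
    name: str
    region: str
    ownerUserId: str


def parse_type(event_type: str):
    """Split 'tenant.created.v1' into ('tenant.created', 1)."""
    name, _, version = event_type.rpartition(".v")
    return name, int(version)


def read_tenant_created(event: dict) -> TenantCreatedV1:
    """Read a tenant.created event, tolerating additive payload changes."""
    name, major = parse_type(event["type"])
    if name != "tenant.created" or major != 1:
        raise ValueError(f"unsupported event type: {event['type']}")
    known = {f.name for f in fields(TenantCreatedV1)}
    # Tolerant reader: keep the fields this consumer knows and ignore the rest.
    return TenantCreatedV1(**{k: v for k, v in event["data"].items() if k in known})
```

A consumer written this way keeps working when a producer adds a field under the same major version, and fails loudly when it receives a `.v2` event it has not been updated for.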

Outbox / Inbox / Idempotency¶

Outbox (producer)

Transactionally stores pending events with business state changes.
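Background dispatcher publishes to the bus with retry/backoff. Publishing is at-least-once: a crash between a successful publish and marking the outbox row as dispatched causes a re-publish, so consumers must de-duplicate (see Inbox). Service Bus duplicate detection, with the event id used as the message id, can narrow but not eliminate the duplicate window.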

Inbox (consumer)

Stores processed event ids/keys to de-duplicate.
Idempotency key chosen per aggregate (e.g., subscriptionId, userId, flagName@version).

Reentrancy rules

Handlers must be idempotent and side-effect-safe.
Use sagas for long-running processes; each step commits with an idempotency boundary.

DLQ & Replay¶

DLQ contracts: poison messages are never modified; metadata records the failures and handler stack.
Replay tools: Operator-driven jobs pull DLQ batches → run through isolation workers with circuit breakers and quarantine on repeated failure.
Observability: DLQ depth, age, and replay success rate are first-class metrics.
Retention: DLQs retained ≥ 30 days (configurable); Audit retains summary references.

Sample Events (JSON)¶

1) Tenant Created

{
  "type": "tenant.created.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3C4D",
  "source": "tenant-svc",
  "specversion": "1.0",
  "time": "2025-09-29T10:15:30Z",
  "traceId": "00-7e0d...-01",
  "tenantId": "t-123",
  "edition": "Standard",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "t-123",
  "data": {
    "name": "Acme Ltd",
    "region": "EU-WEST",
    "ownerUserId": "u-789"
  }
}

2) Tenant Updated

{
  "type": "tenant.updated.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3C5E",
  "source": "tenant-svc",
  "specversion": "1.0",
  "time": "2025-09-29T10:17:00Z",
  "traceId": "00-7e0d...-02",
  "tenantId": "t-123",
  "edition": "Standard",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "t-123",
  "data": {
    "changes": {
      "edition": { "old": "Free", "new": "Standard" }
    }
  }
}

3) Subscription Activated

{
  "type": "billing.subscription.activated.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3C6F",
  "source": "billing-svc",
  "specversion": "1.0",
  "time": "2025-09-29T11:00:00Z",
  "traceId": "00-9a1c...-01",
  "tenantId": "t-123",
  "edition": "Enterprise",
  "schemaVersion": "1.1.0",
  "partitionKey": "t-123",
  "key": "sub-5566",
  "data": {
    "subscriptionId": "sub-5566",
    "plan": "Enterprise",
    "startDate": "2025-10-01"
  }
}

4) Usage Meter Recorded

{
  "type": "usage.meter.recorded.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3C7G",
  "source": "usage-svc",
  "specversion": "1.0",
  "time": "2025-09-29T11:05:12Z",
  "traceId": "00-bb2d...-03",
  "tenantId": "t-123",
  "edition": "Enterprise",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "meter:api_calls:2025-09-29T11:05:00Z",
  "data": {
    "meter": "api_calls",
    "amount": 37,
    "windowStart": "2025-09-29T11:05:00Z",
    "windowSizeSec": 60
  }
}

5) Config Flag Updated

{
  "type": "config.flag.updated.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3C8H",
  "source": "config-svc",
  "specversion": "1.0",
  "time": "2025-09-29T11:10:00Z",
  "traceId": "00-cc3e...-01",
  "tenantId": "t-123",
  "edition": "Enterprise",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "flag:betaFeature",
  "data": {
    "flag": "betaFeature",
    "value": true,
    "actor": "u-42"
  }
}

6) User Invited

{
  "type": "identity.user.invited.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3C9J",
  "source": "identity-svc",
  "specversion": "1.0",
  "time": "2025-09-29T11:12:34Z",
  "traceId": "00-11aa...-01",
  "tenantId": "t-123",
  "edition": "Standard",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "u-42",
  "data": {
    "userId": "u-42",
    "email": "ada@example.com",
    "roles": ["member"]
  }
}

7) Notification Delivered

{
  "type": "notify.message.delivered.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3D0J",
  "source": "notifications-svc",
  "specversion": "1.0",
  "time": "2025-09-29T11:20:00Z",
  "traceId": "00-22bb...-05",
  "tenantId": "t-123",
  "edition": "Standard",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "msg-9012",
  "data": {
    "messageId": "msg-9012",
    "channel": "email",
    "template": "welcome",
    "status": "delivered"
  }
}

8) Audit Action Logged

{
  "type": "audit.action.logged.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3D1K",
  "source": "audit-svc",
  "specversion": "1.0",
  "time": "2025-09-29T11:22:10Z",
  "traceId": "00-33cc...-07",
  "tenantId": "t-123",
  "edition": "Standard",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "audit:2025-09-29:01",
  "data": {
    "actor": "u-42",
    "action": "TENANT_UPDATE",
    "resource": "Tenant/t-123",
    "result": "success"
  }
}

Contract Governance¶

Source of Truth: contracts/events/*.json with schema definitions and examples.
Review Process: Producer PR must include schema update + changelog; consumer teams subscribe to contract watch alerts.
Linting: CI validates envelope headers, allowed field names, and version policies.
Deprecation: Old types remain for a defined window; producers publish both old/new until consumers confirm adoption.

Observability & SLOs¶

Signal	Target
Event publishing success	≥ 99.99%
End-to-end lag p95	≤ 60s (producer commit → consumer handle)
Replay success rate	≥ 99%
DLQ age p95	≤ 15m
Duplicate handling incidents	0 (idempotent consumers)

Spans include producer.service, consumer.service, type, tenantId, key. Metrics cover topic depth, subscription lag, DLQ size/age, and handler error rates.

Failure Modes & Mitigations¶

Duplicate deliveries: Inbox + idempotent handlers; de-duplicate on the event id (or on key plus a sequence/version), since distinct events for the same entity share a key.
Schema drift: Tolerant readers; canary consumers validated in pre-prod; contract tests in CI.
Bus outage/backlog: Producer backpressure, KEDA scale-out of consumers, DLQ thresholds with alerting.
Poison messages: Quarantine to DLQ on max attempts; root-cause analysis required before replay.
Cross-tenant leakage risks: Envelope validator rejects events with a missing or mismatched tenantId; audit every violation as SEV-1.

Solution Architect Notes¶

Prefer topic-per-domain plus subscription-per-consumer to keep ownership clear.
Treat events as write-optimized facts; resist synchronous request/response coupling.
Keep payloads lean and stable; link to resources rather than embedding large objects.
Make replay safe by ensuring handlers are pure functions over input + idempotent side effects.

Service Taxonomy & Interfaces¶

Purpose¶

Define the platform service catalog aligned with bounded contexts, including responsibilities, exposed interfaces (REST/events/webhooks), storage choices, SLO posture, and cross-cutting constraints (multi-tenancy, security, observability). Provide component-level sketches for representative services to guide implementation.

Catalog by Bounded Context¶

Context	Service	Core Responsibility	Interfaces	Persistence (default)	Notes
SaaS Core Metadata	Metadata API	Products, editions, features, entitlements; pack composition	REST (admin), Events	Azure SQL	Upstream for Config, Billing, Identity
Identity	IdP (OpenIddict)	OIDC/OAuth2, roles/scopes, federation	OIDC, REST, Events	Azure SQL	SCIM (Enterprise)
Tenant Management	Tenant API, Tenant Worker	Tenant lifecycle, residency, directory, promotions	REST, Events, Jobs	Azure SQL	Authoritative `tenantId`
Billing	Billing API, Billing Saga	Plans, subscriptions, invoices, payment provider ACL	REST, Events, Webhooks	Azure SQL	Sagas orchestrate payments
Config & Feature Flags	Config API, Flag Evaluator	Flags, edition overrides, kill switches	REST, Events	Azure SQL + Redis	Low-latency evaluation
Usage & Metering	Usage Ingest, Usage Query	Meter capture, quota checks, aggregates	Events (in), REST (read)	Azure SQL (+ cold storage)	Emits `usage.meter.recorded`
Notifications	Notify API, Delivery Worker	Email/SMS/Webhooks, templates, branding	REST, Events, Webhooks	MongoDB + Queue	Provider ACLs
Audit & Compliance	Audit Append, Audit Query	Immutable append-only log, exports	Events (append), REST (read)	Azure SQL	Long retention
AI Orchestration	AI Orchestrator, Agent Workers	Agentic flows, scaffolding, test/doc generation	REST, Jobs, Events	Blob/Queue	Guardrails & audit

Interface conventions

REST: external/administrative commands and strongly consistent queries.
Events: domain facts, outbox/inbox; topic-per-domain.
Webhooks: signed outbound notifications for tenant systems (Billing/Notify).
Jobs: KEDA/Hangfire for scheduled tasks and DLQ replayers.

Cross-Cutting Constraints¶

Multi-tenancy: All operations scoped by tenantId. Repository layer enforces RLS/guards.
Security: OIDC at edge, mTLS inside, least privilege, no static secrets (workload identity).
Observability: OTel spans/logs/metrics with tenantId, edition, traceId; dashboards per service.
SLO posture (baseline): 99.9% availability; p95 read ≤ 200 ms / write ≤ 350 ms; event lag p95 ≤ 60 s.

Exemplar 1 — Metadata API (SaaS Core Metadata)¶

Responsibilities

Manage Products, Editions, Features, Entitlements, Quotas, and Pack composition.
Act as source of truth for billing plans, entitlement catalogs, and edition overlays.
Publish canonical events when definitions change (e.g., product.updated, edition.created).

Exposed Interfaces

REST (admin/product owner):
- POST /api/metadata/products (create product)
- POST /api/metadata/editions (define edition)
- POST /api/metadata/features (define feature/entitlement)
- GET /api/metadata/products/{id}
Events (producer):
- metadata.product.created.v1
- metadata.edition.created.v1
- metadata.feature.updated.v1

Storage

Azure SQL: relational model for products, editions, features, entitlements.
Immutable history tables for auditability.

Component Sketch

flowchart LR
  API[Metadata API] --> APP[App Layer / Validation]
  APP --> REPO[Metadata Repository]
  REPO --> SQL[(Azure SQL)]
  APP --> OUTBOX[Outbox Dispatcher]
  OUTBOX --> BUS[[Service Bus]]
  style API fill:#7d5dfc,stroke:#4b33b3,color:#fff

Security & Tenancy

Administrative operations require metadata.manage scope, reserved for platform/product owners.
Queries may be global (cross-tenant), but tenant-facing tokens only receive read-scoped access to permitted metadata.

Observability

Spans: metadata.createProduct, metadata.updateEdition.
Metrics: product definition latency, cache hit ratio for entitlement lookups.

Exemplar 2 — Tenant Service (Tenant Management)¶

Responsibilities

Create/activate/suspend/delete tenants.
Manage residency, directory, isolation level (pooled/schema/db).
Seed defaults (edition, flags) and emit tenant.* events.

Exposed Interfaces

REST (admin):
- POST /api/tenants (create)
- PATCH /api/tenants/{id} (suspend/resume)
- POST /api/tenants/{id}:promote (upgrade isolation level)
Events: tenant.created.v1, tenant.updated.v1, tenant.promoted.v1

Storage

Azure SQL (authoritative tenant table, residency, isolation level).
Redis (Tenant Directory cache).

Component Sketch

flowchart LR
  API[Tenant API] --> APP[Policies / Directory]
  APP --> REPO[Tenant Repository]
  REPO --> SQL[(Azure SQL)]
  APP --> OUTBOX["Outbox -> Bus"]
  APP --> JOBS[Promotion Worker]
  JOBS --> SQL
  style API fill:#1f6,stroke:#0b5,color:#fff

Exemplar 3 — Billing Service (with Saga Orchestrator)¶

Responsibilities

Manage plans/editions, subscriptions, invoicing.
Integrate with payment providers via Payment ACL.
Coordinate long-running billing flows (activation, retries, dunning) via Saga.

Exposed Interfaces

REST (admin/tenant):
- POST /api/billing/subscriptions (create/upgrade/downgrade)
- GET /api/billing/subscriptions/{id}
- POST /api/billing/invoices/{id}:pay
Events (producer): billing.subscription.activated.v1, .suspended.v1, .invoiced.v1, .payment.received.v1
Events (consumer): usage.meter.recorded.v1, tenant.created.v1
Webhooks (outbound): signed invoice.created, payment.failed

Storage

Azure SQL (subscriptions, invoices, ledger)
Optional blob for invoice PDFs

Component Sketch (Saga)

flowchart TB
  CMD[Billing API] --> SM[Subscription Saga]
  SM --> ACL[Payment Provider ACL]
  SM --> OUTBOX[Outbox]
  SM --> SUBREPO[Subscription Repo]
  ACL --> PAYEXT[(Payment Gateway)]
  SUBREPO --> SQL[(Azure SQL)]
  OUTBOX --> BUS[[Service Bus]]
  BUS -->|usage.meter.recorded| SM
  style SM fill:#f8a,stroke:#c06,color:#222
  style BUS fill:#357,stroke:#234,color:#fff

Saga Flow (high-level)

Receive subscription.create.
Reserve plan; request payment via ACL.
On success: persist, publish subscription.activated.
On failure: retry (exponential backoff) → dunning → subscription.suspended.

Security & Tenancy

All commands must include tenantId; ACL enforces signature verification with provider.
Monetary amounts validated server-side; no client-trusted totals.

Observability

Spans: billing.saga.step.*, payment.acl.request.
Metrics: authorization approval rate, dunning success rate, event lag.

Exemplar 4 — Config & Feature Flags Service¶

Responsibilities

Store and evaluate feature flags, edition overrides, kill switches.
Provide low-latency decision APIs for UI/services.
Broadcast changes as events for cache invalidation.

Exposed Interfaces

REST:
- GET /api/config/flags/{key} (evaluate with context)
- POST /api/config/flags (create/update)
- POST /api/config/overrides (tenant/edition-specific)
Events (producer): config.flag.updated.v1, config.override.updated.v1
Events (consumer): billing.subscription.activated.v1, tenant.created.v1

Storage

Azure SQL (authoritative flag definitions, overrides)
Redis (evaluation caches with tenant scoping)

Component Sketch

flowchart LR
  API[Config API] --> APP[Evaluator/Policy Engine]
  APP --> CACHE[(Redis)]
  APP --> REPO[Config Repo]
  REPO --> SQL[(Azure SQL)]
  APP --> OUTBOX["Outbox -> Service Bus"]
  style API fill:#19a974,stroke:#0e7a55,color:#fff

Security & Tenancy

Mutations require config.manage + tenant_admin role.
Evaluation requires authenticated context; anonymous evaluation disabled except for public flags.

Observability

Spans: config.evaluate, config.invalidate.
Metrics: evaluation latency (target p95 ≤ 5 ms), cache hit ratio, invalidation fan-out time.

Interface Outlines (selected contracts)¶

Tenant API

POST /api/tenants → 201 Created + tenant.created.v1
PATCH /api/tenants/{id} → 200 OK + tenant.updated.v1
POST /api/tenants/{id}:promote → 202 Accepted (async job)

Billing API

POST /api/billing/subscriptions → 202 Accepted (saga) + events
GET /api/billing/subscriptions/{id} → 200 OK

Config API

GET /api/config/flags/{key}?tenantId=... → 200 OK { "value": true, "reason": "tenant-override" }
POST /api/config/flags → 201 Created + config.flag.updated.v1
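The evaluation response above carries both a value and a reason. A minimal Python sketch of how such a decision could be layered is shown below. The precedence order (kill switch, then user, tenant, edition, then default), the flag dictionary layout, and every reason string other than "tenant-override" are assumptions for illustration, not the Config API's confirmed behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalContext:
    tenant_id: str
    edition: str
    user_id: Optional[str] = None


def evaluate_flag(flag, ctx):
    """Return {"value", "reason"} for a flag definition and an evaluation context."""
    # A kill switch wins over every override (assumed semantics).
    if flag.get("killed"):
        return {"value": False, "reason": "kill-switch"}
    # Most specific scope first: user, then tenant, then edition.
    layers = (
        ("users", ctx.user_id, "user-override"),
        ("tenants", ctx.tenant_id, "tenant-override"),
        ("editions", ctx.edition, "edition-override"),
    )
    for field, key, reason in layers:
        overrides = flag.get(field, {})
        if key is not None and key in overrides:
            return {"value": overrides[key], "reason": reason}
    return {"value": flag["default"], "reason": "default"}
```

Returning the reason alongside the value keeps evaluations explainable in traces and support tooling.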

Solution Architect Notes¶

Metadata API acts as upstream catalog: Billing, Config, and Identity enrichments must not hardcode editions or features.
Always seed tenant entitlements from Metadata events → Config → Identity claims.
Keep Metadata auditable and append-only where possible; edition/feature definitions should be traceable across time.
Treat Metadata changes as high-risk operations; enforce RBAC + approval workflows.

Data & Storage Architecture¶

Purpose¶

Define a portable, Azure-first data architecture that supports multi-tenancy, high throughput, and auditability without locking products into one storage technology. Standardize relational vs document choices, partitioning strategies, read models, caching/search integration, and retention/archival so generated solutions can scale from trials to large enterprises with predictable cost and reliability.

Store Selection Principles¶

Concern	Default Choice	Alternatives	Rationale
System-of-record, strong consistency	Azure SQL (PostgreSQL compatible alternative)	Managed Postgres/MySQL	ACID, schema control, transactional outbox
Large/variable payloads (templates, messages)	MongoDB (optional)	Cosmos DB (Mongo API), Azure Blob	Flexible schema, document access
Caching / fast eval (flags, sessions)	Redis	In-memory per pod (with eviction)	Low-latency, TTL, pub/sub invalidation
Search / discovery	Azure AI Search (opt-in)	Elastic/OpenSearch	Full-text, facets, suggesters
Analytics / cold storage	Blob Storage (Parquet)	ADLS Gen2	Cheap, durable, columnar query via Spark/SQL
Event backbone	Azure Service Bus	Kafka	Durable pub/sub, topics, DLQs

Rule of thumb: Write models → relational, large/optional payloads → document/blob, query flexibility → search/read models, analytics/retention → blob.

Logical Data Model (core entities)¶

erDiagram
  TENANT ||--o{ USER : has
  TENANT ||--o{ SUBSCRIPTION : owns
  TENANT ||--o{ CONFIG_FLAG : configures
  TENANT ||--o{ USAGE_RECORD : generates
  TENANT ||--o{ AUDIT_EVENT : emits

  PRODUCT ||--o{ EDITION : offers
  EDITION ||--o{ ENTITLEMENT : contains
  SUBSCRIPTION }o--|| EDITION : references
  SUBSCRIPTION }o--|| TENANT : belongs_to
  ENTITLEMENT }o--|| FEATURE : grants

  USER {
    string userId PK
    string tenantId FK
    string email
    string roleSet
  }
  TENANT {
    string tenantId PK
    string name
    string region
    string isolationLevel "pooled, schema, or database"
    string status "active, suspended, or deleted"
  }
  PRODUCT {
    string productId PK
    string name
  }
  EDITION {
    string editionId PK
    string productId FK
    string name "Free, Standard, Enterprise, or Custom"
  }
  FEATURE {
    string featureId PK
    string name
  }
  ENTITLEMENT {
    string entitlementId PK
    string editionId FK
    string featureId FK
    json   limits
  }
  SUBSCRIPTION {
    string subscriptionId PK
    string tenantId FK
    string editionId FK
    datetime startDate
    datetime endDate
    string status "active, suspended, or canceled"
  }
  CONFIG_FLAG {
    string flagKey PK
    string tenantId FK
    json   value
    string scope "global, edition, tenant, or user"
    datetime updatedAt
  }
  USAGE_RECORD {
    string usageId PK
    string tenantId FK
    string meter
    bigint amount
    datetime windowStart
    int windowSec
  }
  AUDIT_EVENT {
    string auditId PK
    string tenantId FK
    string actor
    string action
    string resource
    datetime occurredAt
    json   details
  }

Metadata (Product/Edition/Feature/Entitlement) drives Subscription and Config decisions; Usage feeds Billing; Audit is append-only.

Physical Partitioning & Isolation¶

Per-Context baseline

Context	Physical Store	Partition Key	Secondary Partition	Notes
Identity / Tenant / Billing / Config	Azure SQL	`tenantId`	`id` per table	RLS/tenant guards or DAL filters
Notifications (payloads/templates)	MongoDB	`tenantId`	`templateId`	Large blob-like docs, TTL indices
Usage (raw)	Azure SQL (hot) + Blob (cold)	`tenantId` + time	meter	Hot window (≤ 90 days), compaction
Audit	Azure SQL (append-only)	`tenantId` + time	actor	Immutable; export to blob for eDiscovery

Isolation levels

Pooled (default): all tenants share schema, enforced by RLS/guards; indexed on (tenantId, <business key>).
Schema-per-tenant: separate schema for hot tenants; connection factory selects schema based on Tenant Directory.
Database-per-tenant: separate DB + network isolation for enterprise/regulatory cases.
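The three isolation levels can be pictured as a routing decision made by the connection factory. The Python sketch below assumes a hypothetical in-memory Tenant Directory and made-up database and schema naming (`saas_pooled`, `tenant_<id>`); tenant guards would still apply in every case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionTarget:
    database: str
    schema: str


# Hypothetical Tenant Directory contents: tenantId -> isolation level.
TENANT_DIRECTORY = {
    "t_small": "pooled",
    "t_hot": "schema",
    "t_bank": "database",
}


def resolve_target(tenant_id, directory=TENANT_DIRECTORY, shared_db="saas_pooled"):
    """Pick the database and schema for a tenant from its isolation level."""
    level = directory.get(tenant_id)
    if level is None:
        raise LookupError(f"unknown tenant: {tenant_id}")
    if level == "pooled":
        return ConnectionTarget(shared_db, "dbo")  # shared schema, RLS/guards filter rows
    if level == "schema":
        return ConnectionTarget(shared_db, f"tenant_{tenant_id}")
    if level == "database":
        return ConnectionTarget(f"tenant_{tenant_id}", "dbo")
    raise ValueError(f"unsupported isolation level: {level}")
```

Because the decision is driven by directory data, promoting a tenant becomes a data change plus migration rather than a code change.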

Promotion triggers

Hot partition (p99 latency, lock contention), data volume thresholds, contractual isolation, or regulatory residency.

Read Models & CQRS¶

Write models: normalized, transactional tables per bounded context.
Read models: denormalized projections built from events for UX queries and support dashboards (e.g., “tenant overview,” “subscription health,” “usage summary”).
Projections: idempotent consumers with inbox and checkpointing; rebuild on demand.
Search adapters: project to Azure AI Search indices (e.g., tenants, invoices, audit summaries) with indexer jobs and soft deletes.
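A projection that is idempotent, checkpointed, and rebuildable could look roughly like the Python sketch below. The event fields (`id`, `seq`, `tenantId`, `status`, `isolationLevel`) and the in-memory inbox are assumptions for illustration; only the event type names come from the Tenant API section.

```python
class TenantOverviewProjection:
    """Denormalized 'tenant overview' read model built from tenant events."""

    def __init__(self):
        self.rows = {}       # tenantId -> read-model row
        self.checkpoint = 0  # highest sequence number applied so far
        self._inbox = set()  # ids of events already applied

    def apply(self, event):
        # Inbox check: a redelivered event must not change the read model twice.
        if event["id"] in self._inbox:
            return False
        row = self.rows.setdefault(
            event["tenantId"], {"status": "unknown", "isolationLevel": "pooled"}
        )
        if event["type"] == "tenant.created.v1":
            row["status"] = "active"
        elif event["type"] == "tenant.updated.v1":
            row["status"] = event["status"]
        elif event["type"] == "tenant.promoted.v1":
            row["isolationLevel"] = event["isolationLevel"]
        self._inbox.add(event["id"])
        self.checkpoint = max(self.checkpoint, event["seq"])
        return True


def rebuild(events):
    """Rebuild the read model from scratch by replaying events in sequence order."""
    projection = TenantOverviewProjection()
    for event in sorted(events, key=lambda e: e["seq"]):
        projection.apply(event)
    return projection
```

Replaying the same stream, with or without duplicates, yields the same rows, which is what makes an on-demand rebuild safe.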

Indexing & Query Patterns¶

Composite indices: (tenantId, naturalKey) as leading index across write tables.
Time-series: USAGE_RECORD (tenantId, windowStart DESC) for sliding windows; partitioned aggregation tables for hourly/daily rollups.
Audit: (tenantId, occurredAt DESC) covering actor, action, resource.
Avoid cross-tenant joins; analytics should use aggregations in a separate pipeline.

Caching Strategy¶

Cache	Scope & Key	Invalidation	TTL
Tenant Directory	`tenant:{tenantId}:dir`	on tenant.updated/promoted	5–30s
Flag Evaluation	`flag:{tenantId}:{flagKey}`	`config.flag.updated`	30–120s
Entitlement Snapshot	`ent:{tenantId}:{editionId}`	`billing.subscription.*` / `metadata.*`	5m
OIDC JWKs & Metadata	`idp:jwks`	rotation events	15m

Prefer cache-aside; never cache PII without encryption.
Use Redis hash keys for compact, multi-field storage and partial invalidation.
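The cache-aside pattern for flag evaluation can be sketched as follows. This Python example uses an in-memory dictionary in place of Redis, the `flag:{tenantId}:{flagKey}` key format and the 30 to 120 s TTL range from the table above, and a randomized TTL inside that range; the class and parameter names are assumptions.

```python
import random
import time


class FlagCache:
    """Cache-aside lookup with a jittered TTL; a dict stands in for Redis."""

    def __init__(self, loader, min_ttl=30.0, max_ttl=120.0, clock=time.monotonic, rng=None):
        self._loader = loader            # called on a miss: loader(tenant_id, flag_key)
        self._min_ttl = min_ttl
        self._max_ttl = max_ttl
        self._clock = clock
        self._rng = rng or random.Random()
        self._store = {}                 # key -> (value, expires_at)

    @staticmethod
    def _key(tenant_id, flag_key):
        return f"flag:{tenant_id}:{flag_key}"

    def get(self, tenant_id, flag_key):
        key = self._key(tenant_id, flag_key)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]              # hit
        value = self._loader(tenant_id, flag_key)  # miss: read the source of truth
        # Randomizing the TTL spreads expirations and reduces stampedes.
        ttl = self._rng.uniform(self._min_ttl, self._max_ttl)
        self._store[key] = (value, now + ttl)
        return value

    def invalidate(self, tenant_id, flag_key):
        """Drop one entry, e.g. when config.flag.updated arrives."""
        self._store.pop(self._key(tenant_id, flag_key), None)
```

Keys always include the tenantId, so one tenant's cached decision can never be served to another.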

Data Retention & Archival¶

Data Class	Hot Retention	Cold Retention	Storage	Notes
Usage (raw)	90 days	2 years (Parquet)	Blob	Aggregated hourly/daily kept hot
Audit	1 year hot	7 years cold	SQL + Blob	Immutable, exportable bundles
Notifications payloads	30 days	N/A	Mongo	Store minimal PII; tokenize where possible
Invoices/PDFs	1 year hot	7 years cold	SQL + Blob	Legal/compliance governed
Config history	180 days	N/A	SQL	Versioned changes for debugging
Tokens/session	24h	N/A	Redis	No PII; rotate frequently

Retention windows are edition- and region-configurable; legal holds suspend deletion jobs.
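A retention job's per-record decision might be sketched as below. The windows mirror a few rows of the table above; the data class keys, the action names, and the assumption that both windows are measured from record creation are illustrative choices, not platform definitions.

```python
# (hot_days, cold_days) per data class; None means there is no cold tier.
RETENTION_DAYS = {
    "usage_raw": (90, 730),            # 90 days hot, 2 years cold
    "audit": (365, 2555),              # 1 year hot, 7 years cold
    "notification_payload": (30, None),
    "config_history": (180, None),
}


def retention_action(data_class, age_days, legal_hold=False):
    """Decide what a retention job should do with a record of the given age."""
    hot_days, cold_days = RETENTION_DAYS[data_class]
    if age_days <= hot_days:
        return "keep-hot"
    if cold_days is not None and age_days <= cold_days:
        return "archive-cold"
    # Past every window: delete, unless a legal hold suspends deletion.
    return "hold" if legal_hold else "delete"
```

In practice the table would be loaded from edition and region configuration rather than hardcoded.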

Backup, DR & Consistency¶

SQL: PITR enabled; geo-redundant backups; weekly full + daily diff + 5-min log backups (policy baseline).
Mongo: point-in-time snapshots; validate TTL indexes.
Blob: versioning and soft-delete enabled; lifecycle rules to move to cool/archive tiers.
RPO/RTO: platform baseline RPO ≤ 15 min, RTO ≤ 4 h (overrides for Enterprise).
Consistency: outbox ensures atomic write + event; consumers are at-least-once, idempotent.
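The outbox guarantee in the last bullet can be demonstrated with a small, self-contained Python sketch using SQLite in place of Azure SQL. Table names, column names, and the `publish` callback are assumptions; the point is that the business row and the event row share one transaction, and that dispatch is at-least-once.

```python
import json
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE tenants (tenant_id TEXT PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            type TEXT NOT NULL,
            payload TEXT NOT NULL,
            dispatched INTEGER NOT NULL DEFAULT 0
        );
    """)
    return db


def create_tenant(db, tenant_id, name):
    # One transaction: the tenant row and the event row commit together or not at all.
    with db:
        db.execute("INSERT INTO tenants VALUES (?, ?)", (tenant_id, name))
        db.execute(
            "INSERT INTO outbox (type, payload) VALUES (?, ?)",
            ("tenant.created.v1", json.dumps({"tenantId": tenant_id})),
        )


def dispatch_pending(db, publish):
    """Publish undispatched events in order; returns how many rows were pending."""
    rows = db.execute(
        "SELECT id, type, payload FROM outbox WHERE dispatched = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        # If publish raises, the row stays pending and is retried on the next run,
        # so consumers may see duplicates and must be idempotent.
        publish(event_type, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET dispatched = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

A crash between `publish` and the `UPDATE` results in a second publish of the same event, which is why the consistency model is at-least-once rather than exactly-once.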

Security & Privacy¶

Encryption at rest: TDE for SQL, SSE for Blob, disk encryption for Mongo; CMEK where required.
Encryption in transit: TLS 1.2/1.3; mTLS inside cluster; Private Link for data stores.
PII minimization: store only necessary attributes; logs are redacted at sink.
Row-Level Security (RLS): preferred; otherwise enforce DAL guards and property-based tests.
Secrets: no inline secrets in tables; use Key Vault references; rotate keys ≤ 90 days.

Data Lifecycle & Governance¶

Schemas as code: migrations via EF Core/NHibernate + migration approvals (gated in CI).
CDC: used for online migrations, projection rebuilds, and promotion (pooled→schema/db).
Data quality checks: constraints + lightweight DQ jobs (nullability, referential integrity, outliers).
Change review: ADR required for breaking schema changes; contract tests validate read models.
Right to erasure: orchestrated delete with tombstones; audit holds tokenized references.

Example Physical Topology (simplified)¶

flowchart LR
  subgraph Hot Path
    SQL[("Azure SQL<br/>Write Models")]
    REDIS[(Redis Cache)]
    BUS[[Service Bus]]
  end

  subgraph Projections
    CONSUMER["Projectors (Inbox/Idempotent)"]
    READDB[(SQL Read Models)]
    SEARCH[(Azure AI Search)]
  end

  subgraph Cold Path
    BLOB[("Blob Storage<br/>Parquet/Exports")]
    ANALYTICS[(Spark/SQL)]
  end

  BUS --> CONSUMER
  CONSUMER --> READDB
  CONSUMER --> SEARCH
  CONSUMER --> BLOB
  SQL <-- cache-aside --> REDIS

Observability & SLOs¶

Signals: DB p95 latency, lock wait time, failed migrations, cache hit %, projection lag, index health, DLQ depth for projectors.
SLOs:
- Write p95 ≤ 350 ms; Read p95 ≤ 200 ms
- Projection lag p95 ≤ 60 s
- Cache hit ≥ 85% for flag evaluation
- Backup success 100%; restore drill quarterly

Failure Modes & Mitigations¶

Failure	Impact	Mitigation
Hot partition / noisy neighbor	Latency spikes	Promote tenant to schema/DB; shard by tenant; add covering indexes
Long-running transactions	Lock contention	Break writes into smaller batches; use optimistic concurrency
Projection backlog	Stale read models	KEDA-scale projectors; partial rebuild by tenant; prioritize critical topics
Cache stampede	Thundering herd	Request coalescing; jittered TTL; background refresh
Schema drift	Consumer breaks	Contract tests; additive changes; deprecation windows
Data corruption	Incident/rollback	PITR restore to side DB; compare via checksums; rehydrate projections

Solution Architect Notes¶

Start with pooled SQL and event-driven projections; add document/search only for proven needs.
Keep natural keys stable to enable safe migration/rebuilds.
Make retention a product setting—not hardcoded—so legal/compliance overlays can adjust.
Prefer append-only (Audit/Usage) plus compaction for analytics; avoid destructive changes in hot paths.

Messaging & Integration Patterns¶

Purpose¶

Standardize asynchronous communication across the platform using MassTransit with Azure Service Bus (ASB). Define topologies, routing conventions, retry/backoff/jitter, saga orchestration vs choreography, and compensation so services remain loosely coupled, reliable, and observable under load.

Topology & Conventions¶

Domain-first topology

Topic-per-domain: tenant-events, billing-events, config-events, usage-events, identity-events, notifications-events, audit-events.
Subscription-per-consumer: one subscription per logical consumer service (optionally per tenant segment or feature).
DLQ-per-subscription: automatic dead-letter queues with invariant retention.

Naming

Exchange/Topic: <domain>-events
Subscription: <consumer-svc>.<purpose> (e.g., billing-svc.rating, config-svc.invalidate)
Queue (commands): <svc>-cmd (optional; we favor events over commands across contexts)
Saga state store tables: <svc>_saga_<name>

Message envelopes (recap)

Required headers: type, id, time, traceId, tenantId, edition, schemaVersion, partitionKey, key (idempotency).
Partitioning: tenantId is the default partitionKey, which improves locality and per-tenant ordering.
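A handler-side envelope check based on the required headers above might look like this minimal Python sketch. The header list is taken from the recap; the exception type, function name, and the optional expected-tenant comparison are assumptions.

```python
REQUIRED_HEADERS = (
    "type", "id", "time", "traceId", "tenantId",
    "edition", "schemaVersion", "partitionKey", "key",
)


class EnvelopeError(ValueError):
    """Raised when an event envelope violates a required invariant."""


def validate_envelope(headers, expected_tenant_id=None):
    """Reject envelopes with missing headers or a tenantId that does not match the caller's scope."""
    missing = [name for name in REQUIRED_HEADERS if not headers.get(name)]
    if missing:
        raise EnvelopeError("missing headers: " + ", ".join(missing))
    if expected_tenant_id is not None and headers["tenantId"] != expected_tenant_id:
        raise EnvelopeError("tenantId mismatch")
    return headers
```

Running this check before any side effect keeps malformed or cross-tenant events from reaching business logic.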

MassTransit Setup Patterns (C# excerpts)¶

Bus configuration (ASB + outbox)

builder.Services.AddMassTransit(x =>
{
    x.SetKebabCaseEndpointNameFormatter();

    x.AddEntityFrameworkOutbox<AppDbContext>(o =>
    {
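        o.QueryDelay = TimeSpan.FromSeconds(1);
        o.DuplicateDetectionWindow = TimeSpan.FromMinutes(10);
        o.UseSqlServer(); // lock statement provider matching Azure SQL
        o.UseBusOutbox(); // publish/send calls are stored in the outbox and delivered after commit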
    });

    // Consumers, Sagas, Activities
    x.AddConsumersFromNamespaceContaining<TenantCreatedConsumer>();
    x.AddSagaStateMachine<SubscriptionSaga, SubscriptionState>()
        .EntityFrameworkRepository(r =>
        {
            r.ConcurrencyMode = ConcurrencyMode.Optimistic;
            r.ExistingDbContext<AppDbContext>(); // reuse the DbContext that hosts the outbox tables
        });

    x.UsingAzureServiceBus((context, cfg) =>
    {
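        // Connection string shown for brevity; the platform baseline is workload identity (no static secrets).
        cfg.Host(builder.Configuration["ServiceBus:ConnectionString"]);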
        cfg.MessageTopology.SetEntityNameFormatter(new DomainTopicFormatter());

        cfg.UseMessageRetry(r =>
            r.Exponential(5, TimeSpan.FromMilliseconds(200), TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(50)));

        cfg.UseInMemoryOutbox(); // holds outgoing messages until the consumer completes; dedupe is the inbox's job
        cfg.UseConcurrencyLimit(64);
        cfg.ConfigureEndpoints(context);
    });
});

Consumer template (idempotent + inbox)

public class TenantCreatedConsumer : IConsumer<TenantCreated>
{
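    private readonly Inbox _inbox; // platform inbox abstraction, injected via DI

    public TenantCreatedConsumer(Inbox inbox) => _inbox = inbox;
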
    public async Task Consume(ConsumeContext<TenantCreated> ctx)
    {
        if (!await _inbox.TryBeginAsync(ctx.Message.Id)) return; // dedupe
        try
        {
            // side-effect safe work
            await HandleAsync(ctx.Message);
            await _inbox.CompleteAsync(ctx.Message.Id);
        }
        catch (Exception ex)
        {
            await _inbox.FailAsync(ctx.Message.Id, ex);
            throw; // allow retry policy to engage
        }
    }
}

Retry, Backoff & Jitter¶

Scenario	Policy	Max Attempts	Initial Delay	Max Delay	Notes
Transient network	Exponential + jitter	5	200 ms	10 s	Default consumer policy
Rate-limited upstream	Decorrelated jitter	6	500 ms	30 s	Honor `Retry-After`
Idempotent publish	Linear	3	2 s	6 s	Outbox ensures once-per-change
External webhooks	Exponential + cap	8	1 s	5 min	Move to DLQ after cap
Payment ACL	Saga step specific	4	1 s	60 s	Backoff grows per failure stage

Rules

Retry only idempotent operations; non-idempotent steps must be guarded by saga state.
Add small random jitter to prevent thundering herds.
After retries exhausted → DLQ with full context and last exception chain.
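The default consumer policy from the table (5 retries, 200 ms initial delay, 10 s cap, small jitter) translates into a delay schedule like the one in this Python sketch. The doubling factor and the 50 ms jitter bound are assumptions chosen to resemble the C# excerpt above; the exact intervals MassTransit computes may differ.

```python
import random


def backoff_delays(retries=5, initial=0.2, cap=10.0, jitter=0.05, rng=None):
    """Delays in seconds before each retry: exponential growth, capped, plus random jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(retries):
        base = min(cap, initial * (2 ** attempt))
        # Jitter de-synchronizes consumers that failed at the same moment.
        delays.append(min(cap, base + rng.uniform(0.0, jitter)))
    return delays
```

Once the schedule is exhausted the message goes to the DLQ, as the rules above state.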

Orchestration vs Choreography¶

Pattern	When to use	Mechanism	Pros	Cons
Choreography	Independent reactions to a fact (e.g., `tenant.created`)	Events only	Simple, scalable, low coupling	Harder to visualize global flow
Orchestration	Multi-step, long-running business process (e.g., subscription activation)	Saga coordinates steps	Centralized state, compensations explicit	Orchestrator coupling; needs strong tests

Guideline: Prefer choreography for enrichment and projections. Use sagas only for business-critical, multi-step flows with compensations (payments, migrations).

Saga Orchestration (Billing Example)¶

State machine outline

stateDiagram-v2
  [*] --> Pending
  Pending --> Authorizing : command.received
  Authorizing --> Active : payment.captured
  Authorizing --> Dunning : payment.failed(retry_exhausted)
  Dunning --> Suspended : dunning.failed
  Active --> Suspended : subscription.payment.overdue
  Suspended --> Active : payment.captured
  Active --> [*]

Key design

Idempotency: Correlate by subscriptionId (saga key). Each event mutates state exactly once.
Compensation: If invoice created but payment fails → issue credit note, revert entitlements, emit billing.subscription.suspended.
Timeouts: Each step has a receive timeout; when exceeded, move to next compensating step (e.g., dunning).
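The state machine outline and the idempotency rule can be combined in a small transition table. The Python sketch below copies the states and triggers from the diagram; the class shape, the per-saga set of applied event ids, and the choice to ignore triggers that are invalid in the current state are assumptions.

```python
TRANSITIONS = {
    ("Pending", "command.received"): "Authorizing",
    ("Authorizing", "payment.captured"): "Active",
    ("Authorizing", "payment.failed"): "Dunning",
    ("Dunning", "dunning.failed"): "Suspended",
    ("Active", "subscription.payment.overdue"): "Suspended",
    ("Suspended", "payment.captured"): "Active",
}


class SubscriptionSaga:
    """Saga instance correlated by subscriptionId."""

    def __init__(self, subscription_id):
        self.subscription_id = subscription_id
        self.state = "Pending"
        self._applied = set()  # event ids that already mutated this saga

    def handle(self, event_id, event_type):
        """Apply an event at most once; returns True only if the state changed."""
        if event_id in self._applied:
            return False  # redelivery: already applied
        next_state = TRANSITIONS.get((self.state, event_type))
        if next_state is None:
            return False  # trigger not valid in the current state
        self._applied.add(event_id)
        self.state = next_state
        return True
```

Keeping the transitions in data makes the saga deterministic and easy to test exhaustively.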

Activities (MassTransit)

ReservePlanActivity → RequestPaymentActivity → ActivateEntitlementsActivity
On failure path: IssueCreditActivity → SuspendSubscriptionActivity

Compensation Patterns¶

Failure	Compensation	Notes
Payment captured but entitlements not activated	Refund/credit note, revoke token grants	Ensure idempotent credit issuance
Tenant promoted but mapping not switched	Roll back mapping; keep dual-writes; retry cutover	Feature flag `tenant.readonly` protects
Email sent with wrong template	Send corrective message, mark original as superseded	Immutable log kept in Audit
Usage over-reported	Emit `usage.adjustment` event; recompute invoice	Maintain adjustment ledger

Technique

Compensations are first-class commands/events with their own audit entries.
No “delete-and-forget”; always append corrective facts.

Error Handling & DLQ Strategy¶

Handler contract

Validate envelope invariants (tenant, trace, type) first; reject missing or mismatched context (SEV-1 if produced internally).
Side effects must be wrapped with transaction boundaries; record idempotency outcome.

Dead-lettering

Criteria: max delivery count exceeded, non-transient exceptions (validation, authorization), poison messages (schema mismatch).
DLQ payload: original message + headers + last exception + handler name + attempt count.
Replay: operator-driven tool with safe-mode (dry-run), rate limiters, circuit breakers, and quarantine on re-poisoning.

Monitoring

Metrics: subscription lag, handler error rate, DLQ depth/age, saga timeout count.
Alerts: threshold breaches trigger runbooks (scale consumers, pause producers, enable backpressure at gateway).

Integration Patterns (edge & third parties)¶

Webhooks (outbound): Signed (HMAC), retry with backoff up to 24h, idempotency via Event-Id, age limit (drop after TTL).
Inbound third-party callbacks: Terminate at Gateway; validate signature & age; enqueue to inbox queue for processing.
Payment ACL: Isolate providers’ SDKs; map transient vs permanent failures; unify errors to domain codes.
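Signing and verifying a webhook with HMAC, including the age check, can be sketched in Python with the standard library. The signed string layout (`eventId.timestamp.body`), the SHA-256 choice, and the 300 s age limit are assumptions for illustration; the platform's actual header names and limits are not specified here.

```python
import hashlib
import hmac
import time


def sign_webhook(secret: bytes, event_id: str, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over the event id, timestamp, and raw body."""
    message = f"{event_id}.{timestamp}.".encode("utf-8") + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret, event_id, timestamp, body, signature,
                   max_age_seconds=300, now=None):
    """Accept only fresh requests whose signature matches."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > max_age_seconds:
        return False  # too old (or too far in the future): possible replay
    expected = sign_webhook(secret, event_id, timestamp, body)
    # Constant-time comparison avoids leaking signature bytes through timing.
    return hmac.compare_digest(expected, signature)
```

Including the event id in the signed string lets the receiver use it as an idempotency key without trusting an unsigned header.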

Observability¶

Spans: publish, consume, saga.step, saga.compensate, webhook.request, webhook.retry.
Attributes: tenantId, type, key, sagaId, deliveryAttempt, queue, subscription.
Logs: structured with exception chains; no PII. Include producer.service, consumer.service.
Metrics: end-to-end event lag p95, publish success rate, handler retries, DLQ age p95, saga step durations.

Performance & Scalability¶

KEDA triggers on ASB metrics (queue length, lag); scale consumers horizontally.
Use prefetch and concurrency limits tuned per handler (e.g., heavy CPU vs I/O bound).
For hot tenants, prefer per-tenant subscriptions/queues to isolate and prioritize critical customers.

Security & Tenancy¶

Tenant scoping: reject events missing tenantId; never emit cross-tenant payloads unless flagged as analytics and routed separately.
mTLS inside the cluster; ASB credentials via workload identity.
Least privilege SAS/RBAC roles per consumer/producer; rotate keys ≤ 90 days.
Data minimization: events should reference entities, not embed sensitive data.

Solution Architect Notes¶

Use outbox everywhere—it’s the linchpin of dependable messaging.
Keep sagas lean and deterministic; external calls go through activities/ACLs with clear retry/timeout semantics.
Focus on idempotency: it’s cheaper to ensure than to diagnose duplicates in production.
Make DLQ a first-class workflow with replay tooling and tight observability—assume it will be used.

AI-First & Agentic Orchestration¶

Purpose¶

Embed safe, deterministic AI assistance into the SaaS Factory to accelerate planning, scaffolding, documentation, tests, and operational hygiene—without bypassing security, change control, or human judgment. Agents propose and scaffold; humans own approval and deployment. All agent actions are audited, observable, and reversible.

Agent Roles (factory-internal)¶

Agent	Primary Outcomes	Typical Triggers	Key Outputs
Product Blueprint Agent	Turn a product idea/recipe into an initial blueprint aligned to platform patterns	New product request; edition/pack change	HLD skeleton, context map updates, ADR drafts
Service Scaffolder Agent	Create service projects from templates (API/Worker/Saga), wiring tenancy, OTel, health, outbox	“Add service” request; new bounded context	Repo branches/PRs with solution scaffold, CI pipeline YAML
Contract & SDK Agent	Generate/validate OpenAPI & event schemas, produce language SDKs	New/changed endpoints or events	`contracts/*.yaml/json`, SDK packages, contract tests
Test Generator Agent	Propose unit/contract/E2E tests; synth checks for SLOs	New feature PRs; failing SLOs	Test projects, synthetic monitors
Docs & Runbook Agent	Produce developer and operator docs; incident runbooks	New service or ADR; post-incident tasks	`docs/`, `ops/runbooks/`
Operability Agent	Create dashboards/alerts, SLOs, chaos experiments	New service onboard; SLO drift	Grafana dashboards, alert rules, Chaos experiments
Security & Compliance Agent	Enforce guardrails (SBOM, SAST/SCA, secret scans), suggest remediations	PR validation; dependency changes	Policy reports, license notices, PR comments

Optional, tenant-facing assistants (e.g., support Q&A) are separate products behind strong data-isolation and are not assumed by the factory baseline.

Skills & Tooling (curated, least-privilege)¶

Skill/Tool	Scope (allow-listed)	Notes
Template Engine	Read `/factory/templates/**`; write to feature branch	No direct writes to `main`
Repo API	Create branches/PRs; comment on PR; no force-push	PR labels must indicate “ai-generated”
Contract Linter	Validate OpenAPI/event schemas, versioning	Fails on breaking changes without ADR
MassTransit/Outbox Scaffolder	Wire messaging boilerplate	Enforces outbox/inbox and OTel
IaC Generator (Bicep/Pulumi)	Produce env-scoped stacks	Read-only cloud; deploy only via pipeline
Policy Gate Runner	SAST/SCA, license, SBOM, secret scan	Blocks PR if violations found
Observability Pack	OTel wiring, dashboard JSON, alert rules	Requires SLO metadata
Doc/Runbook Composer	Create/update Markdown and Mermaid	Must include change rationale and rollbacks

All tools are invoked through Semantic Kernel with capabilities/RBAC matching the agent role. No tool exposes secrets or production tokens to agents.

Determinism & Safety¶

Model & prompting discipline: temperature ≈ 0, pinned models, structured prompts with explicit acceptance criteria.
Deterministic artifacts: agents produce diff-minimal changes; every artifact carries an x-origin: ai/<agent>/<hash> footer.
Reproducibility: inputs (recipe, prompts, params) logged; artifacts hashed; “re-run with same seed” supported.
Human-in-the-loop: agents cannot merge; required human review + green policy gates.
Data minimization: agents read only non-PII source; no tenant data; redaction in logs.
No live mutations: agents never call production APIs; all changes flow via PR → CI → deploy.
Denylist/Allowlist: explicit denied actions (e.g., dropping DB tables); tools must guard server-side.
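The hashing and footer conventions above could be implemented along these lines. This Python sketch assumes SHA-256, a canonical JSON encoding of the logged inputs, and a footer that carries the artifact's content hash; the document fixes only the `x-origin: ai/<agent>/<hash>` shape, so the rest is illustrative.

```python
import hashlib
import json


def inputs_hash(recipe, prompts, params):
    """Stable hash of the logged inputs; key order does not affect the result."""
    canonical = json.dumps(
        {"recipe": recipe, "prompts": prompts, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def origin_footer(agent, artifact_text):
    """Footer line recording which agent produced the artifact and its content hash."""
    digest = hashlib.sha256(artifact_text.encode("utf-8")).hexdigest()
    return f"x-origin: ai/{agent}/{digest}"


def is_reproducible(agent, first_run_text, second_run_text):
    """A re-run with the same inputs should yield a byte-identical artifact."""
    return origin_footer(agent, first_run_text) == origin_footer(agent, second_run_text)
```

A hash mismatch between two runs with identical `inputs_hash` values is the signal for the non-determinism failure mode.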

Orchestration Flow (Semantic Kernel)¶

sequenceDiagram
  participant PO as Product Owner
  participant Orchestrator as SK Orchestrator
  participant Blueprint as Blueprint Agent
  participant Scaffolder as Service Scaffolder
  participant Contracts as Contract & SDK Agent
  participant Tests as Test Generator Agent
  participant Ops as Operability Agent
  participant Repo as Git/PR
  participant CI as CI/CD Pipeline

  PO->>Orchestrator: "New product/service recipe"
  Orchestrator->>Blueprint: Plan HLD/ADRs from recipe
  Blueprint-->>Orchestrator: HLD diff & ADR drafts
  Orchestrator->>Scaffolder: Generate service skeleton(s)
  Scaffolder->>Repo: Open PR with code + pipelines
  Orchestrator->>Contracts: Generate/validate contracts
  Contracts->>Repo: Commit schemas + contract tests
  Orchestrator->>Tests: Add unit/contract/E2E tests
  Tests->>Repo: Commit tests
  Orchestrator->>Ops: OTel wiring, dashboards, alerts
  Ops->>Repo: Commit observability pack
  Repo->>CI: PR checks (SAST/SCA, SBOM, tests, policies)
  CI-->>Repo: Status (pass/fail)
  Note over Repo,PO: Human reviews; merge if green

Guardrails & Policies¶

Identity & RBAC: agents authenticate as service principals with minimal scopes; actions are auditable and reversible.
Policy gates: PRs must pass security scans, contract tests, SLO linters, and governance checks (ADR present for major changes).
Content safety: prompt-injection filters, tool-call allowlists, and output sanitizers (no secrets, no PII).
Rate & cost controls: per-agent quotas; budget alerts; offline mode fallback.
Rollbacks: every change includes a generated rollback playbook and revert.sh script when applicable.

Observability & SLOs (AI operations)¶

Signal	Target	Rationale
PR acceptance rate (ai-generated)	≥ 80%	Indicates useful, review-ready output
Policy gate pass rate	≥ 95%	Low violation rate
Mean time to scaffold service	≤ 10 min	Responsiveness
Post-merge incident rate attributable to AI changes	0	Safety baseline
Reproducibility check (hash match)	100%	Determinism

Traces include agentId, tool, operation, repo, branch, artifactHash. Logs exclude secrets and PII.

Failure Modes & Mitigations¶

Failure	Symptom	Mitigation
Hallucinated API/contract	PR fails contract lint	Contract Agent uses ground-truth schemas; block merge
Unsafe infra change	Policy gate fails	IaC policies (OPA/Conftest/Azure Policy) block; generate safer alt
Non-deterministic output	Hash mismatch	Pin model/version; lower temperature; freeze template version
Over-scaffolding (bloat)	Large diff, unclear value	Orchestrator trims plan; human prompt to narrow scope
Tool misuse	Unauthorized API calls	Tool RBAC + server-side enforcement; audit & revoke token

Solution Architect Notes¶

Treat the Orchestrator as a planner, not a do-everything agent; delegate to small, single-purpose agents with narrow tools.
Never let agents bypass PR review or production deployment pipelines.
Prefer small, composable PRs: one agent outcome per PR for easier review and rollback.
Make agent outputs teach-able: include rationale, caveats, and links to standards so humans learn and trust the system.

Observability by Design¶

Purpose¶

Establish a uniform, always-on telemetry pipeline built on OpenTelemetry for traces, metrics, and logs. Every service, job, and gateway must emit consistent signals enriched with multi-tenant context to enable SLO monitoring, fast incident response, and data-driven improvements. Observability is a non-removable guardrail.

Telemetry Architecture¶

Pipeline (high level)

Instrumentation: .NET OTel SDK in every service (HTTP, gRPC, MassTransit, SQL, Redis, custom spans).
Export: OTLP → Collector (agent/sidecar/daemonset) →
- Traces: Tempo/Jaeger (or Azure Monitor via its OpenTelemetry Distro)
- Metrics: Prometheus (scraped from Collector or services) → Grafana
- Logs: Structured JSON → Loki/Elastic/Azure Monitor Logs
Dashboards & Alerts: Grafana + Alertmanager (or Azure Monitor Alerts).
Correlation: W3C Trace Context headers (traceparent, tracestate) propagated end-to-end (gateway ↔ services ↔ jobs).
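
To make the correlation step concrete, here is a small sketch of parsing and re-emitting a `traceparent` header. It handles only the version `00` layout (`version-traceid-parentid-flags`); in practice the OTel SDK does this for you, and the helper names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext, new_span_id: str) -> str:
    """Build the header to send downstream: same trace, new parent span."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{new_span_id}-{flags}"
```

The trace ID stays constant from gateway to job, while each hop substitutes its own span ID as the parent for the next call.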

flowchart LR
  APP["Apps & Jobs (.NET + OTel)"] --> COL[OTel Collector]
  GATE[Edge Gateway] --> COL
  BUS[Service Bus Consumers] --> COL
  COL --> TRC[(Traces)]
  COL --> MET[(Metrics)]
  COL --> LOGS[(Logs)]
  TRC --> GRAF[Grafana/Tempo]
  MET --> GRAF
  LOGS --> GRAF

Required Attributes (span/log/metric labels)¶

Key	Source	Purpose
`traceId`, `spanId`	OTel	Correlation
`tenantId`	Gateway/Service	Multi-tenant scoping
`edition`	Gateway/Config	Entitlement context
`routeId` / `operation`	Gateway/Service	API and domain op naming
`service.name`, `service.version`	OTel resource	Ownership & rollout correlation
`messaging.system`, `message.type`, `message.key`	MassTransit	EDA correlation & idempotency checks
`db.system`, `db.statement` (redacted)	ADO/EF	Hot query detection
`http.method`, `http.route`, `http.status_code`	ASP.NET Core	API SLOs
`job.name`, `job.idempotencyKey`	Jobs	Job tracing
`agentId` (AI)	AI Orchestration	Agent provenance

PII is never recorded in attributes or logs. Use hashed or tokenized identifiers when needed for joins.
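
One way to produce "hashed or tokenized identifiers" that still join across signals is a keyed hash. The sketch below assumes HMAC-SHA-256 with a secret held outside the telemetry pipeline (for example in Key Vault); the function name, normalization, and truncation length are illustrative assumptions.

```python
import hashlib
import hmac


def tokenize_identifier(value: str, key: bytes, length: int = 16) -> str:
    """Return a stable, non-reversible token for a sensitive identifier.

    The same input and key always give the same token, so spans and logs
    can be joined on it without recording the raw value.
    """
    normalized = value.strip().lower().encode("utf-8")
    mac = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    # Truncation keeps label values short; it trades some collision resistance.
    return mac[:length]
```

A keyed hash is preferred over a plain hash here because low-entropy inputs such as email addresses are easy to guess and re-hash without the key.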

Span & Metric Conventions¶

Span naming

HTTP: http <VERB> <route> (e.g., http GET /api/tenants/{id})
Domain: <context>.<usecase> (e.g., billing.rate-usage)
Messaging: consume <type> / publish <type>
Jobs: job <name> (e.g., job dlq-replay)

Key metrics (Prometheus/OpenTelemetry Metrics)

API: http_server_duration_seconds (histogram), http_requests_total, http_errors_total
Messaging: consumer_lag_seconds, messages_processed_total, consumer_retry_total, dlq_depth
DB: db_client_duration_seconds, db_connections_in_use, deadlocks_total
Cache: cache_hit_ratio, cache_latency_seconds
Jobs: job_duration_seconds, job_failures_total, job_retries_total
SLO helpers: slo_availability_ratio, slo_latency_budget_burn, slo_error_budget_remaining

Logging

Structured JSON with timestamp, level, message, traceId, tenantId, edition, service, operation, exception.type, exception.stack (hashed or summarized), fields{}.
Log levels: Info for state transitions, Warn for transient/backoff, Error for failed business operations, Fatal for process crash.
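
As an illustration of the structured log shape described above, the following sketch uses Python's standard `logging` module rather than the .NET logging stack the templates would actually use. The formatter class is hypothetical and covers only a subset of the listed fields.

```python
import json
import logging
from datetime import datetime, timezone


class JsonLogFormatter(logging.Formatter):
    """Render each record as one JSON object with the required context keys."""

    CONTEXT_FIELDS = ("traceId", "tenantId", "edition", "service", "operation")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Context keys are always present, even when empty, so queries stay uniform.
        for field in self.CONTEXT_FIELDS:
            entry[field] = getattr(record, field, None)
        if record.exc_info and record.exc_info[0] is not None:
            # Only the exception type is emitted; stack text may carry PII.
            entry["exception.type"] = record.exc_info[0].__name__
        return json.dumps(entry, sort_keys=True)
```

Callers would attach context through the `extra` argument, for example `logger.info("tenant activated", extra={"tenantId": "t-123", "traceId": "..."})`.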

SLO Monitoring & Error Budgets¶

Golden paths (examples)

Auth token issuance: p95 ≤ 150 ms, availability ≥ 99.95%
Tenant onboarding (create → active): p95 ≤ 60 s, success rate ≥ 99%
Config evaluation: p95 ≤ 5 ms, availability ≥ 99.99%
Event ingestion to handling lag: p95 ≤ 60 s
Billing subscription activation (saga): p95 ≤ 2 min, failure < 0.5%

Error budget policy

If monthly SLO breaches consume >50% of budget: freeze non-urgent releases, run reliability epics.
Hard guardrails (no budget): Security, Telemetry integrity (missing tenantId/traceId), PII leakage.
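
The error budget policy can be expressed as simple arithmetic. This sketch assumes a request-based availability SLO, where the budget is the number of failures the target allows in the period; the function names are illustrative.

```python
def error_budget_consumed(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    if total == 0:
        return 0.0
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        # A 100% target has no budget: any failure exhausts it.
        return float("inf") if failed else 0.0
    return failed / allowed_failures


def should_freeze_releases(
    slo_target: float, total: int, failed: int, threshold: float = 0.5
) -> bool:
    """Apply the policy: freeze non-urgent releases above 50% consumption."""
    return error_budget_consumed(slo_target, total, failed) > threshold
```

For example, a 99.95% target over 1,000,000 requests allows 500 failures, so 300 failures consume 60% of the budget and would trigger the freeze.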

Example Dashboards (sections)¶

Edge/Gateway
- Requests/sec by routeId, p95/99 latency, 4xx/5xx rates, rate-limit hits, canary weight, ejected backends.
Service Health (per context)
- API duration histogram, dependency latency (DB/Redis/HTTP), error rate, CPU/mem, pod restarts, rolling version mix.
Messaging
- Topic depth, subscription lag, consumer throughput, retry counts, DLQ depth/age, replay outcomes.
Jobs
- Success/failure, duration percentiles, retries, next runs, DLQ replayer status.
Tenancy Overview
- Top tenants by RPS, hottest partitions, promotion candidates, edition distribution, per-tenant error rates.
AI Orchestration
- PR acceptance %, gate pass rate, artifact hash reproducibility, cost usage.

Alerting (examples)¶

Alert	Condition	Severity	Playbook
API latency SLO breach	p95 `http_server_duration_seconds` > SLO for 5m	P1	Scale out, check DB latency, roll back canary
Availability dip (service)	5xx rate > 2% for 10m	P1	Trigger incident, flip traffic to stable, examine error budget
Consumer backlog	`consumer_lag_seconds` > 120s or `dlq_depth` increasing	P1	Scale consumers, inspect DLQ samples, enable backpressure
Missing tenant context	% spans without `tenantId` > 0.01%	P0	Block deployment; fix middleware; postmortem required
Cache miss storm	`cache_hit_ratio` < 70% for 10m	P2	Warm cache, check invalidation loop
Key rotation nearing expiry	cert/key days_to_expiry < 14	P2	Rotate keys, verify JWKS/certs in all environments

Sampling & Cost Controls¶

Traces: start with 10–20% head sampling; add tail-based sampling at the Collector so that 100% of error and slow traces are kept.
Logs: rate-limit info logs with burst controls; keep DEBUG logging disabled in production (enable via a scoped feature flag for time-boxed windows).
Metrics: prefer histograms over raw timings; align bucket bounds with SLOs.
Cardinality hygiene: bound label sets (e.g., truncate user IDs, never include raw emails).
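
A tail-sampling decision of the kind described above might look like the sketch below. It is not the Collector's actual tail-sampling processor; the thresholds, parameter names, and the hash-based baseline are assumptions chosen for illustration.

```python
import hashlib


def keep_trace(
    trace_id: str,
    had_error: bool,
    duration_ms: float,
    slow_threshold_ms: float = 1000.0,
    baseline_rate: float = 0.10,
) -> bool:
    """Decide, once the trace is complete, whether to store it."""
    # Error and slow traces are always kept.
    if had_error or duration_ms >= slow_threshold_ms:
        return True
    # Map the trace ID to a stable bucket in [0, 1) so every collector
    # instance makes the same decision for the same trace.
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest()[:8], 16) / 2**32
    return bucket < baseline_rate
```

Deriving the baseline decision from the trace ID, instead of a random draw, avoids storing half a trace when spans arrive at different collector replicas.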

Instrumentation Checklist (service template defaults)¶

.NET OTel AspNetCore, HttpClient, SqlClient, MassTransit instrumentations enabled.
Correlation middleware ensures tenantId, edition, traceId on all spans/logs.
Health endpoints: /healthz (liveness), /readyz (readiness); export otel.instrumentation.version.
Startup failure telemetry: if app crashes before OTel init, fall back to minimal bootstrap logger writing to stderr with correlation IDs.
Synthetic checks tagged as client=synthetic to avoid skewing user metrics.

Data Safety & Privacy¶

Redaction at source: PII scrubbers for logs; parameter values in SQL statement text are redacted.
Audit alignment: link audit IDs in spans for admin actions; audit remains immutable.
Access control: Observability backends require SSO + RBAC; tenant-scoped views for support, platform-wide for operators.

Failure Modes & Mitigations¶

Failure	Symptom	Mitigation
Missing `tenantId` in spans/logs	Hard to triage multi-tenant issues	Block deployment (CI check); runtime guard rejects requests without tenant context on non-public routes
Trace flood / high cost	Collector pressure, storage bills	Tail-based sampling; dynamic sampling policies; drop noisy internal spans
High cardinality labels	Prometheus OOM, slow queries	Static label allowlist; bounds on IDs; drop/rename labels
Collector outage	Telemetry gaps	Local buffering with retry; secondary collector; alert on exporter failures
Log PII leakage	Compliance risk	PII scanners in CI + runtime; auto-redaction; SEV-0 response

Solution Architect Notes¶

Define SLOs before code; dashboards and alerts are part of the service template and are checked in PRs.
Favor tail-based sampling to keep costs predictable while capturing the right traces.
Tie deployment safety to observability: no public endpoint until OTel signals and dashboards are verified.
Make tenant and edition the first-class dimensions in every query and board—support depends on it.

Security & Threat Modeling¶

Purpose¶

Embed security-first principles into the SaaS platform design. Apply STRIDE threat modeling across all trust boundaries, enforce Zero Trust at edge and service-to-service, manage secrets with rotation, and ensure supply-chain integrity through SBOM and artifact signing. Security is non-negotiable and integrated into CI/CD and runtime.

Threat Model (STRIDE per Boundary)¶

Boundary	Spoofing	Tampering	Repudiation	Information Disclosure	Denial of Service	Elevation of Privilege
Edge / API Gateway	Token forgery, session hijack	Request manipulation	Missing request logs	Sensitive data leakage via headers	Flooding / DDoS	Path traversal, privilege escalation
Service-to-Service	Forged service identity	Malicious message injection	Missing correlation	Overexposed events (tenant mix)	Queue flooding	Overbroad service scopes
Data Stores	Stolen credentials	SQL injection, blob tampering	No audit trails	Unencrypted data at rest	Hot partition overload	Misconfigured RLS, schema promotion abuse
CI/CD & Supply Chain	Build agent spoofing	Artifact tampering	Build logs altered	Secrets leakage in logs	Malicious PR floods pipeline	Malicious dependency injection
Tenant-Facing UIs	Phishing via iframe injection	DOM/XSS	Lack of audit for admin actions	Misconfigured CORS	Brute-force auth	Unscoped RBAC flaws
AI Agents	Prompt injection	Malicious PR diffs	No AI provenance logs	Leakage of sensitive configs	Resource abuse (cost spike)	Agent bypassing guardrails

Zero Trust Principles¶

Authenticate everything: OAuth2/OIDC at edge; mTLS for inter-service; workload identities instead of static secrets.
Authorize explicitly: RBAC/ABAC checks enforced per operation; deny by default.
Audit everywhere: Immutable logs, linked to traceId, tenantId, actor.
Segment aggressively: per-service network policies, per-tenant data isolation (RLS/DB).
Assume breach: Red team simulation, chaos-security drills, SEV-0 for detected cross-tenant leakage.

Secrets & Key Management¶

Azure Key Vault (default) for secrets, keys, certs.
Rotation policies:
- Tokens/keys ≤ 90 days
- Certificates ≤ 1 year (auto-rotate with Key Vault Certificates)
Workload identity: Replace connection strings with AAD Managed Identity or Kubernetes Workload Identity.
Zero secrets in repo: Pre-commit hooks + CI scanners enforce.

Mutual TLS & Service Mesh Posture¶

Service-to-service: All gRPC/HTTP calls secured with mTLS via service mesh (e.g., Linkerd, Istio) or YARP with TLS termination.
Certificates: Issued and rotated automatically by mesh/Key Vault integration.
Trust store: Centralized CA; only platform-issued certs accepted.
Fallback: If mesh unavailable, services must still enforce TLS 1.2/1.3 with pinned certs.

Input Validation & Hardening¶

Gateway: global request validation (size, schema, rate).
Services: contract validation (OpenAPI, JSON Schema), input sanitization.
Databases: use parameterized queries (NHibernate/EF Core), enforce RLS.
Containers: minimal base images, read-only root FS, drop Linux capabilities, seccomp/AppArmor.
Kubernetes: pod security baseline; deny privileged containers.

Supply Chain Security¶

SBOM: generated per build (CycloneDX/Syft).
Dependency scanning: SCA in CI (Dependabot/Renovate + OSS Review Toolkit).
Artifact signing: cosign for container images; verify at deploy.
Provenance: SLSA Level 3 baseline, with attestations for build, source, and dependencies.
Policy gates: block deploy if unsigned or vulnerable artifact.

Security Profile per Component¶

Component	AuthN/AuthZ	Data Security	Audit	Hardening
API Gateway	OAuth2/OIDC, JWT validation, DPoP (optional)	TLS termination	Full request/response logs	WAF rules, DoS protection
Identity Service	OIDC/OAuth2, MFA for admin	SQL TDE, hashed passwords (Argon2id)	Token issuance logs	SCIM support, federation
Tenant Service	Scoped to `tenantId`	RLS enforced	Tenant lifecycle logs	Residency enforcement
Billing	Signed callbacks, PCI DSS zone	Ledger immutability, encryption	Invoice/payment audit	Saga compensations logged
Config Service	RBAC (admin only mutations)	Encrypt secrets/flags	Config change history	Kill switch validation
Usage Service	Signed ingestion events	Partitioned by tenant	Usage adjustment ledger	Throttled ingestion
Notifications	Signed webhook delivery	Encrypt templates (at rest)	Delivery logs	Provider ACL
Audit Service	Append-only	Immutable schema, retention policy	Non-repudiation	Export controls
AI Orchestration	RBAC + agent identity	Redaction in prompts	AI provenance logs	Prompt-injection filters

CI/CD Security Gates¶

Static Analysis (SAST): Roslyn analyzers, SonarQube.
Secrets scanning: GitLeaks/TruffleHog in pipeline.
Dependency scanning (SCA): alerts + fail on critical.
Container scan: Anchore/Grype/Trivy in CI.
Infra as Code scan: Checkov/OPA on Bicep/Pulumi.
PR checks: ADR presence, SBOM generated, cosign verification.

Observability & SLOs (Security)¶

Signals: token issuance latency, failed login rate, RBAC policy evaluation, key rotation lag, SBOM freshness.
SLO targets:
- Token issuance success ≥ 99.95%
- Cross-tenant access violations = 0
- Secrets exposure in repo = 0
- Key/cert expiry incidents = 0
- Critical CVEs unpatched ≤ 7 days

Failure Modes & Mitigations¶

Failure	Symptom	Mitigation
Tenant data leakage	Cross-tenant records in query/event	Block at DAL/middleware; automated test harness; SEV-0 incident response
Expired certs	Service call failures	Auto-rotation with Key Vault; monitor expiry; staged rotation tests
Supply chain attack	Dependency injection of malware	SBOM + provenance; strict registry mirror; signed artifacts only
Prompt injection (AI)	Malicious PR diffs or secret exfiltration attempts	Input sanitizers; denylist filters; tool-call allowlist
Stolen secrets	Credential replay	Rotate keys; enforce workload identity; block static credentials

Solution Architect Notes¶

Treat security violations as SEV-0—no tolerance for cross-tenant leakage or unsigned artifacts.
Bake security gates into templates so generated services are secure-by-default.
Prefer short-lived credentials and workload identity over long-lived static secrets.
Keep attack surface minimal: fewer protocols, minimal images, tight RBAC.
Run quarterly threat model reviews and refresh STRIDE table as system evolves.

Resilience & Reliability Policies¶

Purpose¶

Establish a resilience-first posture to ensure services remain predictable, recoverable, and observable under failures. Standardize timeouts, retries, circuit breakers, bulkheads, and fallbacks across all services, supported by chaos experiments and steady-state SLO validation.

Core Reliability Patterns¶

Pattern	Application	Notes
Timeouts	Every external call (HTTP/gRPC/DB/cache/message broker)	Default 2–5s; domain-specific overrides; no unbounded waits
Retries	Transient failures (network, 429, 5xx)	Exponential backoff + jitter; max attempts ≤ 5; idempotent only
Circuit Breakers	Downstream repeated failures	Half-open after cooldown; reject early to prevent cascades
Bulkheads	Resource partitioning	Thread pool isolation per dependency; partition tenants if noisy
Fallbacks	Non-critical paths (config, cache, search)	Return defaults/stale data; never bypass auth/billing/audit
Graceful Degradation	Feature toggles	Disable non-essential modules to preserve core flows
Idempotency	API + event handlers	Safe retries, deduplication; enforced by outbox/inbox
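
The circuit breaker row can be sketched as a small state machine. This version counts consecutive failures and lets any request through once half-open; a production breaker (for example via a resilience library) would typically use a rolling failure window and limit half-open probes. All names and defaults are illustrative.

```python
import time


class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        # Reject early while open so callers fail fast instead of piling up.
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        if self.state == "half-open":
            # The trial call failed: reopen and restart the cooldown.
            self._opened_at = self._clock()
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

Callers check `allow_request()` before the downstream call and report the outcome afterwards; the injected `clock` keeps the behavior testable without sleeping.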

Policy Matrix¶

Dependency	Timeout	Retry Policy	Circuit Breaker	Bulkhead	Fallback
API Gateway → Service	3s	3x exponential (100ms → 2s)	Open after 5 failures / 30s	Route pool partitioning	Serve cached config/error envelope
Service → DB (SQL)	5s	2x linear (1s)	Open after 3 failures / 10s	Connection pool isolation per tenant	None; fail-fast
Service → Cache (Redis)	2s	3x exponential (50ms → 1s)	Open after 5 failures / 15s	Separate connection pools	Stale read (optional)
Service → Service Bus	5s	5x exponential (100ms → 5s)	Open after 10 failures / 60s	Consumer concurrency limits	Store in outbox for retry
Service → External Provider (Payments, Email)	10s	4x exponential (500ms → 30s)	Open after 5 failures / 60s	Thread pool partition	Retry + DLQ; tenant notified
Jobs (background)	Job-level SLA (e.g., 60s)	Retry up to 3x	Abort if circuit open	Queue partitioning	Reschedule; quarantine tenant batch
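
The retry column above ("3x exponential (100ms → 2s)") can be read as a capped exponential schedule. The sketch below uses the "full jitter" variant, which is one common choice and an assumption here; the helper names are hypothetical, and retries should wrap idempotent operations only.

```python
import random
import time


def backoff_delays(base_s, cap_s, max_retries, rng=None):
    """Delays before each retry: uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    return [
        rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
        for attempt in range(max_retries)
    ]


def call_with_retries(operation, delays, is_transient, sleep=time.sleep):
    """Run operation, retrying transient failures once per entry in delays."""
    for delay in delays:
        try:
            return operation()
        except Exception as exc:
            if not is_transient(exc):
                raise  # permanent errors are never retried
            sleep(delay)
    return operation()  # final attempt; any error propagates to the caller
```

For the gateway row this would be `backoff_delays(0.1, 2.0, 3)`. Jitter spreads retries from many callers so a recovering dependency is not hit by synchronized bursts.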

Chaos Engineering & Failure Injection¶

Goals

Validate that resilience patterns protect SLOs under fault injection.
Prove steady-state system remains within error budget even under chaos.

Scenarios

Network latency: inject 500 ms–2 s delays between services.
Dependency crash: kill DB/read replica; service bus outage.
Message flood: burst 10× normal tenant traffic.
Certificate expiry: simulate expired TLS/mTLS certs.
Cache poisoning: inject stale config flags.
AI agent misuse: agent suggests invalid scaffolding PRs.

Execution

Use Chaos Mesh or LitmusChaos on AKS, or Azure Chaos Studio.
Run during off-peak hours; abort if a critical SLO breach lasts longer than 15 min.
Record metrics: error rate, p95 latency, recovery time.

Resilience Testing & Steady-State SLOs¶

Steady-State Hypothesis: “The system continues to meet its defined SLOs under injected failures.”

Test Harness

Synthetic checks (login, tenant onboarding, subscription activation, config flag evaluation).
Run continuously; inject chaos in staging and periodically in prod (with guardrails).

SLO Verification

Auth token issuance: ≥ 99.95% success under chaos.
Tenant onboarding: p95 ≤ 60s with one DB node offline.
Config evaluation: p95 ≤ 10ms even if Redis down (fallback applies).
Billing saga: compensates gracefully within 2 min if provider unreachable.

Observability Integration¶

OTel spans mark retries, fallback paths, circuit breaker open/half-open.
Metrics: retry_attempts_total, circuit_open_total, fallback_requests_total, bulkhead_rejections_total.
Dashboards: reliability view per service, error budget burn rate, chaos experiment outcomes.
Alerts:
- Circuit breaker open rate > 5% (P1)
- Retry storm > 1000/min (P1)
- Error budget burn > 20% in 24h (P0; freeze new releases)

Failure Modes & Mitigations¶

Failure	Symptom	Mitigation
Retry storm → overload	High CPU, cascading failure	Add jitter; exponential backoff; max retry cap
Circuit breakers too aggressive	False positives; degraded UX	Tune thresholds; log open/close events
Bulkhead misconfig	Resource starvation across tenants	Separate thread pools; enforce quotas
Fallback leakage	Sensitive data exposed in defaults	Redact PII; fallback only to safe defaults
Chaos experiment causes outage	Real users impacted	Run in staging; prod chaos with guardrails and kill switch

Solution Architect Notes¶

Bake resilience policies into templates—engineers shouldn’t reinvent timeouts or retries.
Ensure compensations are explicit, observable, and auditable.
Treat chaos engineering as validation of design assumptions, not an afterthought.
Tie resilience metrics into error budget policies to guide release decisions.

Jobs & Scheduling¶

Purpose¶

Standardize background processing across the platform to handle recurring, delayed, or one-off workloads that do not fit into synchronous request/response or event-driven flows. Ensure jobs are idempotent, observable, and auditable, with UTC-based scheduling and clear operational runbooks for reliability.

Job Frameworks¶

Type	Default Framework	Notes
Short-lived, ad hoc background work	KEDA + Service Bus Queue Consumers	Elastic scaling; event-driven jobs
Recurring & delayed jobs	Hangfire (SQL/Redis backend)	Cron-based recurring jobs; dashboards
Heavy/batch jobs	KEDA scaling with containerized workers	Scale based on queue depth/metrics
Maintenance jobs	Kubernetes CronJobs	Low-frequency (e.g., nightly cleanup, backups)

Principle: Jobs are treated as first-class services with the same tenancy, security, and observability constraints as APIs.

Scheduling Strategy¶

Time Standard: All schedules are defined in UTC; local time zones are not used, which avoids schedule drift (for example, around daylight-saving transitions).
Recurrence:
- Hangfire CRON expressions stored under jobs/schedules/ config.
- Critical jobs documented with frequency, duration SLA, and recovery playbook.
One-off jobs: Triggered via API or CLI, persisted in job store with jobId, tenantId, and status.
Drift prevention: Time sync enforced across nodes (NTP).

Idempotency & Safety¶

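Idempotency key: every job carries a unique jobKey = <jobName>:<scope>:<timestamp> (e.g., invoice:tenant123:2025-10-01), where <timestamp> identifies the scheduled period rather than the execution time, so a retry of the same run produces the same key.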
Retries: capped exponential backoff (max attempts configurable); retries logged with correlation IDs.
Poison jobs: moved to DLQ table with full error context and manual replay tooling.
Concurrency guard: distributed locks (e.g., SQL/Redis) to prevent double-execution of same job key.
Cancellation: jobs respond to cancellation tokens; long-running jobs checkpoint progress.
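
The idempotency key and concurrency guard can be combined as in the sketch below. The in-memory ledger stands in for a SQL/Redis store with an atomic insert-if-absent; class and function names are hypothetical, and a real distributed lock would also need expiry for crashed workers.

```python
from datetime import date


def job_key(job_name: str, scope: str, period: date) -> str:
    """Build <jobName>:<scope>:<timestamp> from the scheduled period."""
    return f"{job_name}:{scope}:{period.isoformat()}"


class InMemoryJobLedger:
    """Stand-in for a shared store that claims a job key atomically."""

    def __init__(self):
        self._status = {}

    def try_start(self, key: str) -> bool:
        if key in self._status:
            return False  # already running or already succeeded
        self._status[key] = "running"
        return True

    def complete(self, key: str) -> None:
        self._status[key] = "succeeded"

    def fail(self, key: str) -> None:
        # Release the claim so a later retry can run the job again.
        self._status.pop(key, None)


def run_once(ledger: InMemoryJobLedger, key: str, work) -> bool:
    """Run work at most once per key; return False if it was skipped."""
    if not ledger.try_start(key):
        return False
    try:
        work()
    except Exception:
        ledger.fail(key)
        raise
    ledger.complete(key)
    return True
```

With this shape, a second worker picking up `invoice:tenant123:2025-10-01` is skipped, while a failed run releases the key for the capped retries.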

Job Categories (examples)¶

Category	Examples	Notes
Operational	Tenant promotion, DB migrations, cache warm-up	High-priority, operator-triggered
Business	Invoicing, subscription renewal, quota resets	Time-critical, audited
Data	Projections rebuild, DLQ replay, report generation	Idempotent; resumable
Notifications	Email/SMS campaigns, retries	Staggered sends to avoid spikes
Maintenance	Cleanup expired sessions, rotate keys, purge logs	Non-urgent; must not impact SLOs

Observability & Telemetry¶

Traces

job.schedule (scheduled at time X)
job.execute (span per execution, includes jobId, jobKey, tenantId)
job.retry and job.fail events

Metrics

job_duration_seconds (histogram per job type)
job_success_total, job_failure_total
job_retries_total
job_scheduled_next{jobName} (gauge, Prometheus style)
dlq_jobs_total, dlq_jobs_age_seconds

Dashboards

Job health view: success/failure rates, retry counts, p95 execution times.
DLQ board: job type, failure reason distribution, replay outcomes.

Operational Runbooks (baseline)¶

Job failed repeatedly
- Inspect DLQ table entry.
- Review trace/log context.
- Fix underlying cause (data, downstream).
- Trigger replay via job API (POST /api/jobs/{id}:replay).
Job backlog growing
- Check KEDA scaling triggers.
- Scale out workers (manual override if needed).
- Verify no partition hot-spot (tenant/queue).
- Clear DLQ separately.
Misconfigured CRON
- Validate against jobs/schedules/ config.
- Ensure UTC alignment.
- Correct CRON expression; redeploy.
Stuck job
- Cancel job via API.
- Mark as failed with checkpoint.
- Restart from last checkpoint.

Failure Modes & Mitigations¶

Failure	Symptom	Mitigation
Duplicate execution	Two workers run same job	Use distributed lock; idempotency key
DLQ growth	Too many poison jobs	Replay tooling; operator alerts
Clock drift	Jobs run at wrong time	Enforce UTC; NTP sync across nodes
Long-running jobs hang	High latency, blocking resources	Cancellation tokens; checkpoints; watchdog alerts
Thundering herd on retries	Spikes after outage	Retry with jitter/backoff; stagger scheduling

Solution Architect Notes¶

No hidden jobs: every job must be defined in config and visible on dashboards.
Always idempotent: jobs must tolerate retries and safe re-execution.
Runbooks mandatory: each recurring job must include operational steps in /ops/runbooks/jobs/.
Chaos test jobs: periodically inject job failures to validate retries and DLQ handling.

Configuration, Feature Flags & Edition Overrides¶

Purpose¶

Standardize configuration, feature flagging, and edition-specific overrides across the SaaS platform using external configuration/flag microservices (not embedded in the platform itself). Ensure safe rollouts, per-edition capability gating, kill-switches, and integration with Azure App Configuration and Key Vault for secrets.

Principles¶

Externalization: all runtime configuration, flags, and edition overlays are retrieved from Config/Feature microservices, not embedded in code.
Separation of concerns:
- Secrets → Key Vault
- Application settings → Azure App Configuration (AppConfig) or external Config service
- Feature flags → Feature Flag Service (edition- and tenant-aware)
Progressive rollout: gradual exposure of new features via percentage, tenant, or edition filters.
Kill-switches: rapid disablement of features at runtime for stability or security.
Edition overlays: baseline features defined in Metadata/Entitlements, overridden dynamically by edition or tenant flags.

Integration Pattern¶

Service startup flow

Service authenticates via workload identity.
Fetches configuration values (non-secret) from AppConfig or Config microservice.
Fetches secrets from Key Vault (connection strings, API keys).
Subscribes to config change notifications (events).
Evaluates feature flags via Flag Evaluation API with context: { tenantId, edition, userId, environment }.
Uses cached flag values with TTL; invalidates on config.flag.updated events.
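
The last step of the startup flow (TTL cache plus event-driven invalidation) could be sketched as follows. `fetch` stands for a call to the Flag Evaluation API and is an assumed callable; the class name and default TTL are illustrative.

```python
import time


class FlagCache:
    """Cache evaluated flags per (flag, tenant) with a TTL."""

    def __init__(self, fetch, ttl_s=30.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # (flag_key, tenant_id) -> (value, expires_at)

    def get(self, flag_key: str, tenant_id: str):
        cache_key = (flag_key, tenant_id)
        hit = self._entries.get(cache_key)
        if hit is not None and hit[1] > self._clock():
            return hit[0]
        # Miss or expired: evaluate remotely and remember the result.
        value = self._fetch(flag_key, tenant_id)
        self._entries[cache_key] = (value, self._clock() + self._ttl_s)
        return value

    def on_flag_updated(self, flag_key: str) -> None:
        """Handle a config.flag.updated event by dropping all entries for the flag."""
        for cache_key in [k for k in self._entries if k[0] == flag_key]:
            del self._entries[cache_key]
```

The TTL bounds staleness if an event is lost, while the event handler keeps propagation fast in the normal case, which matters most for kill-switches.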

Feature Flag Evaluation¶

Flag structure

{
  "flagKey": "betaFeatureX",
  "default": false,
  "rules": [
    { "target": "edition:Enterprise", "value": true },
    { "target": "tenant:t-123", "value": true },
    { "target": "percentage:10", "value": true }
  ],
  "killSwitch": false,
  "updatedAt": "2025-09-29T11:00:00Z"
}

Evaluation precedence

Kill-switch (if true → disabled globally).
Tenant-specific override.
Edition-specific override.
Progressive rollout filters (e.g., percentage, region).
Default value.
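
The precedence order above can be sketched as a single evaluation function over the flag structure shown earlier. The percentage bucketing (a stable hash of flag key and tenant) and every reason string other than `tenant-override` are assumptions made for illustration.

```python
import hashlib


def evaluate_flag(flag: dict, tenant_id: str, edition: str) -> tuple[bool, str]:
    """Return (value, reason) following the documented precedence."""
    # 1. Kill-switch wins over everything.
    if flag.get("killSwitch"):
        return False, "kill-switch"

    # Group rules by target kind: "tenant:t-123" -> kind "tenant", arg "t-123".
    rules = {}
    for rule in flag.get("rules", []):
        kind, _, arg = rule["target"].partition(":")
        rules.setdefault(kind, []).append((arg, rule["value"]))

    # 2. Tenant-specific override.
    for arg, value in rules.get("tenant", []):
        if arg == tenant_id:
            return value, "tenant-override"
    # 3. Edition-specific override.
    for arg, value in rules.get("edition", []):
        if arg == edition:
            return value, "edition-override"
    # 4. Progressive rollout: a stable bucket in 0..99 per (flag, tenant).
    for arg, value in rules.get("percentage", []):
        digest = hashlib.sha256(f"{flag['flagKey']}:{tenant_id}".encode()).hexdigest()
        if int(digest[:8], 16) % 100 < int(arg):
            return value, "percentage-rollout"
    # 5. Default value.
    return flag["default"], "default"
```

Hashing instead of random selection keeps a tenant in the same rollout bucket across requests, so raising the percentage only ever adds tenants.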

Example API (Config Service external)

GET /api/flags/{flagKey}?tenantId=t-123&edition=Standard
200 OK
{
  "flagKey": "betaFeatureX",
  "value": true,
  "reason": "tenant-override"
}

Edition Overlays¶

Baseline: Metadata API defines products, editions, and entitlements.
Overlay: Config/Flag service applies dynamic rules on top (enable/disable features at runtime).
Example:
- Edition: Standard includes Feature A.
- Config overlay disables Feature A for tenant:t-999 (due to contractual restriction).
- Config overlay enables Feature B early for tenant:t-123 as part of beta.

Kill-Switches¶

Definition: global flags (kill:<featureKey>) that force-disable functionality.
Usage: emergency disablement of faulty/new features.
Enforcement: evaluated in gateway and service middleware; takes precedence over tenant/edition overrides.
Auditability: all kill-switch flips logged in Audit service (audit.action.logged.v1).

Rollout Flows¶

Progressive rollout (percentage-based)

Flag created as default=false.
Rule applied: percentage=5% for target feature.
Gradually increased to 25%, 50%, 100%.
Rollout metrics tracked: error rates, adoption, SLO impact.
Kill-switch available at every stage.

Targeted rollout (tenant or edition-based)

Enterprise tenants get early access to new feature.
Beta tenants opt-in by contract.
Free/Standard remain unaffected until GA.

Operational Runbook

Document rollout strategy in /ops/runbooks/flags/<flagKey>.md.
Define: target tenants, monitoring metrics, rollback steps.

Observability¶

Spans: config.fetch, config.evaluate, flag.evaluate.
Metrics:
- config_fetch_latency_seconds (p95 ≤ 50ms)
- flag_eval_latency_seconds (p95 ≤ 5ms, Redis cached)
- config_update_events_total
- flag_killswitch_activated_total
Dashboards: flag evaluation success rate, cache hit ratio, rollout coverage %.
Alerts: kill-switch activation triggers P1 audit/notification; flag evaluation failures > 0.1% → P1 incident.

Security & Tenancy¶

Secrets: only pulled from Key Vault with workload identity; rotated automatically.
Config/Flags: always evaluated with tenantId; anonymous calls disallowed.
Multi-tenant enforcement:
- RLS at config service.
- Tenant isolation in flag definitions.
- Edition scoping enforced at query layer.

Failure Modes & Mitigations¶

Failure	Symptom	Mitigation
Config service outage	Features unavailable / mis-evaluated	Local cache with TTL; stale-while-revalidate
Kill-switch delay	Feature not disabled fast enough	Push invalidation events; force cache bust
Overlapping overrides	Confusing feature state	Precedence rules enforced; audit trail required
Flag drift between envs	Inconsistent tenant experience	Config sync job; environment drift monitor
Secrets leakage	Secrets or PII stored in config values	CI scans; Key Vault only; deny secrets in AppConfig

Solution Architect Notes¶

Treat Config & Feature Flags as external microservices—our SaaS platform integrates with them, but does not own them.
AppConfig + Key Vault are integration points for base values/secrets, but dynamic logic lives in Config Service.
Bake flag evaluation middleware into service templates so every new service is “flag-aware.”
Document rollout/rollback strategies per flag—never release without a defined kill-switch.

API Design Standards & Versioning Policy¶

Purpose¶

Provide factory-wide API conventions so every generated product exposes a predictable, secure, and evolvable interface. External APIs are REST-first; internal service-to-service can additionally use gRPC where efficiency or streaming is beneficial. Contracts are spec-first (OpenAPI/Protobuf) with strong governance, multi-tenant invariants, and clear deprecation windows.

Protocol Posture¶

Audience	Protocol	Usage
External (public)	REST/HTTP	CRUD and task-oriented endpoints; JSON over HTTPS; HATEOAS not required; errors returned as Problem+JSON.
Internal (services)	gRPC (opt)	High-throughput, low-latency calls; streaming; strictly inside the mesh with mTLS.
Async	Events/Webhooks	Event-driven facts via Service Bus; signed webhooks for tenant integrations.

Rule: prefer async events over synchronous cross-context calls. Use gRPC only when you need tight request/response between trusted services.

Versioning Strategy¶

Resource/API versions

URI versioning for REST (public): /v1/..., /v2/....
Content negotiation (optional): Accept: application/vnd.connectsoft.v1+json.
gRPC: package versioning (package billing.v1;) and additive field numbering.

Compatibility rules

Minor, additive changes (new fields) do not bump major.
Breaking changes → new major (v2) and new route set.
Sunset policy: minimum 12 months support for the previous major after v{N+1} GA (Enterprise can contract longer).
Deprecation headers: Deprecation, Sunset, Link: <changelog>; rel="deprecation" emitted by Gateway for deprecated versions.
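
As a rough sketch of the sunset policy, the helper below computes a `Sunset` date 12 months after the successor's GA and formats it as an HTTP date, together with the `Link` header. The function names and the example URL are hypothetical; the `Deprecation` header is omitted because its value format depends on which revision of that specification the Gateway follows.

```python
import calendar
from datetime import datetime, timezone
from email.utils import format_datetime


def add_months(moment: datetime, months: int) -> datetime:
    """Shift by whole calendar months, clamping to the month's last day."""
    index = moment.month - 1 + months
    year, month = moment.year + index // 12, index % 12 + 1
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)


def sunset_headers(successor_ga: datetime, changelog_url: str,
                   support_months: int = 12) -> dict[str, str]:
    """Headers for the previous major; successor_ga must be timezone-aware."""
    sunset = add_months(successor_ga.astimezone(timezone.utc), support_months)
    return {
        "Sunset": format_datetime(sunset, usegmt=True),
        "Link": f'<{changelog_url}>; rel="deprecation"',
    }
```

For a v2 GA on 2025-10-01 (UTC), v1 responses would carry `Sunset: Thu, 01 Oct 2026 00:00:00 GMT`.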

Resource Modeling & Naming¶

Plural nouns for collections: /tenants, /subscriptions, /flags.
Hierarchy only where ownership is strict: /tenants/{tenantId}/subscriptions/{id}.
Actions as colon-suffixed custom methods on the resource (no verb path segments):
- POST /subscriptions/{id}:cancel
- POST /tenants/{id}:promote
Idempotency for unsafe operations: clients send Idempotency-Key (UUID). Server stores outcome for at least 24h.
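
The `Idempotency-Key` behavior might be implemented along these lines. The store is in-memory for illustration; the request fingerprint check (rejecting a reused key with a different payload) is a common refinement and an assumption here, as are all names.

```python
import time


class IdempotencyConflict(Exception):
    """The same key was reused with a different request payload."""


class IdempotencyStore:
    """Remember outcomes per (tenant, key) for a retention window."""

    def __init__(self, retention_s: float = 24 * 3600, clock=time.time):
        self._retention_s = retention_s
        self._clock = clock
        self._outcomes = {}  # (tenant_id, key) -> (fingerprint, outcome, expires_at)

    def execute(self, tenant_id: str, key: str, fingerprint: str, handler):
        """Return (outcome, replayed); handler runs at most once per live key."""
        now = self._clock()
        entry = self._outcomes.get((tenant_id, key))
        if entry is not None and entry[2] > now:
            if entry[0] != fingerprint:
                raise IdempotencyConflict(key)
            return entry[1], True  # replay the stored outcome
        outcome = handler()
        self._outcomes[(tenant_id, key)] = (fingerprint, outcome, now + self._retention_s)
        return outcome, False
```

Scoping the key by tenant keeps one tenant's keys from colliding with another's. A shared store would additionally need an atomic claim so two concurrent requests with the same key cannot both run the handler.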

Pagination, Filtering, Sorting¶

Pagination

Cursor-based by default: ?cursor=<opaque>&limit=50
Response:

{
  "items": [ /*...*/ ],
  "pageInfo": { "nextCursor": "eyJhIjoi...", "hasNextPage": true, "count": 50 }
}
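
A minimal sketch of how an opaque cursor could be produced and consumed follows. It assumes keyset pagination over rows sorted by `id` and a base64url-encoded JSON payload; the `afterId` field and function names are illustrative, and a production cursor would normally be signed or encrypted so clients cannot tamper with it.

```python
import base64
import json


def encode_cursor(position: dict) -> str:
    raw = json.dumps(position, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_cursor(cursor: str) -> dict:
    padded = cursor + "=" * (-len(cursor) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def paginate(rows: list, cursor=None, limit: int = 50) -> dict:
    """Return one page; rows must already be sorted by ascending id."""
    after_id = decode_cursor(cursor)["afterId"] if cursor else None
    remaining = [r for r in rows if after_id is None or r["id"] > after_id]
    page = remaining[:limit]
    has_next = len(remaining) > limit
    return {
        "items": page,
        "pageInfo": {
            "nextCursor": encode_cursor({"afterId": page[-1]["id"]}) if has_next else None,
            "hasNextPage": has_next,
            "count": len(page),
        },
    }
```

Because the cursor encodes a position rather than an offset, pages stay stable when rows are inserted or deleted ahead of the reader.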

Filtering

?filter=field1:eq:value1,field2:lt:10 (simple RSQL-like), or discrete params for common fields.
Disallow filtering on sensitive/PII fields.
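
The filter syntax can be parsed with a field allowlist, which is one way to enforce the rule against filtering on sensitive fields. Only `eq` and `lt` appear in the example above; the other operators and the function name are assumptions, and values containing commas are not handled by this sketch.

```python
OPERATORS = {"eq", "ne", "lt", "le", "gt", "ge"}


def parse_filter(expression: str, allowed_fields: set) -> list:
    """Parse 'field:op:value,field:op:value' into (field, op, value) tuples."""
    clauses = []
    for raw in expression.split(","):
        if not raw:
            continue
        parts = raw.split(":", 2)  # the value itself may contain ':'
        if len(parts) != 3:
            raise ValueError(f"malformed clause: {raw!r}")
        field, operator, value = parts
        if field not in allowed_fields:
            # Sensitive/PII fields are simply absent from the allowlist.
            raise ValueError(f"filtering on {field!r} is not allowed")
        if operator not in OPERATORS:
            raise ValueError(f"unknown operator: {operator!r}")
        clauses.append((field, operator, value))
    return clauses
```

An allowlist fails closed: a newly added PII column is not filterable until someone explicitly permits it.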

Sorting

?sort=+createdAt,-name (stable secondary sort by id).

Total counts

Avoid for hot paths; expose separate HEAD or /stats if needed.

Standard Request & Response Conventions¶

Headers (required/standardized)

Inbound: Authorization: Bearer <JWT>, X-Tenant-Id (edge injects if not provided), X-Request-Id / traceparent.
Rate limits: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
Idempotency: Idempotency-Key (unsafe methods).
Caching: ETag, If-None-Match, Cache-Control for GETs where safe.

Error envelope (Problem+JSON)

{
  "type": "https://docs.connectsoft.cloud/problems/tenant-ambiguous",
  "title": "Tenant context is ambiguous",
  "status": 400,
  "traceId": "00-7e0d...-01",
  "tenantId": "t-123",
  "detail": "X-Tenant-Id header conflicts with token claim",
  "errors": [ { "code": "TENANT_MISMATCH", "path": "$.headers.x-tenant-id" } ]
}

Always include traceId, and when present, tenantId.
Avoid PII in detail.
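
As an illustration of the envelope rules above, a small builder could enforce that traceId is always present and tenantId is included only when known. The function name problem and its parameters are hypothetical; the media type application/problem+json is the one registered for Problem Details.

```python
import json


def problem(type_slug, title, status, trace_id, detail=None, tenant_id=None, errors=None):
    """Build a Problem+JSON response as a (status, headers, body) triple."""
    if not trace_id:
        raise ValueError("traceId is mandatory in every error envelope")
    body = {
        "type": "https://docs.connectsoft.cloud/problems/" + type_slug,
        "title": title,
        "status": status,
        "traceId": trace_id,
    }
    if tenant_id:
        body["tenantId"] = tenant_id  # only when a tenant was resolved
    if detail:
        body["detail"] = detail  # caller is responsible for keeping PII out
    if errors:
        body["errors"] = errors
    headers = {"Content-Type": "application/problem+json"}
    return status, headers, json.dumps(body)
```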

Validation

Use JSON Schema derived from OpenAPI components; reject unknown fields when strict=true flag is enabled (default for internal/gRPC).

Tenancy & Security Invariants¶

Every authenticated request must resolve a single tenant; ambiguous → 400.
Authorization: scope + role check at edge; resource-level ABAC in service.
Data access is tenant-scoped; cross-tenant resources are never addressable by ID alone.
PII hygiene: redact in logs; never echo secrets.
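
The single-tenant invariant can be sketched as a small resolution step at the edge. This is an assumption-laden illustration: the names TenantAmbiguous and resolve_tenant are hypothetical, and it assumes the two sources of tenant context are the X-Tenant-Id header and a tenant claim in the token.

```python
class TenantAmbiguous(Exception):
    """Raised when a request does not resolve to exactly one tenant."""

    status = 400  # mapped to a Problem+JSON 400 response by the caller


def resolve_tenant(header_tenant, claim_tenant):
    """Return the single tenant id, or raise if the sources are missing or conflict."""
    candidates = {t for t in (header_tenant, claim_tenant) if t}
    if len(candidates) != 1:
        raise TenantAmbiguous(
            "expected exactly one tenant, found %d" % len(candidates)
        )
    return candidates.pop()
```

The tenant id is taken only from the header and the token, never from the request payload.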

REST Resource Examples (concise)¶

Tenants

POST   /v1/tenants
GET    /v1/tenants/{tenantId}
PATCH  /v1/tenants/{tenantId}
POST   /v1/tenants/{tenantId}:promote        // isolation level upgrade

Billing

POST   /v1/subscriptions                     // Idempotent; requires Idempotency-Key
GET    /v1/subscriptions/{id}
POST   /v1/invoices/{id}:pay
GET    /v1/subscriptions?cursor=&limit=&sort=

Config & Flags

GET    /v1/flags/{key}?context=user:...      // evaluation; no anonymous
POST   /v1/flags                              // admin only
POST   /v1/overrides                          // tenant/edition overrides

gRPC Internal Standards (optional)¶

Service boundaries mirror REST resources, but optimized for chatty or streaming ops (e.g., usage ingest):
- Package: usage.v1, Service: Ingest with rpc Append(stream Record) returns (AppendAck)
Auth: mTLS + per-RPC auth via authorization metadata; include x-tenant-id.
Backwards compatibility: fields never renumbered; only add new optional fields.

Caching & Conditional Requests¶

Safe GETs return ETag; clients may send If-None-Match → 304 for unchanged resources.
TTL hints via Cache-Control: private, max-age=30 for non-sensitive tenant reads.
Do not cache mutable or sensitive resources (billing actions, identity).
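
A conditional GET along these lines could be handled as sketched below. Deriving the ETag from a hash of the canonical JSON is an assumption (a row version would work equally well), and the helper names are hypothetical. The sketch ignores weak validators and the `*` wildcard for brevity.

```python
import hashlib
import json


def make_etag(resource: dict) -> str:
    """Derive a strong ETag from the canonical JSON form of the resource."""
    canonical = json.dumps(resource, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return '"' + hashlib.sha256(canonical).hexdigest()[:16] + '"'


def conditional_get(resource: dict, if_none_match=None):
    """Return (status, headers, body); body is None on 304 Not Modified."""
    etag = make_etag(resource)
    headers = {"ETag": etag, "Cache-Control": "private, max-age=30"}
    if if_none_match is not None:
        client_tags = [tag.strip() for tag in if_none_match.split(",")]
        if etag in client_tags:
            return 304, headers, None
    return 200, headers, resource
```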

Rate Limiting & Quotas¶

Enforced at gateway; mirrored by services to prevent bypass.
Headers communicate quota state (see above).
429 includes Retry-After and X-RateLimit-Policy (edition/plan identifier).
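
One way to produce these headers is a fixed-window counter per tenant, sketched below. The algorithm choice is an assumption (the document does not name one), as is the interpretation of X-RateLimit-Reset as seconds until the window resets; the class name FixedWindowLimiter is hypothetical.

```python
import math
import time


class FixedWindowLimiter:
    """In-memory fixed-window limiter keyed by tenant (illustrative only)."""

    def __init__(self, limit, policy, window_seconds=60, clock=time.time):
        self.limit = limit
        self.policy = policy  # edition/plan identifier surfaced on 429
        self.window = window_seconds
        self.clock = clock
        self._usage = {}  # tenant_id -> (window_start, used)

    def check(self, tenant_id):
        """Count one request and return (status, headers)."""
        now = self.clock()
        window_start = now - (now % self.window)
        start, used = self._usage.get(tenant_id, (window_start, 0))
        if start != window_start:  # a new window began; reset the counter
            start, used = window_start, 0
        allowed = used < self.limit
        if allowed:
            used += 1
        self._usage[tenant_id] = (start, used)
        reset_in = math.ceil(start + self.window - now)
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(max(self.limit - used, 0)),
            "X-RateLimit-Reset": str(reset_in),
        }
        if not allowed:
            headers["Retry-After"] = str(reset_in)
            headers["X-RateLimit-Policy"] = self.policy
        return (200 if allowed else 429), headers
```

A gateway would back this with a shared store rather than process memory so that all replicas see the same counters.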

Idempotency & Concurrency¶

Unsafe methods (POST, PATCH, DELETE) accept Idempotency-Key; server guarantees exactly-once effect within 24h window.
Optimistic concurrency: If-Match: <ETag> required for updates; stale tag → 412 Precondition Failed.
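
The idempotency guarantee can be sketched with a small outcome store keyed by tenant and Idempotency-Key. The class name IdempotencyStore and the in-memory dictionary are assumptions for illustration; a real implementation would need an atomic insert in a shared store to handle concurrent duplicates.

```python
import time


class IdempotencyStore:
    """Remember the outcome of an unsafe operation for a TTL window."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._outcomes = {}  # (tenant_id, key) -> (stored_at, result)

    def execute(self, tenant_id, key, operation):
        """Run operation once per (tenant, key); replay the stored result afterwards."""
        now = self._clock()
        entry = self._outcomes.get((tenant_id, key))
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # duplicate request: return the original outcome
        result = operation()
        self._outcomes[(tenant_id, key)] = (now, result)
        return result
```

Scoping the key by tenant prevents one tenant's key from colliding with, or revealing, another tenant's outcome.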

OpenAPI & SDKs¶

Source of truth: contracts/openapi/<service>-v{N}.yaml.
Linting rules:
- Must document auth requirements, tenant context, error responses (4xx, 5xx).
- Pagination schema reusable (PageInfo).
- Examples for each operation with tenantId and traceId.
SDK generation: TypeScript/.NET/Python as needed; versioned per API major.

Deprecation & Sunset Workflow¶

Mark endpoints/fields as deprecated in OpenAPI (deprecated: true) and docs.
Notify via tenant/admin changelog events and status page.
Emit deprecation headers from edge for affected routes.
Observe usage via logs/metrics per version.
Remove only after window closes and usage < threshold (e.g., <1% over 30 days).

Observability & SLOs¶

Traces

Span names: http <VERB> <route>; attributes: tenantId, edition, routeId, version, idempotencyKey (when present).

Metrics (per route & version)

http_requests_total, http_server_duration_seconds (histogram), http_4xx/5xx_total, rate_limit_hits_total.

Baselines

p95 read ≤ 200 ms, write ≤ 350 ms; success ≥ 99.9%; schema/contract lint pass 100% in CI.

Risks & Trade-offs¶

Risk/Decision	Trade-off	Mitigation
URI versioning proliferation	More routes to maintain	Contract governance, codegen, and routing templates
Cursor pagination complexity	Harder for simple clients	Provide helper SDKs and compatibility with `offset`
Strict idempotency semantics	Server storage for keys/outcomes	TTL window; compact storage; GC jobs
Long deprecation windows	Slower platform evolution	Telemetry-driven removal; enterprise opt-in support
Dual REST/gRPC posture	Two stacks to maintain internally	Use gRPC selectively; concentrate shared interceptors

Solution Architect Notes¶

Treat contracts as code: PRs that change APIs must update OpenAPI/Protobuf, examples, and consumer contract tests.
Keep tenant context explicit; never infer from payloads alone.
Prefer event notifications (webhooks/events) to polling—then design REST for discovery & control, not data streaming.
Document SLOs and error budgets at the operation level for critical routes (auth, onboarding, billing).

Webhooks & Extensibility¶

Purpose¶

Provide a generic, reusable webhook capability for all SaaS solutions produced by the factory. Webhooks extend the platform to tenant systems and partner integrations without tight coupling, using signed deliveries, durable retries with ageing, idempotency, replay, and a self-service management API. The design is multi-tenant, edition-aware, and can be embedded as a shared microservice in other platforms.

Architecture Overview¶

Components

Webhook Manager (API): Tenant-scoped CRUD for subscriptions, secrets, filters, delivery policies, and replay requests.
Webhook Dispatcher (Worker): Consumes domain events, applies routing rules, signs payloads, and performs resilient delivery with backoff and DLQ.
Subscription Store: Tenant-scoped definitions (endpoint URL, secret, filters, version, status).
Delivery Store: Durable log of attempts/outcomes for replay, auditing, and idempotency.
Schema Catalog: Canonical schemas (by event type/version) exposed for consumer validation/generation.

Event Flow

Domain services publish events to the event bus.
Dispatcher projects them against active subscriptions (per tenant) and enqueues delivery jobs.
Delivery attempts use HMAC-signed HTTP calls; failures retry with exponential backoff + jitter until age limit reached.
After max age/attempts, jobs are moved to DLQ; operators or tenants can replay.

flowchart LR
  E["Domain Events (Service Bus)"] --> D[Webhook Dispatcher]
  D -->|match filters| Q[Delivery Queue]
  Q -->|http POST| R[Receiver Endpoint]
  R -->|2xx| DS[Delivery Store]
  R -->|>=400| Q
  D -->|exhausted| DLQ[Dead-Letter Queue]
  M[Webhook Manager API] --> S[Subscription Store]
  M --> DS
  M -->|replay| Q


Delivery Contract¶

HTTP Method: POST
Content-Type: application/json (UTF-8)
Headers (required):

Webhook-Id: stable UUID for the subscription (not the tenant)
Webhook-Event-Id: unique event id (UUID), identical across retry attempts so receivers can de-duplicate
Webhook-Event-Type: canonical type (e.g., tenant.created.v1)
Webhook-Event-Time: RFC3339 timestamp
Webhook-Signature: sha256=<hex> (HMAC over {timestamp}.{body})
Webhook-Timestamp: UNIX seconds used in signature
Webhook-Retry-Count: attempt number
Webhook-Tenant-Id: source tenant
Traceparent: W3C trace context for end-to-end correlation

Payload (envelope):

{
  "id": "9f4a2d64-7c8a-4a25-9a0a-0f0e9b5dcb31",
  "type": "tenant.created.v1",
  "occurredAt": "2025-09-29T09:30:00Z",
  "tenantId": "t-123",
  "schemaVersion": "1",
  "data": {
    "name": "Acme",
    "edition": "Standard"
  }
}

Security

Signature secret is per subscription and rotates without downtime (see Rotation).
Receivers must validate timestamp freshness (e.g., ≤ 5 minutes drift) and recompute HMAC.
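
Receiver-side verification can be sketched as follows, assuming the signed message is the Webhook-Timestamp value, a literal dot, and the raw request body, as the header description states. The function names sign and verify are hypothetical.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """Compute the Webhook-Signature value over '{timestamp}.{body}'."""
    message = timestamp.encode("ascii") + b"." + body
    return "sha256=" + hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, headers: dict, body: bytes, tolerance_seconds=300, now=None) -> bool:
    """Check timestamp freshness, then compare signatures in constant time."""
    timestamp = headers.get("Webhook-Timestamp", "")
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    current = time.time() if now is None else now
    if abs(current - sent_at) > tolerance_seconds:
        return False  # too old or too far in the future: possible replay
    expected = sign(secret, timestamp, body)
    received = headers.get("Webhook-Signature", "")
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

The body must be the raw bytes as received; re-serializing parsed JSON before hashing can change whitespace or key order and break the comparison.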

Subscription Model (multi-tenant)¶

Field	Description
`subscriptionId`	Stable UUID
`tenantId`	Owner tenant (scopes visibility & quota)
`endpointUrl`	HTTPS only; optional mTLS
`secret`	HMAC secret (active)
`nextSecret`	Staged secret for rotation
`enabled`	`true/false`
`eventTypes`	Allow-list (supports wildcards: `billing.*`)
`filters`	Attribute predicates (e.g., `edition in ['Enterprise']`)
`deliveryPolicy`	Retries/backoff, timeouts, concurrency, ageing window
`rateLimit`	Per-subscription throttle (req/min)
`version`	Schema version preference (default: latest compatible)
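
Matching an event against a subscription's eventTypes allow-list could look like the sketch below. Using shell-style wildcards via fnmatch is an assumption about how patterns such as billing.* are interpreted; the function name matches is hypothetical, and the filters field is left out for brevity.

```python
from fnmatch import fnmatchcase


def matches(subscription: dict, event: dict) -> bool:
    """True when the event belongs to the subscription's tenant and allow-list."""
    if not subscription.get("enabled", True):
        return False
    if subscription["tenantId"] != event["tenantId"]:
        return False  # subscriptions never see another tenant's events
    return any(
        fnmatchcase(event["type"], pattern) for pattern in subscription["eventTypes"]
    )
```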

Delivery Semantics¶

At-least-once. Receivers must be idempotent.
Idempotency: Webhook-Event-Id serves as the idempotency key; consumers should persist outcomes keyed by it.
Retries: Exponential backoff with jitter (e.g., 1s, 4s, 16s, 64s, … up to policy cap).
Ageing: Stop retrying when age limit reached (default 24h); event moved to DLQ.
Replay: Tenants/operators can request targeted replays by eventId, time window, or filter (respects age & legal constraints).
Ordering: Not guaranteed across topics; best-effort per subscription if the receiver returns promptly. For strict order, recommend event-sourced consumer pattern or per-aggregate subscriptions.
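
The retry and ageing rules can be sketched as a delay schedule. The base, growth factor, cap, and jitter range below are assumptions chosen to reproduce the 1s, 4s, 16s, 64s example; backoff_delays is a hypothetical helper.

```python
import random


def backoff_delays(base=1.0, factor=4.0, cap=3600.0, max_attempts=10,
                   max_age=24 * 3600, rng=random.random):
    """Return the retry delays (seconds) until attempts or the age limit run out."""
    delays, elapsed = [], 0.0
    for attempt in range(max_attempts):
        ceiling = min(cap, base * factor ** attempt)
        # Jitter: pick a delay between 50% and 100% of the ceiling.
        delay = ceiling * (0.5 + 0.5 * rng())
        if elapsed + delay > max_age:
            break  # ageing window exceeded: the job goes to the DLQ
        delays.append(delay)
        elapsed += delay
    return delays
```

With the jitter source fixed at 1.0 the first four delays are 1, 4, 16, and 64 seconds, after which the cap limits each delay to one hour.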

Webhook Management API (examples)¶

Create subscription

POST /v1/webhooks/subscriptions
{
  "endpointUrl": "https://acme.example.com/hooks",
  "eventTypes": ["tenant.created.v1", "billing.subscription.*"],
  "filters": ["edition in ['Enterprise','Standard']"],
  "deliveryPolicy": { "timeoutMs": 5000, "maxAttempts": 10, "maxAge": "24h" },
  "rateLimit": { "rpm": 600 }
}

Rotate secret (staged)

POST /v1/webhooks/subscriptions/{id}:rotate
// Generates nextSecret; both secrets valid for N hours

Replay

POST /v1/webhooks/deliveries:replay
{
  "subscriptionId": "…",
  "from": "2025-09-28T00:00:00Z",
  "to":   "2025-09-28T23:59:59Z",
  "filters": ["type like 'billing.%'"]
}

List deliveries

GET /v1/webhooks/deliveries?subscriptionId=...&cursor=&limit=

Security Model¶

HTTPS required, optional mTLS for high-trust partners.
HMAC-SHA256 signatures; body canonicalized as raw bytes.
Secret Rotation: dual-secret window (active + next). Webhook-Signature includes key id when dual-valid to disambiguate.
IP allow-lists (optional) and DNS pinning (resolve at delivery) to reduce SSRF risks.
Payload hygiene: No PII unless explicitly enabled by tenant policy; sensitive fields can be tokenized.

Edition & Quota Integration¶

Free: limited subscriptions (e.g., 1), low RPM, short retention.
Standard: moderate RPM, standard retention, replay allowed.
Enterprise: higher RPM/concurrency, longer retention, mTLS support, dedicated IPs.
Custom: configurable caps per product/contract.

Quota enforcement happens in Webhook Manager and Dispatcher; gateway surfaces headers:

X-WebhookRate-Limit, X-WebhookRate-Remaining, X-WebhookRate-Reset.

Observability¶

Tracing: Each delivery attempt creates a span: webhook.deliver; attributes: tenantId, subscriptionId, endpointHost, attempt, statusCode, latencyMs.
Metrics:
- webhook_deliveries_total{status=2xx/4xx/5xx}
- webhook_retry_attempts_total
- webhook_latency_ms (histogram)
- webhook_signature_failures_total
- webhook_replay_requests_total
Logs: Structured, with redacted URLs/secrets, and correlation to source domain event via sourceEventId.

Failure Modes & Playbooks¶

Failure	Symptom	Action
Signature mismatch	401/403 at receiver	Verify timestamp skew, secret, canonicalization; rotate secret if suspected leak
Persistent 5xx	DLQ growth	Backoff increase, notify tenant, open incident with receiver owner
Slow receiver	Timeouts, high latency	Apply per-subscription timeout; recommend receiver queueing
Endpoint drift	NXDOMAIN/SSL error	Suspend subscription; alert admin; require endpoint verification
Replay storm	Burst load on receiver	Rate limit replays; allow windowed replay; communicate schedule

Consumer Guidance (best practices)¶

Verify signatures and timestamp freshness before processing.
Idempotent handlers using Webhook-Event-Id; store processed IDs for TTL ≥ ageing window.
Respond fast (≤ 2s) and offload heavy work to your own queue.
Return 2xx only when processing is safely enqueued; otherwise 4xx/5xx to trigger retry.
Keep endpoints stable; use staged rotation flows for secrets and URL changes.

Extensibility & Reusability¶

The service is product-agnostic: event routing is driven by type, filters, and tenant policy, not by hardcoded domains.
Supports multiple schema versions per event type; consumers pick via subscription preference.
Pluggable signing providers (HMAC default; optional asymmetric signing for advanced partners).
Integrates easily into other platforms: expose Manager API, Dispatcher wired to their bus, and reuse Schema Catalog.

Solution Architect Notes¶

Keep the Dispatcher stateless; rely on Delivery Store for idempotency and state.
Treat webhooks as untrusted egress: validate destinations, cap concurrency, and isolate network egress where possible.
Always surface tenant and subscription context in traces and logs to debug quickly.
Encourage partners to adopt schema validation using the exposed Schema Catalog, and provide SDKs where adoption is strategic.

Admin Console & Tenant Portal¶

Purpose¶

Offer a reusable, multi-tenant UI that any generated SaaS product can embed or extend. The Admin Console and Tenant Portal provide self-service administration, usage/metering visibility, feature toggle management, audit exploration, and embedded observability. They enforce RBAC/ABAC, respect edition entitlements, and present a consistent UX across products.

Audience & UX Principles¶

Tenant Administrator: configure identity/SSO, manage users/roles, set flags/overrides, view invoices and audit logs.
Billing & Finance Operator: review subscriptions, invoices, usage reports.
Support Agent: read-only diagnostics, tenant health, and audit traces (PII-redacted).
End User: minimal profile/settings; app-specific modules only when permitted.
Platform Operator (SRE): operator console views (environment health, tenant routing, migration status).

UX principles: least privilege, edition-aware UI (hide/disable), zero-trust by default (no local admin bypass), explainability (why a control is disabled), fast path to observability (one click to traces/logs filtered by tenant).

UI Architecture & Embedding¶

Composition approach

Shell + Modules architecture (microfrontends optional). Shared shell provides auth, navigation, theming, telemetry, and locale.
Module types:
- Core Modules (always present): Tenant Management, Users & Roles, Usage & Billing, Feature Flags & Overrides, Audit Explorer, Webhooks, Security & SSO, Observability.
- Product Modules (pluggable): domain-specific pages contributed by each bounded context (e.g., Projects, Catalog).

Embedding patterns

Standalone (hosted by the platform) or Embedded (iframe/Web Component) inside product portals.
RBAC-aware Navigation: menu items and routes emitted by module manifests and filtered by role/entitlements.

flowchart LR
  Shell[Console Shell] --> Auth[OIDC Auth]
  Shell --> Nav[RBAC Navigation]
  Shell --> Telemetry[OTel Web SDK]
  Shell --> Core[Core Modules]
  Shell --> Product[Product Modules]

  Core --> Tenants[Tenant Management]
  Core --> Users[Users & Roles]
  Core --> Flags[Feature Flags & Overrides]
  Core --> Billing[Usage & Billing]
  Core --> Audit[Audit Explorer]
  Core --> Hooks[Webhooks]
  Core --> SSO[Security & SSO]
  Core --> Obs[Embedded Observability]


RBAC Mapping (illustrative)¶

Area / Page	tenant_admin	billing_admin	support_agent	member	operator
Tenant Profile & Residency	✓	–	✓ (read)	–	✓
Users & Roles	✓	–	–	–	✓ (read)
Feature Flags & Overrides	✓	–	–	–	✓ (read)
Usage & Billing	✓ (read)	✓	✓ (read)	–	✓ (read)
Audit Explorer	✓ (PII-masked)	–	✓ (PII-masked)	–	✓
Webhooks	✓	–	–	–	✓ (read)
Security & SSO	✓	–	–	–	✓ (read)
Embedded Observability	✓	–	✓ (read)	–	✓
Product Modules (domain)	✓	–	read if granted	read if granted	✓ (read)

ABAC examples: region constraint for PII views (tenant.region == user.region), data-class gates for audit payloads.

Key Modules & Responsibilities¶

Tenant Management

View/update tenant profile, residency, isolation level (with guardrails).
Lifecycle actions: suspend/resume, export, scheduled deletion workflow.
Connection/partition info (read-only), migration status.

Users & Roles

Invite users, assign roles, JIT provisioning visibility (SCIM for Enterprise).
Session overview, forced logout, MFA settings (when supported).

Feature Flags & Overrides

Browse flags by context; evaluate flag in a what-if panel (tenant/user/edition).
Create tenant overrides with audit reason; kill-switches surfaced prominently.
Safe rollout helpers (percentage/targeting where applicable).

Usage & Billing

Usage charts (requests, storage, compute) with cursor pagination for raw events.
Subscription plan, invoice history, payment method (link-out to provider portal via ACL).
Quotas, rate limits; “close to limit” alerts and upgrade CTAs (edition-aware).

Audit Explorer

Search by actor, action, resource, traceId, and date window.
PII-aware rendering (tokenized values; reveal requires extra privilege).
Export signed bundles (JSON/CSV) with chain of custody metadata.

Webhooks

CRUD subscriptions, rotate secrets (dual-key), replay deliveries (windowed).
Delivery attempts table, filter by status code, trace to source event.

Security & SSO

Configure SSO (OIDC/SAML), SCIM endpoints, conditional access toggles (where provided via federation).
API keys/PATs (personal access tokens), if allowed, with strict scopes and expirations.

Embedded Observability

Embed pre-filtered dashboards (tenantId, edition).
Trace search (“Follow a request”) and log tail with PII redaction.
Error budget widgets aligned with tenant edition.

Data & Integration Flows¶

sequenceDiagram
  participant UI as Admin Console
  participant GW as API Gateway
  participant TEN as Tenant Service
  participant CONF as Config/Flags
  participant BILL as Billing
  participant AUD as Audit
  participant OBS as Observability

  UI->>GW: GET /v1/tenants/{id}
  GW->>TEN: AuthZ + fetch tenant
  TEN-->>GW: tenant profile (+residency, isolation)
  GW-->>UI: profile

  UI->>GW: GET /v1/flags?tenantId=...
  GW->>CONF: list flags + overrides
  CONF-->>GW: flags
  GW-->>UI: flags

  UI->>GW: GET /v1/usage?window=30d
  GW->>BILL: aggregate usage
  BILL-->>GW: usage series
  GW-->>UI: charts

  UI->>GW: GET /v1/audit?traceId=...
  GW->>AUD: query
  AUD-->>GW: results (PII-masked)
  GW-->>UI: render

  UI->>OBS: iframe embed with signed view token (tenant filter)


Tenancy, Security & Privacy¶

Tenant context mandatory: every UI call carries X-Tenant-Id; ambiguous → blocked at edge.
Edition awareness: modules/features hidden or disabled when not entitled; tooltips explain required edition.
PII hygiene: redaction by default; View Sensitive Data requires elevated role + just-in-time approval with audit.
Audit everything: admin actions, flag changes, replay requests, SSO changes—all emit audit events with actor and reason.
Session security: short-lived access tokens with silent refresh; session fixation and clickjacking protections; CSP locked to allowed origins.

Observability & SLOs¶

Front-end telemetry: OTel Web SDK emits ui.action and ui.view spans; correlates with backend via traceparent.
User-centric SLOs
- Console availability ≥ 99.9%
- p95 nav-to-data render ≤ 500 ms (cached) / 1500 ms (cold)
- Feature toggle apply-to-effective latency ≤ 10 s
Dashboards: per-tenant/module latency, error rates, UI Web Vitals (LCP/CLS), and admin risky-action counters.

Accessibility, Internationalization & Theming¶

Accessibility: WCAG 2.1 AA baseline; keyboard-only navigation; screen-reader labels on critical controls.
i18n: message catalogs; locale from user profile; number/date formatting per locale.
Theming: light/dark + tenant branding (logo/accent); ensure contrast ratios remain compliant.

Performance & Resilience¶

Code splitting per module; prefetch on likely nav paths.
Client-side caching with ETags; optimistic updates for non-critical settings.
Graceful degradation: if observability embedding is down, show cached summaries with a banner.

Extension Points¶

Module Manifest: JSON that declares routes, required scopes, and nav entries; the shell loads and gates them.
Action Hooks: before/after save for flags, webhooks, and SSO to inject product-specific validation.
Deep Links: shareable URLs encoding tenant context and filters (/audit?tenant=t-123&trace=...).
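
How the shell might gate manifest entries is sketched below. The manifest shape, the scope names, and the function visible_nav are all hypothetical; the document only states that a manifest declares routes, required scopes, and nav entries.

```python
# Hypothetical manifest contributed by a module; field names are illustrative.
MANIFEST = {
    "module": "webhooks",
    "routes": [
        {"path": "/webhooks", "label": "Webhooks", "requiredScopes": ["webhooks.read"]},
        {
            "path": "/webhooks/replay",
            "label": "Replay deliveries",
            "requiredScopes": ["webhooks.read", "webhooks.replay"],
        },
    ],
}


def visible_nav(manifest: dict, granted_scopes) -> list:
    """Return the nav labels whose required scopes are all granted to the user."""
    granted = set(granted_scopes)
    return [
        route["label"]
        for route in manifest["routes"]
        if set(route["requiredScopes"]) <= granted
    ]
```

Hiding an entry is a UX convenience only; the APIs behind each route still enforce the same scopes.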

Solution Architect Notes¶

Keep policy enforcement at API boundaries; the UI only reflects decisions, it doesn’t make them.
Treat Feature Flags as UX surface for controlled rollout; every toggle change must be observable and reversible.
Make embedded observability tenant-safe: scoped tokens, short TTL, and server-side filtering.

Data Lifecycle, Privacy & Compliance¶

Purpose¶

Define how data is classified, minimized, protected, retained, and erased across the platform. Establish privacy-by-design controls, tenant-aware retention/TTL, encryption standards, data-subject request (DSR) flows, and auditability. Provide compliance overlays (e.g., GDPR, HIPAA, SOC 2) that products can enable per tenant/edition or per industry pack—without code changes.

Data Classification¶

Class	Examples	Handling Rules	Storage Defaults	Access
Public	Docs, status messages	Cacheable; no PII	Blob (public-read via signed URLs if needed)	Unrestricted
Internal	Feature metadata, templates	No external exposure; redact in logs	SQL/Blob	Staff with role gating
Confidential	Tenant config, usage metrics	TLS in transit, TDE at rest; masked in logs	SQL + Redis cache (TTL)	Tenant-scoped; RBAC/ABAC
PII	Name, email, IP, user identifiers	Minimize, pseudonymize where possible; strict audit	SQL (column-level protection)	Least privilege, purpose-bound
Sensitive PII	SSN, passport, health data	Tokenize or encrypt at field level; strong purpose limits	SQL with field-level encryption	Restricted roles; extra approvals
Secrets	API keys, webhook secrets	Never in logs; KV only; rotate ≤ 90d	Key Vault, no app DB	Break-glass only
Audit/Legal	Audit events, access logs	Append-only; WORM retention; exportable	SQL/Blob (immutability options)	Operators/legal with controls

Principles: collect the minimum, retain the minimum, process for stated purposes only, never log raw PII or secrets.

Data Minimization & Privacy Patterns¶

Pseudonymization: replace direct identifiers with stable tokens; keep mapping in a protected table or vault.
Selective collection: edition/pack toggles gate collection of optional attributes (e.g., precise location).
PII scrubbing in telemetry: centralized log processors reject events containing PII markers; drop or tokenize.
Scoped queries: APIs expose aggregated/filtered views by default; raw exports require elevated role + audit reason.
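
Pseudonymization and telemetry scrubbing can be illustrated with a keyed hash. Note the swap: this sketch derives stable tokens with HMAC-SHA256 instead of the mapping table described above, so tokens are not reversible without a separate lookup. The email pattern is a simplistic stand-in for a real PII detector, and the function names are hypothetical.

```python
import hashlib
import hmac
import re

# Simplistic email matcher used only for illustration.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value: str, key: bytes) -> str:
    """Map an identifier to a stable token; the same input and key give the same token."""
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return "tok_" + digest[:24]


def scrub(message: str, key: bytes) -> str:
    """Replace email addresses in a log message with their tokens."""
    return EMAIL.sub(lambda m: pseudonymize(m.group(0), key), message)
```

Because tokens are stable per key, log lines about the same user can still be correlated without exposing the address itself.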

Retention & TTL Policies (defaults)¶

Dataset	Default TTL/Retention	Disposal Method	Notes
Tenant profile/config	Active + 90 days after deletion	Hard-delete after grace	Soft-delete window for recovery
Usage raw events	90 days	Roll-up + delete raw	Keep aggregates 13 months
Audit log	7 years (configurable)	WORM archival then purge	Legal hold overrides TTL
Webhook deliveries	30 days	Purge	Enterprise can extend (e.g., 90 days)
Job histories	30 days	Purge	Metrics retained separately
Backups	35 days rolling	Expire snapshots	Encrypted backups only

Retention can be overridden by tenant contract (e.g., healthcare pack) and by regulatory overlays. All overrides are traceable.
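
A purge job could apply these defaults as sketched below, with legal hold and contract overrides taking precedence. The dataset keys and the function is_expired are hypothetical names; the day counts come from the table above.

```python
from datetime import datetime, timedelta, timezone

# Defaults taken from the retention table; keys are illustrative identifiers.
RETENTION_DAYS = {"usage_raw": 90, "webhook_deliveries": 30, "job_histories": 30}


def is_expired(dataset, created_at, now, legal_hold=False, override_days=None):
    """Decide whether a record is past retention and may be purged."""
    if legal_hold:
        return False  # legal hold overrides any TTL
    days = override_days if override_days is not None else RETENTION_DAYS[dataset]
    return now - created_at >= timedelta(days=days)
```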

Encryption & Key Management¶

In transit: TLS 1.3 everywhere; mTLS inside the cluster/mesh.
At rest: TDE for SQL/Blob; CMEK (customer-managed) optional per tenant/region.
Field-level: Sensitive PII encrypted with per-field keys or envelope encryption; keys in Key Vault.
Rotation: keys/secrets rotated ≤ 90d; certs tracked with expiry alerts; dual-key windows during rotation.
No plaintext secrets in app configs or logs; secrets read at runtime via workload identity.

DSR (Data Subject Requests) & Erasure Flows¶

Supported requests: access/export, rectification, deletion (erasure), restriction (hold), portability.

sequenceDiagram
  participant R as Requestor (Tenant Admin/User)
  participant AC as Admin Console
  participant GW as Gateway
  participant DSR as DSR Orchestrator
  participant SVC as Domain Services
  participant AUD as Audit

  R->>AC: Submit DSR (access/erasure)
  AC->>GW: POST /v1/dsr
  GW->>DSR: create_case(tenantId, subjectId, type)
  DSR->>SVC: fanout(workflow tasks per context)
  SVC-->>DSR: status/proofs (export bundle, erasure markers)
  DSR-->>AC: progress events
  DSR->>AUD: record actions (who/what/when/why)
  AC-->>R: Completion & evidence package


Erasure rules

Soft-delete first, then scheduled hard-delete after grace window unless legal hold.
Propagate to caches, search indexes, and derived stores; rebuild affected aggregates.
Audit trail remains (immutable), but direct identifiers are tokenized or replaced with non-reversible references.

Compliance Overlays (enable per tenant/product)¶

Overlay	Scope	Key Controls
GDPR	EEA/EU residents	DSR flows, consent records, data minimization, cross-border transfer safeguards
HIPAA	PHI in US healthcare	BAAs, audit controls, access logs, encryption, breach notifications
SOC 2	Trust Services Criteria	Change management, access control, incident response, monitoring
PCI DSS	Payments	No PAN storage or tokenize; provider ACL; quarterly scans
Data Residency	Contractual	Region-locked storage and processing, DR within region pair only

Overlays are policy bundles (configs + CI checks + runtime gates) and can be attached by edition/industry pack.

Policy Enforcement Points¶

Layer	Enforcement
Edge/Gateway	Consent headers, tenant residency checks, geo-fencing, edition policy hints
Service APIs	RBAC/ABAC checks for data classes; PII-safe serializers; explicit export endpoints
Domain Logic	Purpose-bound processing; blockers for collecting unnecessary attributes
Repositories	RLS/tenant guards; column protection; query allow-lists
Pipelines/CI	Static scans for PII in code/tests; schema linters; policy-as-code
Observability	PII detectors in logs; trace attributes marking data class; audit correlations

Auditability & Evidence¶

Immutable audit: append-only events for access, changes, exports, erasures, consent updates (actor, reason, traceId).
Evidence bundles: machine-readable (JSON) + human-readable (PDF/CSV) export for DSR/compliance audits.
Chain of custody: signed exports with hash manifest; storage with retention & legal hold.
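
A hash manifest for an evidence bundle might be built as follows. This is a sketch under assumptions: HMAC-SHA256 stands in for whatever signing scheme the platform actually uses (an asymmetric signature would let auditors verify without the key), and the function names are hypothetical.

```python
import hashlib
import hmac
import json


def build_manifest(files: dict, signing_key: bytes) -> dict:
    """Hash every file in the bundle and sign the sorted list of hashes."""
    entries = {
        name: hashlib.sha256(data).hexdigest() for name, data in sorted(files.items())
    }
    payload = json.dumps(entries, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return {"files": entries, "signature": signature}


def verify_manifest(files: dict, manifest: dict, signing_key: bytes) -> bool:
    """Recompute the manifest and compare both hashes and signature."""
    expected = build_manifest(files, signing_key)
    same_signature = hmac.compare_digest(expected["signature"], manifest["signature"])
    return same_signature and expected["files"] == manifest["files"]
```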

Consent records: versioned, tenant-scoped; changes audited with reason.
Notices: data use and tracking disclosures in product UIs; link to privacy policy versions.
Telemetry consent: opt-in/opt-out flags where legally required; defaults by region.

Observability & SLOs (privacy-centric)¶

Signal	Target	Notes
PII-in-logs detections	0	Fail CI and alert in prod
DSR completion time (access/erase)	≤ 30 days (legal max), target ≤ 7 days	Per tenant
Key/cert rotation on time	100% within policy	Alerts at 15/7/3 days pre-expiry
Residency violations	0	Block at edge; SEV-1 if detected
Audit write success	99.99%	Buffer & backpressure during spikes

Failure Modes & Mitigations¶

Leak of PII in logs → Real-time detector trips; block sink; open incident; purge/redact; root-cause and tests added.
Erasure partial → Re-run workflow on failed contexts; maintain compensation tasks for reindexing/caches.
Residency drift (cross-region write) → Gate at edge; quarantine records; migrate back and audit.
Key rotation failure → Fall back to previous valid key in dual window; open incident; rotate manually.

Solution Architect Notes¶

Treat classification tags as part of the domain schema; they travel with events and telemetry.
Default all products to minimal collection; require explicit ADR for any expansion in data scope.
Keep privacy enforcement in code (not just policy docs): dedicated serializers, DTOs without PII by default, and safe logging wrappers.
Build DSR orchestration once and reuse across all contexts; success depends on consistent IDs and comprehensive lineage of derived data.

Quality Strategy: Testing & Verification¶

Purpose¶

Ensure that every SaaS solution generated by the factory meets a consistent quality bar before release. Define a layered testing strategy (unit, integration, contract, end-to-end, testing-in-prod), establish standards for fixtures and synthetic checks, and enforce quality gates in CI/CD pipelines. Quality must be measurable, automated, and edition/tenant-aware.

Testing Layers¶

Unit Tests

Scope: individual classes, functions, helpers.
Isolation: no I/O; mocks/stubs for dependencies.
Coverage: critical business logic, edge cases, input validation.
Tools: xUnit/MSTest + mocking libraries.

Integration Tests

Scope: service + real dependencies (DB, message bus, cache).
Run in ephemeral test environment or with containerized dependencies (Docker Compose).
Verify data persistence, event publishing, cache consistency.
CI: run after unit tests, on every PR.

Contract Tests

Consumer-Driven Contracts (CDC) for REST/gRPC/event contracts.
Validate request/response schema, headers (tenantId, traceId, edition), and error envelopes.
Ensure backward compatibility when evolving APIs.
Tools: Pact or custom CDC harness for events.

End-to-End (E2E) Tests

Full user flows (sign-up → tenant onboarding → subscription → flag evaluation).
Run against preview environments per PR and nightly in staging.
Browser/UI tests for tenant portal & admin console (Playwright/Selenium).
Must verify multi-tenancy invariants (no cross-tenant leaks).

Testing in Production (TiP)

Synthetic checks: simulate user sign-in, onboarding, and critical APIs continuously from multiple regions.
Canary releases verified by synthetic flows before general rollout.
Guardrails: PII-free test tenants with synthetic data only.

Synthetic Checks & Monitoring¶

Synthetic tenants: seeded with representative data (usage, billing, flags, notifications).
Continuous checks:
- Auth & token issuance
- Tenant onboarding latency
- Flag evaluation correctness
- Billing usage metering
- Webhook delivery (to test sink)
Alerting: failures trigger incident; linked to error budgets.

Test Data & Fixtures¶

Fixture library with tenant archetypes (SMB, Mid-market, Enterprise).
Default seeds: product editions (Free/Standard/Enterprise), feature flags, subscriptions.
Synthetic PII: generated datasets for GDPR-safe testing (never real customer data).
Replayable scenarios: standard JSON/SQL seeds for onboarding, billing events, AI orchestration.
Secrets: fake keys/tokens for tests; real secrets only in prod via Key Vault.

Quality Gates in CI/CD¶

Unit test coverage: ≥ 80% for critical domains, reported in PR.
Integration/contract tests: must pass on PR merge; CDC verified against consumers.
Static analysis: code quality, security scans (PII detection, secret scanning).
E2E smoke tests: run on preview env per PR; block merge if failures.
Performance regression checks: latency budgets per context enforced in load tests.
Observability smoke tests: ensure traces/logs/metrics present in preview env.

Observability of Quality¶

Dashboards show test pass rate trends, time-to-fix broken builds, and coverage by bounded context.
Trace synthetic flows with special tenant IDs (synthetic-*) for easy filtering.
Alerts if coverage drops below baseline or if repeated flaky tests exceed threshold.

Risks & Mitigations¶

Risk	Impact	Mitigation
Flaky E2E tests	Erode trust, block CI	Quarantine flaky tests; retry-once policy; root-cause analysis required
Synthetic checks masking real issues	Blind spots	Expand test tenant archetypes; rotate scenarios regularly
Contract drift	Breaking downstream consumers	Enforce CDC tests in CI; require ADR for breaking changes
PII in test data	Compliance risk	Always use synthetic/fake data; automated scanners in CI

Solution Architect Notes¶

Build one reusable test harness per layer (unit, integration, CDC, E2E) and apply it across all generated services.
Treat synthetic tenants as production monitoring assets: they validate real infrastructure and SLOs continuously.
Quality is enforced by automation, not heroics; broken builds must be fixed before merge.
Integrate quality dashboards into the Admin Console for operator visibility (test status, SLO compliance, error budgets).

CI/CD & Release Strategy¶

Purpose¶

Define a standardized, reusable CI/CD blueprint for all SaaS products generated by the factory. The strategy ensures consistent build → test → promote → release flows with policy gates, preview environments, artifact signing, and progressive delivery patterns. It provides ready-to-use pipeline templates in Azure DevOps, extendable to GitHub Actions or other CI/CD platforms.

Branching & Source Strategy¶

Main Branch (main): Always releasable; mirrors the production environment.
Feature Branches (feature/*): Used for development; PRs required for merge; each PR auto-deploys to a preview environment.
Release Branches (release/*): Optional; for staged rollouts, LTS, or regulated products.
Hotfix Branches (hotfix/*): Patch critical issues in production; merge back to main and any active release/* branch.

Policy: All merges to main require passing quality gates (tests, security scans, coverage, lint).

Pipeline Topology¶

Stages

Build
- Restore dependencies, compile, lint, unit test, coverage report.
- Build container images (multi-arch if needed).
- Sign artifacts and produce SBOM (Software Bill of Materials).
Test
- Integration + contract tests against containerized dependencies (SQL, Service Bus, Redis).
- Security scans (SAST, dependency CVE checks, IaC validation).
- Quality gates: block on failing tests, PII detectors, or critical vulnerabilities.
Package & Publish
- Push signed images to Azure Container Registry (ACR).
- Publish Helm charts, IaC bundles (Bicep/Pulumi), and API contracts (OpenAPI/Protobuf).
Deploy (Preview/Stage/Prod)
- Deploy via Helm to ACA/AKS depending on target.
- Preview environments per PR (auto-provisioned; ephemeral).
- Stage mirrors production topology for final validation.
Release
- Progressive delivery patterns (blue/green, canary).
- Feature flag toggles for dark launches.
- Automatic rollback if error budget breached.

Artifact Flow¶

flowchart LR
  Code[Source Repo] -->|PR| Build
  Build --> Test
  Test --> Package
  Package --> ACR[(Azure Container Registry)]
  Package --> Helm[Helm Charts Repo]
  ACR --> StageEnv[Staging Env]
  Helm --> StageEnv
  StageEnv --> ProdEnv[Production Env]
  ProdEnv --> Users[Tenants]


Immutable artifacts: all builds produce signed, versioned images (app:1.2.3-commit.sha; the "+" used for SemVer build metadata is not valid in a container image tag).
SBOM & provenance: stored in artifact registry; verifiable against policy gates.

Preview Environments¶

Every PR spins up an ephemeral namespace with isolated ingress (pr-123.factory.example.com).
Uses feature branch images + seeded synthetic tenants.
Destroyed automatically on PR close/merge.
Includes OTel telemetry, synthetic flows, and seeded test data.

Policy Gates¶

Enforced in pipelines and at merge time

Linting & Static Analysis: must pass with no critical issues.
Unit/Integration Coverage: ≥ 80% for critical services.
SAST & Dependency Scans: block on CVEs above CVSS 7.
Contract Tests: must pass against consumer suites.
Compliance Checks: PII/secret scanning, license compliance.
Artifact Signing: unsigned artifacts rejected downstream.
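
The gates above can be expressed as a single evaluation step that a pipeline runs before allowing a merge. The following is a minimal sketch in Python, not the factory's actual pipeline code: the `BuildReport` type, its field names, and `evaluate_gates` are hypothetical, and "above CVSS 7" is interpreted as strictly greater than 7.0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildReport:
    """Hypothetical summary of one pipeline run; field names are illustrative."""
    critical_lint_issues: int
    coverage_percent: float
    max_cvss: float            # highest CVSS score among detected CVEs, 0.0 if none
    contract_tests_passed: bool
    compliance_findings: int   # PII, secret, or license findings
    artifact_signed: bool


def evaluate_gates(report: BuildReport, critical_service: bool = True) -> list:
    """Return the names of violated gates; an empty list means the merge may proceed."""
    violations = []
    if report.critical_lint_issues > 0:
        violations.append("lint")
    # The 80% floor applies to critical services only, as stated in the list above.
    if critical_service and report.coverage_percent < 80.0:
        violations.append("coverage")
    # "Above CVSS 7" is read here as strictly greater than 7.0 (an assumption).
    if report.max_cvss > 7.0:
        violations.append("cve")
    if not report.contract_tests_passed:
        violations.append("contracts")
    if report.compliance_findings > 0:
        violations.append("compliance")
    if not report.artifact_signed:
        violations.append("signing")
    return violations
```

Returning the full list of violations, rather than failing on the first one, lets a PR comment show every gate that needs attention in one pass.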

Progressive Release Patterns¶

Blue/Green: Deploy the new version alongside the old one and switch traffic via the Gateway route. Rollback = switch back.
Canary: Route a fraction of traffic (e.g., 5%/10%/25%) to the new version; auto-promote or roll back based on SLOs.
Dark Launch: Release behind a feature flag; only test tenants see the feature until the flag is flipped.
Ring-Based: Roll out by tenant tier (internal → Free tenants → Standard → Enterprise).
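
As an illustration of the canary pattern, the sketch below advances traffic through the example weights (5%, 10%, 25%) and returns to 0% when the observed error rate exceeds the SLO. The function name, the final 100% step, and the use of a single error-rate signal are assumptions made for this example; a real rollout controller would typically evaluate several SLO indicators.

```python
# Traffic weights in percent: 5/10/25 come from the text, 100 is full promotion.
CANARY_STEPS = (5, 10, 25, 100)


def next_canary_weight(current: int, error_rate: float, slo_error_rate: float) -> int:
    """Return the next traffic weight for the new version.

    Rolls back to 0 when the SLO is violated; otherwise advances one step.
    """
    if error_rate > slo_error_rate:
        return 0  # automatic rollback: all traffic returns to the old version
    for step in CANARY_STEPS:
        if step > current:
            return step
    return 100  # already fully promoted
```

For example, `next_canary_weight(5, 0.002, 0.01)` advances to 10, while any call with an error rate above the SLO returns 0.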

Environment Topology¶

Dev: local + CI ephemeral test containers.
Preview: per PR; ephemeral; seeded with synthetic tenants.
Stage: mirrors Prod, stable integration tests; nightly E2E.
Prod: multi-region active/active for critical services, edition-aware quotas.

Observability in CI/CD¶

Pipelines emit telemetry (duration, pass/fail, flaky tests).
Deployment traces tagged with commit SHA, PR ID, and tenant context for preview.
Release dashboards: SLO compliance, error budgets, rollout progress.

Risks & Mitigations¶

Risk	Impact	Mitigation
Long-lived preview envs	Cost, stale tenants	TTL auto-cleanup, manual extend flag
Flaky tests blocking merges	Slows releases	Retry-once policy, flaky test quarantine
Progressive release misconfig	Partial outage	Automated rollback, rollback runbooks
Supply chain attack	Compromise of dependencies	Artifact signing + SBOM + dependency scanning

Solution Architect Notes¶

Treat pipelines as product templates: every generated SaaS inherits the same CI/CD skeleton.
Keep artifact promotion immutable: rebuilds are not allowed between Stage and Prod.
Bake error budgets into progressive rollout logic (auto-stop rollout if violated).
Preview environments are not optional—every PR must deploy to ensure shift-left validation.

Infrastructure & Platform Runtime¶

Purpose¶

Define a portable, Azure-first runtime topology for the SaaS Factory, covering compute (AKS/ACA), messaging (Service Bus), data (Azure SQL + optional Mongo/Redis/Blob), identity (Managed Identities, Key Vault), networking (private endpoints, WAF), and observability (OTel → Prometheus/Grafana/Logs). This baseline is multi-tenant aware, enforces zero trust, and is designed to be materialized via IaC (Pulumi/Bicep).

Resource Topology (high-level)¶

flowchart TB
  subgraph Edge
    AFD[Azure Front Door + WAF]
  end

  subgraph App
    AKS[(AKS)]:::core
    ACA[(Azure Container Apps)]:::core
    GATE["API Gateway (YARP)"]
    SVC[Core Microservices]
    JOBS[Jobs/KEDA/Hangfire]
  end

  subgraph Data
    SQL[(Azure SQL PaaS)]
    MONGO[(MongoDB Atlas/AKS opt)]
    REDIS[(Azure Cache for Redis)]
    SB[(Azure Service Bus)]
    BLOB[(Azure Blob Storage)]
    KV[(Azure Key Vault)]
  end

  subgraph Observability
    OTL[OTel Collectors]
    LOGS[Log Analytics / Storage]
    GRAF[Prometheus/Grafana]
  end

  AFD -->|TLS 1.3| GATE
  GATE -->|mTLS| SVC
  SVC -->|AMQP| SB
  SVC -->|ADO.NET| SQL
  SVC -->|Redis| REDIS
  SVC -->|Blob SDK| BLOB
  SVC -.-> MONGO
  JOBS --> SB
  AKS --> OTL
  ACA --> OTL
  OTL --> LOGS
  OTL --> GRAF
  SVC --> KV
  GATE --> KV

  classDef core fill:#0b6,stroke:#094,color:#fff;


Compute Runtime¶

AKS (Kubernetes)

Best for high scale, service mesh (mTLS), advanced network policies, sidecars (e.g., OTel, authz), and custom autoscaling.
Use managed identities for pods, Azure CNI with Calico (or Azure) network policies, and Pod Security Standards (baseline/restricted).

ACA (Azure Container Apps)

Simpler ops; KEDA-native autoscaling on HTTP concurrency, CPU, or Service Bus queue depth.
Ideal for jobs, event processors, and smaller footprints.
Can run alongside AKS (hybrid) for jobs plane and burst capacity.

Baseline rule: same container images and contracts run on both; runtime choice is an operational concern per product/environment.

Networking & Identity¶

Perimeter: Azure Front Door + WAF terminates public TLS; only the Gateway is internet-exposed.
Private Networking: services/data in private subnets; deny by default.
Private Endpoints: Azure SQL, Service Bus, Storage, and Key Vault via Private Link; no public exposure.
mTLS Everywhere: Gateway↔Services and Service↔Service; cert rotation automated.
Workload Identity: AKS/ACA workloads use Managed Identities; no static secrets in pods.
Egress Control: user-defined routes + firewall for outbound; webhook egress allow-lists.

Data Layer¶

Azure SQL (Primary Store)
- Pooled tenancy by default (RLS or repository guards).
- Geo-replication, automated backups, PITR; TDE + optional CMEK.
- Per-context schemas with strict ownership; migrations gated in CI.
MongoDB (Optional)
- For document-heavy payloads (e.g., notification templates).
- Private peering or self-hosted on AKS with operator; encryption at rest.
Redis
- Low-latency flag evaluation, token/claims cache, idempotency windows.
- TLS-only; tenant-prefixed keys; eviction policies controlled per module.
Blob Storage
- Artifacts, exports, evidence bundles, AI outputs.
- Container per context; immutability policies for audit exports.

Messaging Layer¶

Azure Service Bus
- Topics/queues per bounded context; DLQs enabled; MaxDeliveryCount tuned per workload.
- MassTransit implements outbox/inbox, retries with jitter, and saga orchestration.
- Namespace partitioning (prod vs non-prod; per region).
- Private Endpoints; role assignments to workloads via managed identity.
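
To make "retries with jitter" concrete, here is a small language-neutral sketch of exponential backoff with full jitter. It is not MassTransit's API; the function name, base delay, and cap are assumptions chosen for the example.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one delay (in seconds) per retry attempt using full jitter.

    The ceiling doubles on each attempt up to `cap`; the actual delay is drawn
    uniformly between 0 and that ceiling so consumers do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

Spreading retries randomly matters most after a broker hiccup, when many consumers would otherwise hit Service Bus again at the same instant.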

Observability Stack¶

OpenTelemetry Collectors as DaemonSet/sidecar (AKS) or as ACA app for ingestion.
Metrics to Prometheus (managed or self-hosted), dashboards in Grafana.
Logs to Log Analytics + long-term archive to Blob.
Trace context propagated via W3C Trace Context (traceparent header), with mandatory attributes (tenantId, edition).
Alerting: SLO-based alerts wired to incident channels; error budgets shown in dashboards and Admin Console.
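
The sketch below shows how a service or test harness might validate an incoming traceparent header and check the mandatory span attributes. It handles only the version-00 layout of the W3C header (version, trace-id, parent-id, flags); the function names and the dictionary shape are hypothetical.

```python
import re

# Version-00 layout: 2 hex (version) - 32 hex (trace-id) - 16 hex (parent-id) - 2 hex (flags).
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

# Mandatory attributes named in this section.
REQUIRED_ATTRIBUTES = ("tenantId", "edition")


def parse_traceparent(header: str):
    """Parse a traceparent header; return None if it is malformed or invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def missing_attributes(span_attributes: dict) -> list:
    """Return the mandatory attributes that are absent or empty on a span."""
    return [key for key in REQUIRED_ATTRIBUTES if not span_attributes.get(key)]
```

A check like `missing_attributes` could run in the observability smoke tests so that spans without tenantId or edition fail the preview environment early.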

Security & Compliance Baselines¶

Zero Trust: no implicit trust; all traffic authenticated and authorized.
Key Vault for secrets, keys, and certs; automated rotation; Just-In-Time access for operators.
Container security: image signing/SBOM; policy gates deny unsigned or CVE-violating images.
Data residency: per-tenant region enforced at edge; DR within paired region only.
Audit: append-only audit store; WORM policies for exports.

Environments & Promotion¶

Dev/Preview: ephemeral namespaces per PR; seeded synthetic tenants.
Stage: mirrors prod; shadow/canary experiments; contract & load testing.
Prod: multi-zone AKS/ACA; optional active-active across regions for critical services.
Promotion: immutable artifacts from ACR; Helm/OPA policies guard releases.

Autoscaling & Resilience¶

HTTP services: HPA (AKS) on CPU/RPS/p95 latency; ACA on RPS/queue lag.
Workers: KEDA triggers on Service Bus depth/lag; DLQ replayers isolated.
Resilience policies: timeouts, retries (idempotent), circuit breakers, bulkheads defined per dependency.
Chaos: scheduled chaos experiments in non-prod; steady-state SLOs verified.

Resource Graph (baseline components)¶

Area	Azure Resource
Edge	Front Door Standard/Premium, WAF Policy, Public DNS
Compute	AKS (node pools per workload class), ACA Environment
Networking	VNets, Subnets, Private DNS Zones, Firewall/UDR
Identity	Managed Identities, Key Vault
Messaging	Service Bus Namespace (Topics/Queues, DLQ)
Data	Azure SQL (Geo-repl), Redis, Blob Storage, optional Mongo
Observability	Log Analytics, Managed Prometheus/Grafana, OTel Collectors
Security	Defender for Cloud policies, Microsoft Entra ID Conditional Access (for ops)

IaC Strategy (Pulumi-first; Bicep optional)¶

Stacks per environment: dev, stage, prod, with region parameters and SKU right-sizing.
Composition: core platform stack (network, identity, observability) + workload stacks (bus, data, compute).
Policy-as-code: OPA/Conftest in CI to validate IaC against security/compliance baselines.
Outputs: connection endpoints via Private Link, DNS zones, MI principal IDs, and signed URLs for bootstrap.

Failure Modes & Recovery¶

Regional Service Bus degradation → autoscale consumers, buffer at producers with backpressure; consider cross-namespace failover for enterprise-grade products.
SQL hotspot → promote tenant to schema/DB; enable read replicas for heavy reads; cache hot lookups in Redis.
Key Vault throttling → application-side caching with TTL; staggered rotation windows.
Front Door anomaly → failover to secondary profile; serve maintenance page; keep Admin APIs internal-only until edge recovers.

Solution Architect Notes¶

Start small with ACA where possible; graduate to AKS for mesh, sidecars, or complex multi-tenant scaling.
Keep Private Link ubiquitous—do not allow public endpoints on data plane resources.
Make observability a dependency: deploy OTel & dashboards before tenant-facing services.
Enforce workload identity end-to-end; any secret-based exception must be time-bound with an explicit waiver and monitoring.

FinOps & Cost Governance¶

Purpose¶

Ensure the SaaS Factory produces solutions that are cost-efficient, scalable, and predictable across tenants and editions. This requires embedding quotas, autoscaling rules, budget alerts, and cost dashboards into the baseline platform. By enforcing edition-aware policies, the factory guarantees fairness, prevents noisy-neighbor risks, and aligns operating costs with business revenue models.

FinOps Principles¶

Cost visibility by tenant/edition: every workload and resource tagged with tenantId, edition, productId.
Guardrails over gates: developers can ship features quickly but cannot bypass budget alerts or quota checks.
Edition-aware scaling: Enterprise tenants receive higher quota ceilings; Free tenants constrained by default.
Performance and cost together: every feature must have a defined capacity model and cost impact.

Capacity & Scaling Model¶

Compute (AKS/ACA)
- Autoscale on CPU, RPS, and queue depth (KEDA).
- Quotas mapped to edition (e.g., Free = 1 vCPU cap, Enterprise = 8 vCPU burst).
Messaging (Service Bus)
- Namespace capacity (Premium messaging units) allocated by tenant tier.
- DLQ growth alerts trigger cost/performance investigation.
Data (SQL, Blob, Redis)
- Tenant-level caps (rows, storage GB) with upgrade paths.
- Elastic pools for pooled tenants; isolated instances for Enterprise.
Observability
- Log retention tiered: Free = 7d, Standard = 30d, Enterprise = 90d+.
- High-cost metrics (e.g., detailed tracing) reserved for Enterprise or by opt-in flag.

Edition-Aware Quotas¶

Resource	Free	Standard	Enterprise
API Rate Limit (req/min)	60	600	3000
Storage (GB)	1	50	500+
Concurrent Jobs	5	50	200
Webhooks	1 sub, 7d logs	5 subs, 30d logs	20 subs, 90d logs
Observability Retention	7 days	30 days	90 days

Quotas are enforced at gateway, service API guards, and DB-level policies.
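
A service API guard could look up the table above as plain data. The sketch below is illustrative only: the dictionary keys and `within_quota` are hypothetical names, and the Enterprise storage value "500+" is modeled as a 500 GB baseline.

```python
# Numeric limits copied from the quota table; key names are illustrative.
EDITION_QUOTAS = {
    "Free":       {"api_rate_per_min": 60,   "storage_gb": 1,   "concurrent_jobs": 5},
    "Standard":   {"api_rate_per_min": 600,  "storage_gb": 50,  "concurrent_jobs": 50},
    # "500+" in the table is treated as a 500 GB baseline (an assumption).
    "Enterprise": {"api_rate_per_min": 3000, "storage_gb": 500, "concurrent_jobs": 200},
}


def within_quota(edition: str, resource: str, current_usage: int, requested: int = 1) -> bool:
    """Return True if adding `requested` units keeps the tenant at or under its limit."""
    limit = EDITION_QUOTAS[edition][resource]
    return current_usage + requested <= limit
```

For instance, a Free tenant with 4 running jobs may start one more, but a fifth running job blocks the next request.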

Cost Dashboards & Tagging¶

Tagging policy: all resources tagged with env, product, tenantId, edition.
Dashboards: Azure Cost Management + Grafana integration, filtered by product/tenant/edition.
KPIs tracked:
- Cost per tenant per month
- Cost per active user (CPU hours / requests)
- Cost of observability (log/metric ingestion rates)
- Idle resource cost vs. active usage

Budget Alerts & Guardrails¶

Budgets set per environment (Dev/Stage/Prod) and per product.
Alerts triggered at 50/75/90/100% thresholds.
Automated Slack/Teams notifications to product teams.
Stopgap policies: throttle non-critical workloads if spend breaches thresholds in Free/Trial environments.
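
The threshold alerts can be computed by comparing spend before and after each cost refresh, so that each threshold fires once when it is crossed. This is a minimal sketch with hypothetical names; integer arithmetic is used to avoid floating-point edge cases at exact thresholds.

```python
# Alert thresholds in percent of budget, as listed above.
THRESHOLDS = (50, 75, 90, 100)


def crossed_thresholds(previous_spend: int, current_spend: int, budget: int) -> list:
    """Return thresholds crossed between two spend readings (same currency units)."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [
        t for t in THRESHOLDS
        # Equivalent to previous% < t <= current%, without dividing.
        if previous_spend * 100 < t * budget <= current_spend * 100
    ]
```

With a budget of 1000, moving from 400 to 800 in spend yields the 50% and 75% alerts; a later reading that stays at 800 yields none.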

Performance Budgets¶

Each service defines baseline throughput, latency, and cost envelope.
Perf tests run as part of release process:
- Tenant onboarding latency ≤ 60s (p95)
- API p95 latency ≤ 200 ms (reads) / 350 ms (writes)
- Cost per 1000 requests within target band (< $0.05 for pooled tenants)
Perf regression gates in CI/CD: load tests in stage environment with synthetic tenants; fail build if >10% degradation.
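
A regression gate of this kind combines an absolute budget with a relative comparison against the last accepted baseline. The following sketch assumes p95 latencies in milliseconds and hypothetical function names; the 10% tolerance and the 200 ms read budget come from the text.

```python
def degradation_percent(baseline_ms: float, candidate_ms: float) -> float:
    """Relative p95 change versus the baseline, in percent (positive means slower)."""
    return 100.0 * (candidate_ms - baseline_ms) / baseline_ms


def perf_gate(baseline_p95_ms: float, candidate_p95_ms: float,
              budget_ms: float, max_degradation_pct: float = 10.0) -> bool:
    """Pass only if the candidate meets the absolute budget and the regression limit."""
    within_budget = candidate_p95_ms <= budget_ms
    within_regression = degradation_percent(baseline_p95_ms, candidate_p95_ms) <= max_degradation_pct
    return within_budget and within_regression
```

With a 180 ms baseline and a 200 ms read budget, a 195 ms candidate passes (about 8.3% slower), while 199 ms fails the 10% rule even though it is still inside the budget.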

Observability for FinOps¶

Traces include tenantId + cost-relevant metrics (CPU time, DB queries, cache hits).
Metrics: cost_per_tenant, quota_consumed_percent, autoscale_events.
Alerts:
- quota_consumed_percent > 90% for Standard/Enterprise → notify tenant admin via portal.
- autoscale_events > threshold → investigate cost/perf drift.

Risks & Mitigations¶

Risk	Impact	Mitigation
Noisy neighbors in pooled DB	Performance degradation	Quotas, RLS, hot-tenant promotion to schema/db
Cost overrun in observability	High OPEX	Tiered retention, sampling, enterprise opt-in
Under-provisioned Enterprise tenant	SLA breach	Autoscale headroom, proactive capacity tests
Free tenants gaming system	Abuse of resources	Strict quotas, rate limits, throttling policies

Solution Architect Notes¶

Bake FinOps into templates: every service scaffold includes quota configs, cost tags, and perf test harness.
Expose cost visibility to tenants: portal shows per-tenant usage and cost breakdowns (transparency + upsell driver).
Use synthetic perf tenants in stage to continuously model cost-per-tenant at scale.
Cost + performance are first-class SLOs: treat regressions in either as blockers to release.

Governance, ADRs & Change Management¶

Purpose¶

Establish a structured governance model for architectural decisions, API contracts, and platform evolution. The goal is to ensure that every SaaS product generated by the factory is traceable, reviewable, and predictable in its decision-making, while allowing for safe innovation through structured change management.

ADR Process & Repository¶

Format: log4brains ADRs in Markdown (docs/adr/).
Structure:
- Title & ID (incremental, 0001-title.md)
- Status: Proposed / Accepted / Superseded / Rejected
- Context → Decision → Consequences → Alternatives → References
Lifecycle:
1. Draft ADR opened with PR.
2. Peer review via Architecture Guild or Review Board.
3. Accepted ADR merged → published in ADR site.
4. If superseded, new ADR links back with rationale.
Templates: ADR template provided in docs/templates/adr-template.md.

Decision Lifecycle¶

Trigger: new requirement, tech evaluation, incident, compliance need.
Proposal: ADR draft with context/problem/alternatives.
Review: async discussion in PR; optional Architecture Guild sync.
Decision: maintainers approve/merge ADR.
Implementation: tracked via linked Azure DevOps epics/features.
Sunset: decision reviewed after 12–18 months, or upon major platform change.

Principles:

One decision per ADR.
ADRs are immutable once accepted (except metadata).
Supersession, not deletion.

Review Boards¶

Architecture Guild: senior engineers, product architects, security/privacy officers.
Review cadence: weekly triage; quarterly backlog review.
Charter:
- Maintain decision traceability.
- Balance innovation with stability.
- Escalate major trade-offs (cost, compliance, security).
Voting model: consensus preferred, fallback majority.

Contract Versioning & Deprecation¶

API Contracts (REST/gRPC/events)

Source of truth in contracts/ folder (OpenAPI/Protobuf/JSON Schemas).
Versioning:
- REST: /v1/..., /v2/...
- gRPC: package service.v1
- Events: eventName.v1, eventName.v2
Compatibility: additive changes allowed in same version; breaking changes → new major.

Deprecation & Sunset Policy

Announcement: contract flagged as deprecated in spec + docs.
Headers: Deprecation, Sunset, and Link to changelog returned in responses.
Timeline:
- Minimum 12-month overlap for deprecated APIs.
- Enterprise tenants can negotiate extended support.
- Telemetry monitors usage of deprecated versions.
Removal: only after telemetry shows <1% usage + published sunset date passed.
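
The removal rule combines two conditions: usage below 1% and a sunset date that has passed. A sketch of that check, with hypothetical parameter names and call counts standing in for whatever usage telemetry is available:

```python
from datetime import date


def can_remove(deprecated_calls: int, total_calls: int, sunset: date, today: date) -> bool:
    """Return True when a deprecated contract version is eligible for removal."""
    if today <= sunset:
        return False  # the published sunset date has not passed yet
    if total_calls == 0:
        return True   # no traffic at all, so usage is trivially below 1%
    # Integer form of deprecated_calls / total_calls < 1%.
    return deprecated_calls * 100 < total_calls
```

Both conditions are required: low usage before the sunset date is not enough, and a passed sunset date with 1% or more usage still blocks removal.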

Governance & Change Management¶

Config changes: go through Config Change Review (CCR) process, with audit trail in Git + change tickets.
Schema changes: must be backward compatible; validated in CI via migration tests.
Secrets/keys: rotated via automation; ADR required if new secret store introduced.
Edition policies: changes to edition entitlements logged as ADRs + documented in release notes.
Emergency changes: handled via expedited ADR (lightweight doc, ratify post-incident).

Observability of Decisions¶

ADR site (log4brains + MkDocs Material) embedded in developer portal.
Decision dashboards: show active ADRs, deprecated ones, and upcoming sunsets.
Traceability: every epic/feature in Azure DevOps links to one or more ADR IDs.

Risks & Mitigations¶

Risk	Impact	Mitigation
Decision sprawl	Inconsistent approaches across services	Central ADR repo + guild oversight
Slow reviews	Blocks delivery	Async PR reviews + time-boxed discussions
Deprecated APIs in use	Security/compliance gaps	Telemetry-driven enforcement; forced sunset dates
Lack of traceability	Regulatory audit failures	ADR portal + linked DevOps work items

Solution Architect Notes¶

Treat ADRs as first-class artifacts: they evolve the platform as much as code.
Encourage lightweight but frequent ADRs — better many small scoped decisions than one bloated doc.
Deprecation without telemetry is a blind spot: enforce contract usage dashboards.
Change governance should be templatized in the factory so every new product inherits the same discipline.

Risk Register, Security Exceptions & Waivers¶

Purpose¶

Establish a centralized, continuously updated risk register for the SaaS Factory and its generated products. Define a matrix-based scoring model (likelihood × impact), provide structured mitigation plans, and standardize the handling of security exceptions and waivers. This ensures risks are visible, time-bound, and owned, while maintaining regulatory and audit readiness.

Risk Matrix Model¶

Scoring Dimensions

Likelihood: Rare (1), Unlikely (2), Possible (3), Likely (4), Almost Certain (5)
Impact: Negligible (1), Minor (2), Moderate (3), Major (4), Critical (5)

Risk Level = Likelihood × Impact

Score	Level	Response
1–4	Low	Accept; monitor
5–9	Medium	Mitigate or accept with waiver
10–16	High	Mitigation required, track in DevOps
17–25	Critical	Immediate remediation, exec visibility
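
The scoring model maps directly to a small function. The sketch below uses the bands from the table; the function name is illustrative.

```python
def risk_level(likelihood: int, impact: int) -> tuple:
    """Return (score, level) for likelihood and impact values on the 1-5 scales."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be between 1 and 5")
    score = likelihood * impact
    if score <= 4:
        level = "Low"
    elif score <= 9:
        level = "Medium"
    elif score <= 16:
        level = "High"
    else:
        level = "Critical"
    return score, level
```

For example, a risk with likelihood 3 and impact 5 scores 15, which falls in the High band.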

Example Risk Register¶

ID	Risk Description	Likelihood	Impact	Score	Owner	Mitigation	Status
R-001	PII leakage in logs	3	5	15 (High)	Security Eng	OTel PII scrubbing, CI scanners, redaction lib	Mitigation in progress
R-002	Multi-tenant data bleed (RLS misconfig)	2	5	10 (High)	DB Architect	Row-level security tests, tenantId invariant, contract tests	Open
R-003	Stale TLS certs (expired)	3	3	9 (Medium)	Ops Lead	Automated cert rotation, alerts 30/7/3 days	Closed
R-004	Dependency CVEs in OSS libs	4	4	16 (High)	Eng Enablement	Renovate/Dependabot, CVE scanning in CI	Ongoing
R-005	Excess observability costs (log storm)	4	2	8 (Medium)	FinOps	Sampling, quotas, alerting	Open

Security Exceptions & Waivers¶

Workflow

Request: Team submits exception form with context, rationale, alternatives considered.
Review: Security Board evaluates risk level and business justification.
Approval: If granted, exception logged with expiry date and remediation plan.
Tracking: Exception ID linked to DevOps work item; progress reviewed monthly.
Expiry: Automatic reminders 30/7/1 days before expiration; must renew or close.

Waiver Metadata (template)

Waiver ID: SEC-WVR-2025-001
Requested by: Team/Owner
Scope: service/context affected
Risk reference(s): R-001, R-004
Justification: why no immediate mitigation possible
Expiry date: e.g., 90 days
Mitigation plan: concrete steps + timeline
Approver: Security Officer / Architecture Guild
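
The waiver template and the 30/7/1-day reminders can be modeled as a small record with a derived expiry date. This is a sketch under stated assumptions: the `Waiver` class, its field names, and the choice to store an approval date plus a validity period are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Reminder offsets (days before expiry) from the workflow above.
REMINDER_DAYS = (30, 7, 1)


@dataclass(frozen=True)
class Waiver:
    """Hypothetical waiver record mirroring a subset of the template fields."""
    waiver_id: str
    risk_refs: tuple
    approved_on: date
    validity_days: int = 90

    @property
    def expires_on(self) -> date:
        return self.approved_on + timedelta(days=self.validity_days)


def reminder_due(waiver: Waiver, today: date) -> bool:
    """True on the days when an automatic expiry reminder should be sent."""
    return (waiver.expires_on - today).days in REMINDER_DAYS


def is_expired(waiver: Waiver, today: date) -> bool:
    """True once the expiry date is reached; the waiver must be renewed or closed."""
    return today >= waiver.expires_on
```

A daily scheduled job could iterate over all open waivers, send reminders where `reminder_due` is true, and flag expired entries for the monthly review.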

Exception Categories¶

Technical Debt: Legacy library pending replacement.
Operational Constraint: Vendor service does not yet support mTLS.
Regulatory Delay: Awaiting legal guidance before applying stricter residency policy.
Business Urgency: Feature launch requires temporary compromise (time-boxed).

Governance & Review¶

Security & Architecture Board owns risk register reviews.
Monthly: review new risks, open waivers, and expired waivers.
Quarterly: risk posture reassessment; report to exec sponsors.
Audit: risk register and waiver log stored in Git (docs/governance/risks.md, docs/governance/waivers.md).

Observability of Risk Posture¶

Dashboards track:
- Open risks by severity.
- Exceptions expiring in next 30 days.
- Distribution of risk categories (security, compliance, cost, operational).
Alerts: Slack/Teams notification for critical risk creation or waiver expiry.

Risks & Mitigations (meta-level)¶

Meta Risk	Mitigation
“Paper-only” register (unused)	Integrate with DevOps epics; risk must have linked tasks
Perpetual waivers	Force expiry; auto-archive stale exceptions
Inconsistent scoring	Use standard matrix; train reviewers; cross-review
Lack of visibility	Dashboards & ADR references in portal

Solution Architect Notes¶

Waivers are not forever: each must be tied to a remediation timeline and sunset.
The risk register should evolve with ADRs—a major architectural decision should consider risk impact explicitly.
Treat the register as live operational telemetry, not a static doc.
Encourage engineers to proactively raise risks; the cost of under-reporting is far higher than managing a larger register.

Rollout & Tenant Onboarding / Cutover¶

Purpose¶

Provide a standardized, low-risk rollout framework for new products, editions, and tenants. Define phased rollout strategies, onboarding workflows, migration/cutover playbooks, backout strategies, and communication templates. Ensure tenant experience is consistent, compliant, and reversible in case of failures.

Rollout Strategy¶

Phased Rollouts

Internal First: internal tenants (synthetic + staff tenants) validate production infrastructure.
Pilot Tenants: 3–5 selected customers (Enterprise or early adopters).
Regional Expansion: enable per geography or residency zone.
General Availability: open onboarding to all editions.

Progressive Controls

Feature flags gate new capabilities.
Canary deployments split traffic by tenant cohort or edition tier.
Automated rollback if SLOs violated during rollout.

Tenant Onboarding Workflows¶

New Tenant Onboarding

Tenant Admin signs up (via portal or API).
Tenant entry created in Tenant Service (pooled by default).
Isolation model applied (pooled/schema/db).
Default edition + entitlements assigned.
Baseline config + feature flags seeded.
Synthetic smoke tests run under new tenant ID.
Tenant marked “active” in registry.
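
The onboarding steps after sign-up can be sketched as an ordered pipeline in which the tenant becomes active only when every step, including the synthetic smoke tests, succeeds. The step names and the handler-dictionary shape are assumptions for illustration; in practice each handler would call the relevant service.

```python
# Ordered steps that follow sign-up; names are illustrative.
ONBOARDING_STEPS = (
    "create_tenant_entry",
    "apply_isolation_model",
    "assign_edition",
    "seed_config_and_flags",
    "run_smoke_tests",
)


def onboard_tenant(tenant_id: str, handlers: dict) -> dict:
    """Run each step in order; stop at the first failure and never mark it active."""
    completed = []
    for step in ONBOARDING_STEPS:
        if not handlers[step](tenant_id):
            return {"tenant_id": tenant_id, "status": "failed",
                    "failed_step": step, "completed": completed}
        completed.append(step)
    return {"tenant_id": tenant_id, "status": "active",
            "failed_step": None, "completed": completed}
```

Recording the completed steps gives operators a clear resume or cleanup point when onboarding fails partway through.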

Migration / Cutover Flow

Trigger: product upgrade, edition change, residency migration.
Steps:
1. Freeze writes (short downtime window if needed).
2. Export + transform data (schema/edition-aware).
3. Import into new store (schema/db).
4. Validate with synthetic checks (data count, checksum, smoke tests).
5. Switch traffic at gateway (new edition endpoints).
6. Monitor SLOs; keep old infra on standby for rollback window.
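
Step 4 (validation before the traffic switch) can be illustrated with an order-independent row count and checksum comparison. This is a simplified sketch with hypothetical function names; it assumes rows are small enough to hash in memory, whereas a production pipeline would more likely compute checksums per table or per batch inside the database.

```python
import hashlib


def table_checksum(rows) -> tuple:
    """Return (row_count, sha256_hex) independent of row order."""
    encoded = sorted(repr(tuple(row)).encode("utf-8") for row in rows)
    digest = hashlib.sha256()
    for item in encoded:
        digest.update(item)
        digest.update(b"\n")  # separator so adjacent rows cannot blur together
    return len(encoded), digest.hexdigest()


def safe_to_switch(source_rows, target_rows, smoke_tests_passed: bool) -> bool:
    """Allow the gateway switch only if counts, checksums, and smoke tests all agree."""
    return smoke_tests_passed and table_checksum(source_rows) == table_checksum(target_rows)
```

Sorting before hashing means an import that preserves the data but changes row order still validates, while a missing or altered row does not.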

sequenceDiagram
  participant Admin as Tenant Admin
  participant Portal as Portal/API
  participant TenantSvc as Tenant Service
  participant Config as Config/Flags
  participant Billing as Billing Service
  participant Audit as Audit Log
  participant Ops as Ops/SRE

  Admin->>Portal: Sign-up / Edition change
  Portal->>TenantSvc: CreateTenant/Upgrade
  TenantSvc->>Config: Seed baseline flags
  TenantSvc->>Billing: Assign subscription plan
  TenantSvc->>Audit: Record onboarding/migration
  Ops->>TenantSvc: Run synthetic tests
  TenantSvc-->>Portal: Tenant ready (status=Active)


Migration & Cutover Runbooks¶

Pre-Migration Checklist
- Validate backups + PITR windows.
- Confirm tenant metadata (ID, edition, residency).
- Schedule migration window with comms.
- Run dry-run migration in staging.
Cutover Checklist
- Freeze tenant write operations.
- Take backup/snapshot.
- Run export → import pipeline.
- Validate data integrity + run smoke tests.
- Switch DNS/gateway route.
- Monitor telemetry dashboards for anomalies.
Backout Plan
- Rollback DNS/gateway to old infra.
- Restore data snapshot.
- Notify tenant of rollback and ETA for retry.

Communication Templates¶

Pre-Onboarding Email: “Your tenant environment is being provisioned. Expect ~15 minutes before services are active. You will receive confirmation when onboarding completes.”
Migration Notice: “We are upgrading your tenant to a new edition. A brief read-only window will occur from 02:00–02:30 UTC. Data integrity and continuity are guaranteed. In case of issues, we will revert within 30 minutes.”
Completion Notice: “Migration successful. All services are active. Please validate your workflows. If you encounter any issues, contact support with your trace ID.”
Rollback Notice: “We reverted your tenant migration due to anomalies detected. Services remain available on the prior edition. We will reschedule and notify you once stability is confirmed.”

Observability & Readiness¶

Readiness checks: synthetic tenant login, flag evaluation, billing call, event publishing, audit log write.
Telemetry dashboards: show onboarding duration, failure rate, rollback events.
SLOs:
- Onboarding p95 ≤ 10 minutes.
- Migration success ≥ 99.5% without manual intervention.
- Rollback readiness ≤ 30 minutes.

Risks & Mitigations¶

Risk	Impact	Mitigation
Data mismatch post-migration	Tenant disruption	Checksums + synthetic tests before traffic cutover
Rollback failure	Extended outage	Immutable backups, DNS-based rollback, ops runbooks
Communication gaps	Customer dissatisfaction	Standard comms templates; automated notifications
Edition misassignment	Wrong entitlements	Config seeding automation; validation against edition matrix

Solution Architect Notes¶

Treat onboarding as code: workflows defined in automation (Pipelines/Functions), not manual ops.
Every migration must have a rollback plan validated in non-prod.
Prefer progressive cutover (tenant cohorts) over “big bang” migrations.
Expose self-service onboarding APIs but enforce strict tenancy validation and audit logging.

Operations, DR & BCP¶

Purpose¶

Define the operational framework for running the SaaS platform in production, including incident response models, disaster recovery (DR) objectives, business continuity planning (BCP), and continuous improvement loops. Ensure resilience by codifying RTO/RPO targets, runbooks, escalation policies, and post-incident learning practices.

Incident Response Model¶

Principles

Always-on detection: observability signals drive automated alerting.
Severity-driven triage: classify incidents by impact scope (tenant, edition, platform-wide).
Clear ownership: each bounded context has an on-call roster with escalation paths.
Blameless culture: focus on systemic fixes, not individual blame.

Severity Levels

Sev	Example	Response Target
1 (Critical)	Multi-tenant outage, data breach	Respond ≤ 5 min, resolve ≤ 2h
2 (High)	Single-tenant critical service loss	Respond ≤ 15 min, resolve ≤ 4h
3 (Medium)	Performance degradation, delayed jobs	Respond ≤ 1h, resolve ≤ 24h
4 (Low)	Cosmetic UI issue, minor bug	Triage in backlog

Escalation

PagerDuty/Teams alerts → On-call engineer → Escalation to service lead → SRE/Architecture Guild if systemic.
Stakeholder comms via status page + tenant emails.

DR/BCP Targets¶

Disaster Recovery Objectives

RTO (Recovery Time Objective): ≤ 1 hour for core services, ≤ 4 hours for non-critical.
RPO (Recovery Point Objective): ≤ 15 minutes for transactional data (SQL, Service Bus), ≤ 1 hour for logs/telemetry.

Continuity Scenarios

Regional outage: failover to paired Azure region (e.g., West Europe ↔ North Europe).
Service degradation: reroute workloads to ACA fallback if AKS cluster impaired.
Data corruption: restore from PITR backups with integrity checks.
Identity outage: cached tokens + reduced-mode features until IdP restored.

Runbooks & Playbooks¶

Examples

Outage Playbook:
1. Identify impacted tenants/editions.
2. Trigger failover runbook (DNS cutover, service redeploy).
3. Notify stakeholders.
4. Verify SLO restoration.
Data Corruption:
1. Halt writes.
2. Restore PITR snapshot.
3. Re-run integrity checks.
4. Replay DLQ events.
Service Bus Saturation:
1. Scale consumers via KEDA.
2. Move dead-lettered messages to a holding store for replay.
3. Tune retry backoff.
Tenant Migration Failure:
1. Rollback to previous DB/schema.
2. Reassign tenant routing.
3. Notify tenant with comms template.

Runbooks stored in docs/runbooks/*.md; each includes steps, required roles, estimated timings, rollback procedures.

Post-Incident Review (PIR)¶

Held within 72 hours of any Sev-1 or Sev-2 incident.
Format: What happened? What was the impact? What went well? What failed? What do we improve?
Outputs:
- Incident report (public/tenant redacted version + internal detailed version).
- Linked DevOps items for remediation.
- SLA/SLO impact recorded for trend tracking.

Blameless retrospectives mandatory for cultural reinforcement.

Operational KPIs & Continuous Improvement¶

Key Metrics

MTTR (Mean Time to Recovery)
MTTA (Mean Time to Acknowledge)
SLA compliance (%)
Error budget burn rate
Change failure rate (CFR)
Incident recurrence rate
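
Several of these metrics reduce to simple arithmetic over incident and deployment records. The sketch below assumes each incident is a dictionary with `opened`, `acknowledged`, and `resolved` timestamps; those key names and the function names are hypothetical, and the incident list is expected to be non-empty.

```python
from datetime import datetime


def _mean_minutes(pairs) -> float:
    """Average duration in minutes over (start, end) datetime pairs."""
    durations = [(end - start).total_seconds() / 60.0 for start, end in pairs]
    return sum(durations) / len(durations)


def mtta_minutes(incidents) -> float:
    """Mean Time to Acknowledge: opened -> acknowledged."""
    return _mean_minutes([(i["opened"], i["acknowledged"]) for i in incidents])


def mttr_minutes(incidents) -> float:
    """Mean Time to Recovery: opened -> resolved."""
    return _mean_minutes([(i["opened"], i["resolved"]) for i in incidents])


def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Share of deployments that caused a failure, in percent."""
    return 100.0 * failed_deployments / total_deployments
```

Computing the KPIs from raw timestamps, rather than entering them by hand after each incident, keeps the trend dashboards consistent across product teams.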

Continuous Improvement Loop

Collect KPIs + PIR findings.
Update ADRs or runbooks as needed.
Feed back into test automation (synthetic checks for past failures).
Share learnings across product teams in guild sessions.

Observability & Automation¶

SLO dashboards per service, tenant-aware.
Error budget alerts tied to rollout progression (halt if exceeded).
Runbook automation: scripts for DNS cutover, database restore, tenant reroute.
Chaos engineering: scheduled DR drills to validate RTO/RPO in practice.

Risks & Mitigations¶

Risk	Impact	Mitigation
Unclear on-call ownership	Delayed response	On-call schedules automated; service owner registry
DR drills skipped	False sense of security	Mandatory quarterly tests with reports
PIRs not actioned	Repeat incidents	Link PIR action items to DevOps work items with a remediation SLA
Overloaded on-call team	Burnout	Rotations, SLO budgets, escalation to guilds

Solution Architect Notes¶

Treat DR/BCP as code: IaC templates for failover infra, automated DNS cutovers, scripted PITR restores.
Ensure multi-tenant context in all incident workflows (know who’s impacted immediately).
Regularly test rollback and failover—don’t assume Azure SLAs replace DR.
Continuous improvement is the differentiator: every incident should make the platform stronger.

Real-Life Example SaaS Products¶

Purpose¶

Ground the high-level design in practical, scenario-based examples that demonstrate how the factory can generate distinct SaaS offerings using its reusable patterns. These examples highlight flexibility across industries, compliance requirements, and AI-first capabilities.

Example 1: SaaS CRM Lite¶

Target Tenants: SMEs and startups.
Core Contexts Used: Identity, Tenant Management, Config, Notifications.
Data Model: Pooled DB (multi-tenant shared schema with RLS).
Editions:
- Free → 5 users, basic CRM (contacts, tasks).
- Standard → unlimited users, reporting dashboards.
- Enterprise → advanced workflows, SSO, API access.
Extensibility:
- Webhooks for customer lifecycle events.
- Slack and Microsoft Teams integration via outbound events.
- REST APIs for partner add-ons.

Why it works: A simple product leverages the factory’s core scaffolding (identity, multi-tenancy, config) with minimal customization.
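The CRM Lite editions above could be expressed as a small entitlement table that services consult at runtime. This is a sketch only: the feature keys and helper names are hypothetical, and it assumes each edition includes the features of the editions below it, which the edition list implies but does not state.

```python
# max_users of None means "unlimited" in this sketch.
EDITIONS = {
    "free": {"max_users": 5, "features": {"contacts", "tasks"}},
    "standard": {"max_users": None, "features": {"contacts", "tasks", "reporting"}},
    "enterprise": {
        "max_users": None,
        "features": {"contacts", "tasks", "reporting", "workflows", "sso", "api_access"},
    },
}


def can_add_user(edition, current_users):
    """True if the tenant may add one more user under its edition."""
    limit = EDITIONS[edition]["max_users"]
    return limit is None or current_users < limit


def has_feature(edition, feature):
    """True if the edition entitles the tenant to the given feature."""
    return feature in EDITIONS[edition]["features"]
```

For example, `can_add_user("free", 5)` is false because the Free edition is capped at 5 users, while `has_feature("enterprise", "sso")` is true.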

Example 2: Healthcare SaaS (HIPAA Overlay)¶

Target Tenants: Clinics and medical practices.
Overlays Applied: HIPAA compliance, immutable audit trails, PHI encryption, data residency enforcement.
Editions:
- Standard → support for multiple clinics under one tenant.
- Enterprise → per-facility isolation (schema-per-tenant or DB-per-tenant).
Integrations:
- FHIR APIs to connect with EHR systems.
- Secure patient notifications (SMS/email with consent).
- Billing and insurance workflows.
Data Controls:
- Tokenization of PHI in audit logs.
- WORM storage policies for compliance exports.

Why it works: Demonstrates the policy overlays and regulatory guardrails built into the factory (GDPR/HIPAA packs).
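One way to approach the tokenization of PHI in audit logs is keyed hashing, sketched below with Python's standard `hmac` module. Note that this is deterministic pseudonymization rather than vault-based tokenization, and the field names, the `tok_` prefix, and the per-tenant key handling are assumptions made for illustration; a real deployment would load keys from a secrets store.

```python
import hashlib
import hmac

# Hypothetical set of audit-event fields treated as PHI.
PHI_FIELDS = {"patient_name", "ssn", "date_of_birth"}


def tokenize(value, key):
    """Replace a value with a stable, keyed token (HMAC-SHA256, truncated)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:32]


def redact_audit_event(event, key):
    """Return a copy of the event with PHI fields replaced by tokens."""
    return {
        name: tokenize(str(value), key) if name in PHI_FIELDS else value
        for name, value in event.items()
    }


# Placeholder key for the example only; never hard-code keys in practice.
tenant_key = b"example-tenant-key"
event = {"action": "record.viewed", "patient_name": "Jane Doe", "actor": "dr-42"}
redacted = redact_audit_event(event, tenant_key)
```

Because the token is deterministic per key, the same patient maps to the same token within a tenant, so audit entries stay correlatable without exposing the underlying value. Using a different key per tenant keeps tokens from being comparable across tenants.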

Example 3: AI Knowledge SaaS¶

Target Tenants: Universities, training providers, research institutions.
Core Feature: AI-first orchestration via Semantic Kernel.
Capabilities:
- AI tutors (multi-agent orchestration).
- Knowledge search across uploaded materials.
- Context-aware Q&A with tenant-isolated embeddings.
Edition Features:
- Free → limited AI queries (e.g., 50/month).
- Standard → shared embeddings, course recommendations.
- Enterprise → private embeddings, multi-agent orchestration, ingestion of tenant-specific datasets.
Observability:
- Per-tenant AI usage dashboards.
- Quotas to prevent overuse and ensure fair cost allocation.

Why it works: Showcases factory support for AI workloads (semantic orchestration, tenant-aware embeddings, quotas) and highlights the ability to build future-ready products.
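The per-tenant quota idea above can be illustrated with a small in-memory counter. This is a sketch under assumptions: only the Free limit (50 queries per month) comes from the text, the other editions are treated as uncapped purely for the example, and the `QuotaTracker` class is hypothetical. A production version would need a shared, durable store.

```python
from collections import defaultdict

# Only the Free limit is given in the text; None (no cap) is an assumption.
MONTHLY_QUERY_LIMITS = {"free": 50, "standard": None, "enterprise": None}


class QuotaTracker:
    """Counts AI queries per (tenant, month) and enforces the edition cap."""

    def __init__(self):
        self._used = defaultdict(int)

    def try_consume(self, tenant_id, edition, month):
        """Record one query if the tenant is under its limit; return success."""
        limit = MONTHLY_QUERY_LIMITS[edition]
        key = (tenant_id, month)
        if limit is not None and self._used[key] >= limit:
            return False
        self._used[key] += 1
        return True

    def usage(self, tenant_id, month):
        """Queries consumed so far, e.g. for a per-tenant usage dashboard."""
        return self._used.get((tenant_id, month), 0)


tracker = QuotaTracker()
results = [tracker.try_consume("tenant-a", "free", "2025-01") for _ in range(51)]
```

Here the first 50 calls succeed and the 51st is rejected. Keying the counter by month means usage resets naturally when the period rolls over, and the same counts can feed cost allocation.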

Key Takeaways¶

The factory is industry-agnostic: from lightweight CRM to regulated healthcare to AI-first products.
Edition and overlay packs adapt the same baseline platform for different compliance and market needs.
Multi-tenancy and extensibility patterns (webhooks, APIs, events) repeat across all products.
The platform’s design ensures reusability, speed-to-market, and compliance by default.

Conclusion & Summary¶

Why the Factory Exists¶

The ConnectSoft SaaS Factory was conceived to dramatically reduce time-to-market for SaaS solutions while embedding non-negotiable guardrails for security, compliance, and reliability. By abstracting away repeated engineering concerns — identity, tenancy, observability, compliance, CI/CD, FinOps — the factory allows product teams to focus on differentiated business value instead of reinventing the same platform foundations.

Core Pillars¶

The design presented in this HLD is anchored on four cross-cutting pillars:

Security by Design — OAuth2/OIDC, workload identities, encryption everywhere, data classification, policy enforcement.
Multi-Tenancy & Editions — pooled vs. isolated tenancy, per-tenant config/flags, edition overlays, migration paths.
Observability & Resilience — OTel-first telemetry, traces/logs/metrics, error budgets, SLO dashboards, chaos testing.
AI-First Orchestration — Semantic Kernel and Microsoft.Extensions.AI enable agentic workflows, intelligent assistants, and future-ready product capabilities.

Together, these pillars ensure every generated SaaS product is compliant, scalable, and adaptable.

Blueprint Completeness¶

Across the sections, the document has defined:

Vision & Personas → what problems we solve, for whom.
Bounded Contexts & Architecture → how services decompose, integrate, and evolve.
Non-Functional Guarantees → SLOs, privacy, compliance, resilience.
Operational Excellence → CI/CD pipelines, FinOps guardrails, risk registers, DR/BCP.
Practical Examples → how real SaaS products (CRM, Healthcare, AI Knowledge) emerge from the same templates.

This represents a complete high-level blueprint for both the factory platform and the SaaS solutions it produces.

Path Forward¶

This HLD is not static. It is the foundation for implementation, and will evolve through:

ADRs: Architecture Decision Records documenting trade-offs and changes.
Epics & Features: Work planned and executed in Azure DevOps.
Continuous Improvement: Post-incident reviews, PIR-driven ADRs, and refinements to templates and runbooks.

Every new SaaS product built with the factory contributes learnings and enhancements back into the core templates, making the platform stronger with each iteration.

Final Note¶

The ConnectSoft SaaS Factory provides a repeatable, governed, AI-driven model for delivering SaaS at scale. It balances innovation velocity with enterprise-grade guardrails, ensuring that each generated solution is secure, observable, multi-tenant aware, and ready for production from day one.

This document closes the High-Level Design phase and marks the transition to detailed design and implementation.
