
📋 ConnectSoft SaaS Factory — High-Level Design (HLD)


The ConnectSoft SaaS Factory is a generic, template-driven platform for creating SaaS solutions quickly, securely, and consistently.

Unlike traditional projects where each product is designed from scratch, the factory provides paved roads: pre-built architectural templates, infrastructure packs, and automation workflows that allow product teams to focus on business value instead of reinventing technical foundations.

Core pillars:

  • Cloud-Native & Event-Driven: Designed for scalability and resilience.
  • Security-First & Compliance-by-Default: Guardrails embedded in every template.
  • Multi-Tenant & Edition-Aware: Monetization and isolation strategies built in.
  • AI-First Orchestration: Semantic Kernel and agentic workflows accelerate delivery.
  • Observability Everywhere: Traces, logs, metrics included out of the box.

Target Benefits:

  • Cut time-to-market for new SaaS offerings from months to days.
  • Guarantee consistency across solutions, reducing operational cost.
  • Ensure regulatory compliance from the first deployment.
  • Enable innovation by freeing teams from repetitive platform work.

Introduction

Context

Organizations repeatedly face the same engineering challenges when building SaaS solutions: identity, tenancy, observability, compliance, billing, and extensibility. These components are not differentiators, yet they consume the majority of engineering time.

The ConnectSoft SaaS Factory addresses this by codifying best practices into reusable assets — templates, IaC modules, service packs, and architectural blueprints — that teams can compose into product recipes.

Audience

This HLD is intended for:

  • Enterprise Architects: to understand platform scope and evolution.
  • Engineering Leads & Developers: to apply templates and patterns.
  • Product & Business Stakeholders: to validate alignment with goals.
  • Operations & Security Teams: to verify compliance and guardrails.

Scope of Document

  • What this is: A high-level design of a SaaS Factory that produces SaaS solutions.
  • What this is not: A low-level implementation guide or a one-off SaaS product design.

Vision, Outcomes & Scope

Purpose

The ConnectSoft SaaS Factory exists to provide a generic, reusable platform for building and operating SaaS solutions. Instead of treating each SaaS product as a bespoke project, the factory offers a set of templates, architecture patterns, and automation packs that allow teams to bootstrap new SaaS offerings with speed, quality, and compliance already embedded.

This factory approach ensures that:

  • Time-to-market is drastically reduced through pre-built blueprints.
  • Consistency across solutions is maintained, lowering operational overhead.
  • Security and compliance are guaranteed by default, not bolted on later.
  • Observability is always enabled, allowing proactive operations.
  • Multi-tenancy and edition models are first-class, making monetization and scaling straightforward.

Problem Space

Modern SaaS builders face common challenges:

  • Repetition of patterns (auth, tenancy, billing, observability) across every product.
  • Time lost in rebuilding baseline infrastructure and compliance controls.
  • Inconsistent quality when different teams make different architectural choices.
  • High barrier to scaling due to lack of standardized multi-tenancy, resilience, and observability.
  • Slow compliance adoption (GDPR, HIPAA, SOC2) when not embedded from the start.

The SaaS Factory addresses these by offering paved roads: highly opinionated but extensible blueprints.


Target Value Map

| Dimension | Value Proposition | Example Metric |
|---|---|---|
| Speed | Reduce new SaaS product setup from months to days. | Time from product idea → first running environment. |
| Consistency | Unified architecture, code templates, IaC. | % of services following factory patterns. |
| Compliance | Security & privacy controls built-in. | % of audits passed without major findings. |
| Scalability | Ready-to-scale multi-tenant core. | Number of tenants supported per cluster/node. |
| Innovation | AI-first orchestration and automation. | % of backlog items delivered by AI-generated scaffolds. |

High-Level OKRs

Objective 1: Reduce time-to-market for SaaS products.

  • KR1: 80% of new products bootstrapped with factory templates within 1 day.
  • KR2: First production tenant live within 2 weeks of project kickoff.

Objective 2: Ensure security, compliance, and observability by default.

  • KR1: 100% of generated services include OTel, logging, metrics.
  • KR2: 100% of generated services use workload identity + secret-less design.
  • KR3: ≥ 95% compliance checklist coverage at go-live.

Objective 3: Provide consistent developer experience.

  • KR1: 90% of developers report positive DX (survey).
  • KR2: Mean time to onboard new engineer ≤ 2 days.

Explicit Non-Goals

  • Not a one-off platform: This is a factory, not a custom-built product for a single SaaS.
  • Not vendor-locked: Although Azure is primary, abstractions exist for other clouds.
  • Not monolithic: No “mega-service”; all blueprints enforce microservices and modular design.
  • Not compliance-by-documentation: Compliance is enforced technically (policy gates, guardrails), not left to paperwork.
  • Not a silver bullet: Factory does not replace product-specific business logic; teams still own domain innovation.

Guardrails

  • Security & observability cannot be disabled.
  • Multi-tenancy and editioning are always enforced.
  • ADRs required for deviations from factory templates.
  • Paved road approach: 80% covered by factory defaults, 20% extensible through overrides.

Personas, Tenants & Editions

Personas

The platform supports multiple personas across different layers of the SaaS lifecycle. These are not tied to one product but apply generically across all SaaS solutions generated by the factory:

  • Tenant Administrator: Responsible for onboarding, user management, and tenant-level configuration. Uses self-service tooling to manage editions, roles, and features.

  • End User: Consumes product functionality within the context of their tenant. Interacts only with the features and entitlements assigned by edition and policy.

  • Billing & Finance Operator: Oversees subscriptions, invoices, and payments across tenants. Works with dynamic billing models and edition-based monetization.

  • Support Agent: Provides operational and customer support. Needs tenant-specific observability and diagnostic views across editions.

  • Platform Operator (SRE/DevOps): Maintains infrastructure, enforces global guardrails, and ensures SLA adherence across multiple SaaS solutions. Focuses on tenant isolation, scaling, and compliance.

  • Product Owner / Engineering Teams: Define new products, editions, and feature packs inside the system dynamically. Responsible for extending templates to create differentiated SaaS offerings.


Tenant Archetypes

Different tenant profiles are supported dynamically, with no need to hardcode rules in the factory:

  • Trial / Evaluation Tenant: Activated on sign-up, bound to a Free edition, with strict quotas. Ideal for fast adoption and funnel growth.

  • Growth Tenant (SMB / Mid-Market): May purchase Standard or custom editions. Needs integration with external identity, moderate quotas, and cost-effective plans.

  • Enterprise Tenant: Assigned to the Enterprise edition or a custom product pack. Requires SSO, SCIM provisioning, audit exports, custom compliance, and region-specific data residency.


Editions

The factory does not limit editions to a fixed set; editions are defined dynamically in the target product configuration:

  • Free Edition (default template): Minimal feature set, reduced quotas, short SLA. Used as a baseline template for trials.
  • Standard Edition (default template): Full baseline capability pack, medium quotas, typical SLA.
  • Enterprise Edition (default template): Advanced feature pack, premium quotas, enterprise SLA.
  • Custom Editions: Created dynamically by product owners in the system, composed of features, quotas, and entitlements. Example: Healthcare Pack, FinTech Pack.

All editions are described as metadata objects stored in the platform and enforced by policy engines at runtime.
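As an illustration, such an edition metadata object could be modeled as a record plus a runtime policy check. This is a minimal sketch; the type and field names (`EditionDefinition`, `EditionPolicy`, the feature/quota keys) are hypothetical, not the factory's actual schema:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of a dynamically defined edition; stored as metadata
// and evaluated by policy engines at runtime.
public record EditionDefinition(
    string Id,                                   // e.g. "healthcare-pack"
    string DisplayName,
    IReadOnlyDictionary<string, bool> Features,  // feature key -> enabled
    IReadOnlyDictionary<string, int> Quotas,     // quota key -> limit
    string Sla);

public static class EditionPolicy
{
    // Runtime entitlement check: unknown features default to "not entitled".
    public static bool IsEntitled(EditionDefinition edition, string featureKey) =>
        edition.Features.TryGetValue(featureKey, out var enabled) && enabled;
}
```

Because editions are plain metadata, a product owner can add a custom edition without redeploying any service; only the policy engine's inputs change.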


Tenant × Edition × Capability Matrix

Because editions are dynamic, the following matrix represents the default templates provided by the factory. Product owners may extend or override it.

| Capability | Free | Standard | Enterprise | Custom (Dynamic) |
|---|---|---|---|---|
| Multi-Tenancy Isolation | ✓ | ✓ | ✓ | ✓ |
| Identity (OIDC/SSO) | ✓ | ✓ | ✓ (SCIM) | Configurable |
| API Rate Limit (req/min) | 60 | 600 | 3000 | Configurable |
| Feature Flags | Basic | Standard | Advanced | Configurable |
| Observability Dashboards | Shared | Tenant | Advanced | Configurable |
| Data Residency | — | — | Choice | Configurable |
| SLA / Support | — | 8×5 | 24×7 | Configurable |

Top Use Cases

  1. Dynamic Product Creation Product owner defines a new product, editions, and entitlements via admin console or API. The factory applies baseline templates and stores metadata.

  2. Tenant Onboarding Tenant signs up and is provisioned with the default Free edition. Admin can later upgrade or assign a custom edition without re-deployment.

  3. Edition Upgrade / Downgrade Tenant transitions dynamically between editions (e.g., Free → Standard, Standard → Enterprise) without migration downtime.

  4. Custom Edition Rollout A healthcare SaaS product defines a Healthcare Pack with HIPAA compliance, custom storage policies, and audit logging. Tenants can subscribe to this edition dynamically.

  5. Tenant-Specific Support Support agent views tenant entitlements, edition policies, and quotas to resolve feature-availability questions quickly.


Strategic Domains & Context Map

Purpose

Define the bounded contexts that compose the generic SaaS platform, clarify their responsibilities and collaboration style, and set the principles for clean integration across domains. The goal is to maximize autonomy, enable evolution without ripple effects, and keep multi-tenancy, security, and observability as first-class concerns within each boundary.


Bounded Contexts (overview)

| Context | Core Responsibility | Primary Interfaces | Persistence (default) | Tenancy Notes |
|---|---|---|---|---|
| SaaS Core Metadata | System-of-record for Products, Editions, Features, Entitlements, Quotas, Pack composition | REST (admin), Events | Azure SQL | Enforces product/edition metadata; emits entitlement updates |
| Identity | OIDC provider (OpenIddict/AAD), Users, Roles, Scopes, Service Principals | OIDC/OAuth2, REST, Events | Azure SQL | Token carries tenantId, edition, entitlements claims |
| Tenant Management | Tenant lifecycle (provision, activate, suspend, delete), Region/Residency, Tenant config bootstrap | REST, Events | Azure SQL | Owns tenantId; author of cross-context tenant events |
| Billing | Plans, Subscriptions, Invoicing, Usage Rating; Payment provider integrations (via ACL) | REST/Webhooks, Events | Azure SQL | Per-tenant subscription state and entitlements sync |
| Configuration (Config & Feature Flags) | Dynamic flags, settings, edition overrides, kill-switches | REST, Events | Azure SQL + Redis | Per-tenant/edition flag resolution; policy guardrails |
| Usage & Metering | API call meters, storage/compute consumption, quota enforcement | REST (read), Events (write) | Azure SQL (agg) + cold storage | Emits usage events to Billing and Observability |
| Notifications | Email/SMS/Webhook dispatch, templates, tenant branding | REST, Events | MongoDB (templates) + Queue | Tenant-scoped channels; signed webhooks |
| Audit & Compliance | Immutable audit trail, access logs, policy decisions, exports | REST (query), Events (append) | Azure SQL (append-only) | Cross-tenant isolation; long retention & eDiscovery |
| AI Orchestration | Agentic flows (Semantic Kernel), scaffolding, docs/tests generation, safe tool use | REST, Jobs, Events | Blob/Queue for artifacts | Runs under least-privileged scopes; auditable tools |

Notes:

• Storage choices are defaults; contexts can swap with portable equivalents.

• Each context publishes canonical events with required headers (traceId, tenantId, edition, schemaVersion).
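The required headers can be captured in a shared envelope type that every context reuses. A minimal sketch, assuming a hypothetical `EventEnvelope<TPayload>` name; the actual contract library may differ:

```csharp
using System;

// Illustrative canonical event envelope: every cross-context event carries
// these headers in addition to its domain payload.
public record EventEnvelope<TPayload>(
    Guid TraceId,               // correlation across contexts
    Guid TenantId,              // tenancy is mandatory on every event
    string Edition,             // e.g. "standard"
    int SchemaVersion,          // bumped only on breaking changes
    string EventType,           // e.g. "tenant.created.v1"
    DateTimeOffset OccurredAt,
    TPayload Payload);
```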


Integration Styles & Collaboration

  • Event-first collaboration: Domain facts published as immutable events; subscribers build local read models or kick off workflows.
  • REST for command/query where necessary: CRUD/admin operations and strongly consistent reads remain RESTful within the boundary.
  • Webhooks for external extensibility: Billing, Notifications, and Audit expose signed webhooks for third-party systems.
  • Anti-Corruption Layers (ACLs): External payment, messaging, or identity providers are wrapped behind ACLs to protect internal models.
  • Outbox/Inbox patterns: All cross-context messages flow through outbox (producer) and inbox (consumer) for idempotency and reliability.
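The inbox half of that pattern can be sketched with an in-memory deduplication set. This is only an illustration of the idempotency guarantee; in the platform both outbox and inbox are persisted transactionally with domain state (e.g. via MassTransit), and the type names here are hypothetical:

```csharp
using System;
using System.Collections.Generic;

// A message as it leaves a producer's outbox.
public record OutboxMessage(Guid MessageId, string Type, string Body);

// Consumer-side inbox: remembers processed message ids so redeliveries
// (at-least-once transport) are handled exactly once.
public class Inbox
{
    private readonly HashSet<Guid> _processed = new();

    // Returns false when the message was already handled.
    public bool TryConsume(OutboxMessage msg, Action<OutboxMessage> handler)
    {
        if (!_processed.Add(msg.MessageId)) return false; // duplicate delivery
        handler(msg);
        return true;
    }
}
```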

Context Relationships (narrative)

  • SaaS Core Metadata is upstream for Config, Billing, Usage, and Identity by defining products, editions, and entitlements.
  • Tenant Management is the authoritative source of tenantId lifecycle and residency; downstream systems subscribe to tenant.created/updated/deleted events.
  • Billing depends on Usage for rated consumption and on SaaS Core Metadata for plan/entitlement definitions; pushes subscription state changes to Identity (claims) and Config (feature gates).
  • Config consumes product/edition definitions and applies edition overrides and tenant-scoped flags, influencing feature availability platform-wide.
  • Audit subscribes to events from every context and exposes immutable, query-optimized views for operators and compliance exports.
  • AI Orchestration is lateral: it reads contracts and policies, generates scaffolds/PRs, and never mutates domain state without going through the owning context’s APIs.
  • Notifications consumes events across contexts to deliver comms and expose signed webhooks to tenant systems.

Context Map (Mermaid)

flowchart LR
  subgraph Upstream
    SCM[SaaS Core Metadata]
    TEN[Tenant Management]
  end

  IDP[Identity]:::core
  BILL[Billing]:::core
  CONF[Config & Flags]:::core
  USE[Usage & Metering]:::core
  AUD[Audit & Compliance]:::core
  NOTIF[Notifications]:::edge
  AIO[AI Orchestration]:::edge

  TEN -->|tenant.created/updated| IDP
  TEN --> BILL
  TEN --> CONF
  TEN --> USE
  TEN --> AUD
  TEN --> NOTIF

  SCM -->|product/edition/entitlement| BILL
  SCM --> CONF
  SCM --> IDP

  USE -->|usage.recorded| BILL
  BILL -->|subscription.activated/changed| IDP
  BILL -->|entitlements.updated| CONF

  %% Lateral
  AIO -. reads contracts & policies .-> SCM
  AIO -. opens PRs/tests .-> CONF

  %% Observability/Audit taps
  IDP --> AUD
  BILL --> AUD
  CONF --> AUD
  USE --> AUD
  NOTIF --> AUD

  classDef core fill:#0b6,stroke:#094,stroke-width:1,color:#fff;
  classDef edge fill:#357,stroke:#234,stroke-width:1,color:#fff;

Anti-Corruption Layer Patterns

  • Payment Providers (e.g., Stripe, Adyen): Billing implements a Payment ACL translating provider-specific objects (invoices, payment intents) into internal subscription and charge events. Retries and signature validation live in the ACL.

  • External Identity Providers (e.g., Azure AD, Okta): Identity provides a Federation ACL for SSO and SCIM. External attributes are mapped to internal roles/scopes/claims with normalization and validation.

  • Email/SMS Gateways: Notifications uses a Messaging ACL to normalize templates, rate limits, and delivery receipts across providers.


Tenancy & Security Considerations

  • Tenant authority: Tenant Management is the sole issuer of tenantId and region/residency attributes; all contexts must validate inbound tenantId against their read model.
  • Edition enforcement: Identity tokens include edition and derived entitlements; services must authorize using policy checks, not UI hints.
  • Isolation: Data access is constrained by repository guards and (where supported) RLS/filters keyed by tenantId.
  • mTLS & Workload Identity: Cross-context calls occur over mTLS; services authenticate using workload identities, not secrets.
  • Least privilege: Each context exposes minimal operations; event consumers run with scoped permissions.
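A repository guard can be sketched as a query path that always filters by the caller's tenantId (in EF Core this is typically done once with a global query filter via `HasQueryFilter`). The types below (`Invoice`, `TenantScopedRepository`) are illustrative, not the platform's actual data model:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Invoice(Guid TenantId, decimal Amount);

// Every read goes through the tenant filter; callers can never see rows
// belonging to another tenant (mirrors RLS where the database supports it).
public class TenantScopedRepository
{
    private readonly List<Invoice> _store = new();

    public void Add(Invoice invoice) => _store.Add(invoice);

    public IReadOnlyList<Invoice> Query(Guid tenantId) =>
        _store.Where(i => i.TenantId == tenantId).ToList();
}
```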

Observability & SLO Notes

  • Traceability: Every cross-context call/event carries traceId, tenantId, and edition. Spans are named ctx.operation (e.g., billing.rate-usage).
  • Golden signals per context:
    • Identity: token issuance latency p95, failure rate
    • Billing: invoice generation p95, reconciliation lag
    • Config: flag evaluation latency p95, cache hit%
    • Usage: event ingest lag p95, quota enforcement accuracy
    • Audit: write throughput, query latency p95
  • Error budgets: Defined per context with shared budgets for critical flows (e.g., onboarding path TEN→CONF→IDP).

Evolution Principles

  • Independent deployability: Contexts release on their own cadence; contracts evolve via additive changes and versioned topics/paths.
  • Schema evolution: Event versioning via type.suffix.vN; consumers must tolerate unknown fields.
  • Backwards compatibility: Deprecations follow a documented sunset window; ACLs absorb third-party churn.
  • Testing in isolation: Contract and consumer-driven tests validate behavior without requiring full-platform spin-up.
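The "tolerate unknown fields" rule is the tolerant-reader pattern. As a sketch, a v1 consumer deserializing a payload that has gained extra fields simply ignores them (System.Text.Json drops unknown members by default); the event and field names are illustrative:

```csharp
using System;
using System.Text.Json;

// A v1 consumer's view of the event; newer producers may add fields.
public record TenantCreatedV1(Guid TenantId, string Region);

public static class TolerantReader
{
    public static TenantCreatedV1 Read(string json) =>
        JsonSerializer.Deserialize<TenantCreatedV1>(json,
            new JsonSerializerOptions { PropertyNameCaseInsensitive = true })!;
}
```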

Solution Architect Notes

  • Start with SaaS Core Metadata and Tenant Management as foundation; everything else composes around their events.
  • Treat Billing and Config as policy enforcers driven by metadata and subscriptions; resist hardcoding feature switches inside services.
  • Keep AI Orchestration stateless with auditable tool calls; it should propose and scaffold, not silently mutate domain state.
  • Maintain strict Audit boundaries: append-only, immutable identifiers, and explicit export paths with data minimization.

Non-Functional Requirements & SLOs

Purpose

The SaaS Factory must deliver predictable, reliable, and secure foundations for any generated SaaS product. Non-functional requirements (NFRs) set the quality bar across dimensions like performance, availability, scalability, security, privacy, compliance, and portability. Service Level Indicators (SLIs) and Service Level Objectives (SLOs) ensure that these qualities are measured and enforced consistently.


Performance

  • API Latency:
    • p95 read latency ≤ 200 ms
    • p95 write latency ≤ 350 ms
  • Throughput:
    • Minimum 1k requests/minute per tenant on pooled editions
    • Scalable to 50k requests/minute on Enterprise

Performance requirements apply uniformly across generated services, with quotas and scaling managed by edition policies.


Availability

  • Platform Uptime: 99.9% monthly baseline (Enterprise tiers may raise this to 99.95%)
  • Critical APIs (Auth, Tenant Onboarding, Billing): ≥ 99.95%
  • Scheduled Maintenance: Maximum of 4 hours per month, communicated via status channels

Error budgets are defined per tier; for example, Free tenants may tolerate higher downtime than Enterprise tenants.


Scalability

  • Multi-Tenant Growth: Must scale to 10k+ tenants per cluster with pooled database model.
  • Elastic Scaling: Auto-scale services based on CPU/memory utilization and queue lag.
  • Edition-Aware Quotas: Rate limits, storage, and compute quotas enforced dynamically per edition.

Scalability is achieved through horizontal scaling (stateless microservices) and vertical scaling for data workloads.


Security & Privacy

  • Zero Trust Enforcement: All inter-service traffic authenticated with workload identities + mTLS.
  • Data Protection:
    • Encryption at rest (AES-256)
    • Encryption in transit (TLS 1.3)
  • Secrets Management: All secrets stored in Key Vault; secret-less by default with managed identities.
  • Privacy by Design: PII minimized, data masking in logs, erasure supported via APIs.
  • Compliance Baselines: GDPR, SOC2, HIPAA (configurable by product/edition).

Observability

  • Tracing: 100% of requests/events tagged with traceId, tenantId, edition.
  • Metrics: Golden signals for latency, errors, saturation, traffic.
  • Logging: Structured, JSON, PII redacted by default.
  • Dashboards: Predefined for each context; customizable by tenant/edition.

Observability cannot be disabled. It is a core guardrail across all generated services.


Compliance & Portability

  • Compliance-by-Default: All services scaffolded with standard controls.
  • Auditability: Immutable audit logs retained for 7 years (configurable).
  • Portability: Azure-first, but abstractions for SQL → PostgreSQL, Service Bus → Kafka, Key Vault → Vault.

Generated SaaS solutions can be redeployed across clouds without rewriting core services.


SLI/SLO Baseline Table

| Dimension | SLI | SLO (Baseline) | Notes / Error Budget |
|---|---|---|---|
| Performance | p95 read/write latency | ≤ 200 ms / 350 ms | Budget: 5% of requests may exceed |
| Availability | Platform uptime | ≥ 99.9% | Free edition: 99.5% |
| Scalability | Max tenants per cluster | ≥ 10,000 | Quotas per edition |
| Security | % secrets managed via KV/MSI | 100% | No exceptions allowed |
| Privacy | % requests with PII redacted in logs | 100% | Guardrail; cannot be disabled |
| Observability | % requests traced w/ traceId, tenantId | 100% | Mandatory invariant |
| Compliance | Audit record retention | 7 years | Configurable override possible |

Error Budgets

  • Availability: If uptime falls below 99.9%, feature velocity is slowed until SLO compliance restored.
  • Performance: 5% budget for latency breaches; if exceeded, scaling or optimization prioritized.
  • Security/Privacy: No error budget — violations trigger incident severity 1.
  • Observability: No error budget — missing telemetry is considered a defect.
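To make the availability budget concrete, the SLO converts directly into an allowed downtime window. A small worked example, assuming a 30-day month:

```csharp
using System;

public static class ErrorBudget
{
    // Allowed downtime in minutes for a 30-day month at the given SLO
    // (e.g. 0.999 for 99.9%): 30 days * 24 h * 60 min * (1 - SLO).
    public static double MonthlyDowntimeMinutes(double slo) =>
        30 * 24 * 60 * (1 - slo);
}
```

At 99.9% the monthly budget is about 43.2 minutes; raising Enterprise to 99.95% halves it to about 21.6 minutes, which is why stricter tiers demand tighter release and rollback discipline.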

Solution Architect Notes

  • Apply edition-specific SLO overlays (Enterprise gets stricter SLOs than Free/Standard).
  • Use synthetic checks for critical flows (tenant onboarding, login, billing).
  • Embed NFRs into CI/CD as automated quality gates.
  • Reassess SLOs quarterly and align with customer contracts.

Platform Reference Architecture

Purpose

Define a reusable runtime blueprint for the generic SaaS platform across environments. The architecture emphasizes edge security, multi-tenant isolation, event-driven collaboration, and observability by default, while remaining portable between AKS and Azure Container Apps (ACA).


Overview

The platform is partitioned into clear planes and tiers:

  • Edge Plane: Front Door/WAF → API Gateway (YARP) → UI/Portal
  • Control Plane: Identity Provider, Policy/Config, CI/CD, IaC
  • Data & Messaging Plane: Azure SQL (primary), optional MongoDB, Redis, Storage, Service Bus
  • Workload Plane: Core microservices per bounded context (Identity, Tenant, Billing, Config, Usage, Notifications, Audit, AI Orchestration)
  • Observability Plane: OpenTelemetry collectors, Logs, Metrics, Traces, Dashboards/Alerts
  • Jobs Plane: Hangfire/KEDA workers, scheduled tasks, DLQ replayers

C4 Container (Mermaid)

C4Container
title Generic SaaS Platform — Container View

Person(tenantAdmin, "Tenant Administrator")
Person(endUser, "End User")
Person(operator, "Platform Operator / SRE")

Container(webUi, "Web UI / Tenant Portal", "SPA", "OIDC client")
Container(adminUi, "Admin Console", "SPA", "OIDC client")
Container(gateway, "API Gateway", "YARP", "AuthN/Z, routing, rate limits, transforms")
Container(idp, "Identity Provider", "OpenIddict/Azure AD", "Tokens, RBAC/ABAC")
Container(tenantSvc, "Tenant Service", ".NET", "Tenant lifecycle, residency")
Container(billingSvc, "Billing Service", ".NET", "Plans, subscriptions, invoices")
Container(configSvc, "Config & Feature Flags", ".NET", "Flags, edition overrides")
Container(usageSvc, "Usage & Metering", ".NET", "Meters, quotas, aggregates")
Container(notifySvc, "Notifications", ".NET", "Email/SMS/Webhooks via ACL")
Container(auditSvc, "Audit & Compliance", ".NET", "Append-only audit log")
Container(metadataSvc, "SaaS Core Metadata", ".NET", "Products, editions, entitlements")
Container(aiSvc, "AI Orchestration", ".NET + SK", "Agentic scaffolding, tests/docs")
Container(bus, "Service Bus", "Topics/Queues", "Async events, DLQ")
ContainerDb(sql, "Azure SQL", "Relational", "Primary state (RLS/tenant guards)")
ContainerDb(mongo, "MongoDB (optional)", "Document", "Templates & payloads (notifications)")
ContainerDb(redis, "Redis", "Cache", "Flag evaluation, token/claim caches")
ContainerDb(storage, "Storage", "Blob", "Artifacts, exports")
Container(obs, "Observability Stack", "OTel/Prometheus/Grafana/Logs", "Traces, metrics, logs")
Container(jobs, "Jobs Runtime", "Hangfire/KEDA", "Schedulers, DLQ replay")

Rel(tenantAdmin, webUi, "Browser, OIDC")
Rel(endUser, webUi, "Browser, OIDC")
Rel(operator, adminUi, "Browser, OIDC")

Rel(webUi, gateway, "HTTPS")
Rel(adminUi, gateway, "HTTPS")
Rel(gateway, idp, "OIDC flows, token validation")

Rel(gateway, tenantSvc, "mTLS, JWT")
Rel(gateway, billingSvc, "mTLS, JWT")
Rel(gateway, configSvc, "mTLS, JWT")
Rel(gateway, usageSvc, "mTLS, JWT")
Rel(gateway, notifySvc, "mTLS, JWT")
Rel(gateway, auditSvc, "mTLS, JWT")
Rel(gateway, metadataSvc, "mTLS, JWT")
Rel(gateway, aiSvc, "mTLS, JWT")

Rel(tenantSvc, bus, "publish/subscribe")
Rel(billingSvc, bus, "publish/subscribe")
Rel(configSvc, bus, "publish/subscribe")
Rel(usageSvc, bus, "publish/subscribe")
Rel(notifySvc, bus, "consume events")
Rel(auditSvc, bus, "consume events")
Rel(aiSvc, bus, "publish/subscribe")

Rel(tenantSvc, sql, "ADO.NET")
Rel(billingSvc, sql, "ADO.NET")
Rel(configSvc, sql, "ADO.NET")
Rel(configSvc, redis, "Flag cache")
Rel(usageSvc, sql, "ADO.NET")
Rel(notifySvc, mongo, "Drivers")
Rel(auditSvc, sql, "Append-only")
Rel(aiSvc, storage, "Artifacts")

%% Every container exports OTLP traces/metrics/logs to the observability stack
Rel(gateway, obs, "OTLP (traces/metrics/logs)")
Rel(jobs, bus, "Job queues, DLQ replay")

Runtime Topologies

AKS (Kubernetes)

  • Best for large multi-tenant scale, advanced networking, service mesh, and fine-grained autoscaling.
  • Supports mTLS via service mesh, network policies, and workload identities.
  • Suited for Enterprise and products requiring custom sidecars or high-throughput messaging.

ACA (Azure Container Apps)

  • Simpler operational footprint, KEDA-native autoscaling on HTTP/queue metrics.
  • Excellent for SMB/Standard offerings, jobs, workers, and bursty workloads.
  • Can coexist with AKS as a jobs plane (DLQ replayers, batch processors).

Both variants keep the contract and container boundaries identical; choice is an operational concern decided per environment or product.


Network & Trust Boundaries

  • Public Edge: Front Door/WAF terminates TLS; only the Gateway is internet-exposed.
  • Private App Network: All services run in private subnets with deny-by-default policies.
  • Data Network: Databases and storage are accessible via Private Link; no public endpoints.
  • Observability & CI/CD: Access via managed identities and least privilege; audit logs immutable.

Trust boundaries are drawn at the Gateway edge and between Workload Plane ↔ Data Plane. Every call across a boundary uses mTLS and JWT validation (for user/service identity).


Identity & Secrets

  • User/Client Identity: OIDC tokens issued by OpenIddict/Azure AD, including tenantId, edition, entitlements.
  • Workload Identity: Managed identity for services; avoid secret injection; Key Vault for any remaining secrets.
  • Policy Enforcement: Scope/role checks at the gateway and policy filters in services (RBAC/ABAC).
  • Key Rotation: Automated; rotation ≤ 90 days baseline.

Data & Storage Layer

  • Primary: Azure SQL with tenant guards (RLS or repository filters), strict schema ownership per context.
  • Optional: MongoDB for notification templates/payloads; Redis for low-latency flag evaluation and token claims.
  • Artifacts: Blob storage for exports, audit bundles, and AI scaffolding outputs.
  • Retention: Defaults per context (e.g., Audit 7y, Usage raw 90d + aggregates).

Messaging Layer

  • Azure Service Bus: Topics/queues for domain events, sagas, and DLQs.
  • MassTransit: Implements outbox/inbox, retry with jitter, and saga coordination.
  • Contracts: Canonical events with versioning; idempotency keys; signed webhook exports at the edge.
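Webhook signing can be sketched as an HMAC over the payload with a per-subscriber shared secret; the receiver recomputes the signature and compares it in constant time. A minimal sketch with an illustrative `WebhookSigner` type (the platform's actual header names and key management are not shown):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class WebhookSigner
{
    // Producer side: sign the raw payload bytes with the subscriber's secret.
    public static string Sign(string payload, byte[] secret)
    {
        using var hmac = new HMACSHA256(secret);
        return Convert.ToHexString(hmac.ComputeHash(Encoding.UTF8.GetBytes(payload)));
    }

    // Receiver side: recompute and compare constant-time to resist timing attacks.
    public static bool Verify(string payload, byte[] secret, string signature) =>
        CryptographicOperations.FixedTimeEquals(
            Convert.FromHexString(Sign(payload, secret)),
            Convert.FromHexString(signature));
}
```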

Observability Plane

  • OpenTelemetry Everywhere: Traces, logs, metrics with required attributes (traceId, tenantId, edition).
  • Dashboards: Per context and per tenant/edition views; error budgets and SLO tracking.
  • Log Hygiene: Structured JSON, PII redaction by default, correlation with audit records.

Jobs & Scheduling

  • Recurring/Scheduled: Hangfire or ACA jobs with UTC cron; idempotent job keys.
  • Event-Driven: KEDA scales workers off queue length and lag (DLQ replayers, compactions).
  • Observability: Job success/failure metrics, run durations, and retries are first-class signals.
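One way to make job keys idempotent is to derive them from the job name plus its UTC schedule slot, so a duplicated or retried trigger maps to the same logical run. A sketch under that assumption (the `JobKeys` helper and key format are illustrative):

```csharp
using System;

public static class JobKeys
{
    // Same hour slot -> same key, so a re-fired hourly trigger is a no-op
    // when the run for that slot has already been recorded.
    public static string ForHourlySlot(string jobName, DateTimeOffset nowUtc) =>
        $"{jobName}:{nowUtc.UtcDateTime:yyyy-MM-ddTHH}:00Z";
}
```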

High Availability & Scaling

  • Stateless Services: Horizontal Pod Autoscaler (AKS) or KEDA (ACA).
  • Stateful Stores: Active geo-replication (SQL), zone-redundant storage, and backup/restore runbooks.
  • Multi-Tenancy Scaling: Edition-aware quotas; per-tenant throttling at the gateway and policy-based limits in services.
  • Blue/Green & Canary: Gateway routes plus deployment strategies to minimize risk.

Failure Modes & Recovery (selected)

  • Gateway Degradation: Fail closed for auth; serve static maintenance page via Front Door.
  • Bus Backlog: KEDA autoscale; overflow to DLQ; DLQ replay jobs with circuit breakers.
  • DB Hot Partition: Trigger tenant sharding or schema-per-tenant promotion per policy.
  • IdP Outage: Use cached tokens within acceptable TTL; degrade gracefully for non-critical flows.

Solution Architect Notes

  • Start deployments with ACA for simplicity; promote to AKS where fine-grained control or mesh features are required.
  • Keep the Gateway thin; push business decisions into services and policy layers.
  • Enforce mTLS + workload identity ubiquitously; secrets are exceptions, not norms.
  • Make observability non-negotiable: traces/logs/metrics must ship before exposing any public endpoint.

Edge & API Gateway

Purpose

Establish a secure, policy-driven ingress that terminates public traffic, authenticates at the edge, resolves tenants, enforces edition-aware quotas, and steers requests to the correct backend services. The gateway is a custom .NET Core solution built on YARP (reverse proxy) so we can embed ConnectSoft’s tenancy/security/observability invariants and progressive-delivery controls (canary, blue/green).


Ingress Topology & Trust Boundaries

  • Public Edge: Azure Front Door + WAF (TLS 1.3, DDoS protection) → Gateway (internet-facing).
  • Private App Network: Gateway to services over mTLS within a private VNet.
  • Identity Boundary: Gateway is the primary policy enforcement point; it validates OIDC tokens and workload identities, and stamps downstream calls with normalized headers and claims.
  • Observability Boundary: Gateway attaches mandatory correlation and tenancy headers (e.g., x-trace-id, x-tenant-id, x-edition) and emits OTel spans.

sequenceDiagram
  participant C as Client
  participant F as Front Door/WAF
  participant G as API Gateway (.NET+YARP)
  participant I as Identity (OpenIddict/AAD)
  participant S as Backend Service

  C->>F: HTTPS request
  F->>G: Forward with TLS
  G->>I: Validate token / challenge if needed
  I-->>G: JWT (tenantId, edition, scopes)
  G->>G: Tenant resolution (host/header/token), rate-limit, authZ
  G->>S: mTLS + normalized headers (traceId, tenantId, edition)
  S-->>G: Response
  G-->>C: Response (transforms, caching hints)

Request Lifecycle (Edge Policies)

  1. Transport & TLS: Front Door terminates public TLS; Gateway re-terminates internally and enforces HSTS and strict cipher suites.
  2. Authentication: OIDC bearer validation at the gateway; anonymous routes explicitly allow-listed (e.g., /public/*, webhook callbacks).
  3. Tenant Resolution: Precedence: header (x-tenant-id) → hostname subdomain → token claim; ambiguous or missing tenancy is rejected.
  4. Authorization: RBAC/ABAC decision at edge when possible (scope/role/edition checks); fine-grained decisions may be delegated with policy headers.
  5. Quota & Rate Limiting: Edition-aware token-bucket limits with leaky-bucket smoothing; per-tenant counters.
  6. Routing & Versioning: Path, header, or media-type versioning (e.g., /v1/..., Accept: application/vnd.connectsoft.v2+json).
  7. Transforms: Add/remove/normalize headers, response shape harmonization for legacy clients.
  8. Progressive Delivery: Weighted routing for canary and blue/green; circuit-breakers and outlier detection.
  9. Observability: OTel spans with traceId, spanId, tenantId, edition, routeId, backendId; structured logs with PII redaction.

Authentication & Authorization at the Edge

  • Tokens: OIDC JWT with mandatory claims: sub, tenantId, edition, scp (scopes), roles[], entitlements{}.
  • Service-to-Service: Gateway accepts mTLS and/or signed JWT from trusted callers (jobs, webhooks) and issues downstream identities via signed headers plus mTLS to services.
  • Anonymous Access: Explicit allow-list (e.g., health, well-known, webhook receiver). All else requires valid token.
  • Policy Evaluation: Coarse-grained at edge (block early), fine-grained within services (resource-level ABAC). Deny-by-default.

Edition-Aware Rate Limits & Quotas (defaults)

| Edition | Global RPM (per tenant) | Burst | Concurrency | Notes |
|---|---|---|---|---|
| Free | 60 | 120 | 10 | Trial-friendly; strict retries |
| Standard | 600 | 1200 | 50 | Typical workloads |
| Enterprise | 3000 | 6000 | 200 | Prioritized queues & support |
| Custom | Configurable | Configurable | Configurable | Set via edition metadata |

Enforced at edge; mirrored by backend safeguards to prevent “quota bypass.”


Routing, Versioning & Canary

  • Routes: Path-based (/api/tenants/*), tag-based (x-product, x-context), and method-aware (PCI-safe rules for billing).
  • Versioning: Path or content-negotiation; gateway validates supported versions and forwards x-api-version downstream.
  • Canary / Blue-Green: Weighted clusters (e.g., 90/10) and header-based targeting for internal testers (x-canary: true). Automatic rollback on SLO breach (latency/error-rate thresholds).
  • Shadow Traffic (optional): Duplicate a fraction of reads to a shadow backend for safe testing without client impact.
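The weighted split with header-based targeting can be sketched as follows (Python as a language-neutral illustration; `pick_destination` and the weight map are hypothetical names, not a YARP API):

```python
import hashlib

def pick_destination(headers: dict, weights: dict, request_id: str) -> str:
    """Weighted canary selection with a header override.

    Internal testers with `x-canary: true` always hit the canary; everyone
    else is hashed onto the cumulative weight range so routing is stable
    for a given request id (e.g., 90/10 stable/canary).
    """
    if headers.get("x-canary") == "true":
        return "canary"
    total = sum(weights.values())
    # Stable hash of the request id onto [0, total)
    point = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % total
    cumulative = 0
    for destination, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return destination
    return next(iter(weights))  # unreachable when weights sum > 0
```

Automatic rollback on SLO breach would then simply rewrite the weight map to `stable=100`.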

Request/Response Transformations (examples)

  • Inbound: Strip hop-by-hop headers; enforce x-tenant-id; normalize Accept and Content-Type; inject x-correlation-id when missing.
  • Outbound: Map backend error envelopes to a factory-standard error schema; attach cache hints for GETs; remove internal headers.

Resilience at the Edge

  • Timeouts: Sensible route-level timeouts (e.g., 2s internal, 5s external).
  • Retries: Idempotent methods only (GET/HEAD/OPTIONS) with exponential backoff + jitter.
  • Circuit Breakers: Per-destination outlier detection; automatic ejection and gradual recovery.
  • Backpressure: 429 with Retry-After and tenant-specific hints; shed load for Free first.
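The retry rule above (idempotent methods only, exponential backoff with jitter) amounts to the following sketch (`backoff_delays` is a hypothetical helper, not a platform API):

```python
import random

IDEMPOTENT = {"GET", "HEAD", "OPTIONS"}

def backoff_delays(method: str, base: float = 0.1, cap: float = 2.0,
                   attempts: int = 3) -> list:
    """Return retry delays for a request, or [] when retrying is unsafe.

    Exponential backoff with full jitter: delay_n is drawn uniformly from
    [0, min(cap, base * 2**n)]. Non-idempotent methods get no retries.
    """
    if method.upper() not in IDEMPOTENT:
        return []
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```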

Observability (Gateway Signals)

  • Traces: gateway.request, gateway.route.resolve, gateway.authn, gateway.authz, gateway.ratelimit, gateway.proxy.
  • Metrics: Requests/sec by route/tenant/edition, p95/p99 latency, upstream error rate, ejected destinations, rate-limit hits, auth failures.
  • Logs: Redacted request/response summaries with route, tenant, edition, traceId; WAF correlation.

Configuration Templates (illustrative)

YARP Routes & Clusters (weighted canary + transforms)

{
  "ReverseProxy": {
    "Routes": [
      {
        "RouteId": "tenant-api",
        "Match": { "Path": "/api/tenants/{**catch-all}" },
        "Transforms": [
          { "RequestHeader": "X-Correlation-Id", "Set": "{TraceId}" },
          { "RequestHeader": "X-Tenant-Id", "Set": "{TenantId}" },
          { "RequestHeaderOriginalHost": "true" }
        ],
        "ClusterId": "tenant-svc",
        "AuthorizationPolicy": "RequireAuthenticatedUser",
        "RateLimiterPolicy": "EditionAwarePolicy",
        "CorsPolicy": "Default"
      }
    ],
    "Clusters": {
      "tenant-svc": {
        "Destinations": {
          "stable": { "Address": "http://tenant-svc-v1/" },
          "canary": { "Address": "http://tenant-svc-v2/" }
        },
        "LoadBalancingPolicy": "PowerOfTwoChoices",
        "SessionAffinity": { "Enabled": false },
        "HealthCheck": { "Passive": { "Enabled": true } },
        "Metadata": { "CanaryWeights": "stable=90;canary=10" }
      }
    }
  }
}

.NET Rate Limiting (edition-aware)

builder.Services.AddRateLimiter(options =>
{
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;
    options.AddPolicy("EditionAwarePolicy", context =>
    {
        var edition = context.User?.FindFirst("edition")?.Value ?? "Free";
        var (permit, replen, burst) = edition switch
        {
            "Enterprise" => (3000, TimeSpan.FromMinutes(1), 6000),
            "Standard"   => (600,  TimeSpan.FromMinutes(1), 1200),
            _            => (60,   TimeSpan.FromMinutes(1), 120),
        };
        return RateLimitPartition.GetTokenBucketLimiter(
            partitionKey: $"{edition}:{context.Request.Headers["x-tenant-id"]}",
            factory: _ => new TokenBucketRateLimiterOptions
            {
                TokenLimit = burst,
                TokensPerPeriod = permit,
                ReplenishmentPeriod = replen,
                AutoReplenishment = true,
                QueueLimit = 0
            });
    });
});

Program.cs (YARP + OIDC + OTel + mTLS enforcement)

builder.Services.AddAuthentication("Bearer")
    .AddJwtBearer("Bearer", o =>
    {
        o.Authority = builder.Configuration["IdP:Authority"];
        o.TokenValidationParameters.ValidateAudience = false;
        o.MapInboundClaims = false;
    });

builder.Services.AddReverseProxy().LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"));
builder.Services.AddOpenTelemetry().WithTracing(t => t.AddAspNetCoreInstrumentation().AddHttpClientInstrumentation());
app.UseAuthentication();
app.Use(async (ctx, next) =>
{
    // Enforce mTLS from Front Door private link or internal LB
    var cert = ctx.Connection.ClientCertificate;
    if (cert is null || !cert.Verify())
    {
        ctx.Response.StatusCode = StatusCodes.Status403Forbidden;
        return;
    }
    await next();
});
app.UseRateLimiter();
app.MapReverseProxy();

Failure Modes & Playbooks (selected)

  • Token Validation Failures: Return 401/WWW-Authenticate; verify IdP health; enable cached signing keys with TTL; fail closed for sensitive routes.
  • Backend Saturation: Trigger circuit breaker; reduce canary weight to 0; raise 429 with Retry-After.
  • Route Drift / Misconfig: Config lint & contract tests in CI; runtime config reload guarded by feature flag; instant rollback to last-known-good.
  • Tenant Ambiguity: Reject with 400 + problem details; provide diagnostic trace; require explicit x-tenant-id or correct host.

Solution Architect Notes

  • Keep the gateway thin but policy-rich: authN/Z, tenancy, quotas, and traffic shaping; no business logic.
  • Prefer header-based canary steering for internal testing and percentage-based for public rollouts.
  • Make edition-aware rate limiting visible to tenants via headers and usage endpoints.
  • Treat the gateway as a security product: frequent pen tests, strict dependency hygiene, and SBOM/signing in CI.

Identity, Authentication & Authorization

Purpose

Provide a unified, multi-tenant identity plane for users and services. Standardize OAuth2/OIDC flows, token and claim design, RBAC/ABAC policy enforcement, and workload identity for service-to-service calls. Support both a custom .NET (OpenIddict) Identity Provider and external IdPs (Azure AD/Okta) behind a federation boundary.


Trust Boundaries & High-Level Flows

flowchart LR
  U[User/Client App] -->|OIDC| G[Edge Gateway]
  G --> IdP["Identity Provider (OpenIddict/AAD)"]
  G --> Svc[Backend Services]
  IdP -->|JWT (tenantId, edition, scopes, roles)| G
  G -->|mTLS + normalized identity headers| Svc
  Svc -->|policy check (RBAC/ABAC)| Svc

  subgraph Identity Plane
    IdP
  end
  subgraph Workload Plane
    Svc
  end
  classDef boundary stroke-width:2,stroke:#999

Boundaries

  • Public boundary: Clients ↔ Gateway (OIDC/OAuth2); WAF + TLS 1.3.
  • Control boundary: Gateway ↔ Services (mTLS + JWT, workload identity).
  • Identity boundary: Gateway trusts IdP token-signing keys; services trust gateway-issued identity context and validate tokens again internally.

Identity Provider Options

Primary (factory-default): Custom .NET IdP using OpenIddict

  • Supports Authorization Code + PKCE, Client Credentials, Device Code (optional), and Refresh Tokens.
  • Multi-tenant claim issuance; edition/entitlement enrichment from SaaS Core Metadata.
  • Local user store (ASP.NET Identity) plus federation with external IdPs.
  • SCIM 2.0 (Enterprise) for just-in-time provisioning and deprovisioning.

Federated (enterprise option): Azure AD / Okta

  • External IdP via OIDC federation; Federation ACL maps external attributes to internal roles/scopes.
  • Supports SSO, conditional access, MFA, and B2B invites.

Token Model (JWT)

Standard claims

  • sub, iss, aud, exp, iat, nbf

Multi-tenant & edition claims

  • tenantId: authoritative tenant identifier (issued by Tenant Management)
  • edition: plan identifier (e.g., Free, Standard, Enterprise, or custom)
  • entitlements: bag of feature flags/limits at issuance time (digest, not the source of truth)

Authorization claims

  • scp (scopes): API permissions (coarse-grained)
  • roles: high-level roles (e.g., tenant_admin, support_agent, billing_admin)
  • abac: optional attribute set for policy engines (e.g., {"region":"EU","dataClass":"PII"})

Service identity

  • Client Credentials flow issues tokens for service principals with appId, aud, and minimal scopes.
  • Downstream calls are authenticated with mTLS and validated JWT, with identity context propagated via headers (x-tenant-id, x-actor, x-roles, x-scope).

Lifetimes (defaults)

  • Access token: 15 minutes
  • Refresh token: 24 hours (rotating)
  • Client credentials token: 10 minutes
  • Key rotation: ≤ 90 days (automated), JWKS exposed
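A minimal check over a decoded token payload might look like this (sketch only; a real validator also verifies the signature against the IdP's JWKS and checks `iss`/`aud` values — claim names follow this section):

```python
REQUIRED_CLAIMS = {"sub", "iss", "aud", "exp", "tenantId", "edition", "scp"}

def validate_claims(payload: dict, now: int) -> list:
    """Return a list of problems with a decoded JWT payload (empty = OK)."""
    problems = [f"missing claim: {c}" for c in sorted(REQUIRED_CLAIMS)
                if c not in payload]
    leeway = 120  # <= 2 minutes of clock-skew leeway (see Failure Modes)
    if "exp" in payload and payload["exp"] + leeway < now:
        problems.append("token expired")
    if "nbf" in payload and payload["nbf"] - leeway > now:
        problems.append("token not yet valid")
    return problems
```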

RBAC / ABAC Authorization

RBAC (role-based)

  • Roles assigned per tenant; example roles: tenant_admin, member, billing_admin, support_agent, operator.
  • Services enforce role gates for administrative operations.

ABAC (attribute-based)

  • Policies evaluate attributes from token + request context (tenant, edition, resource owner, region, data class).
  • Example: “Users with role support_agent may read logs only when tenant.support_access=true and dataClass != PII.”

Hybrid

  • Coarse authorization at the gateway (scopes/roles).
  • Fine-grained authorization inside services (ABAC over resource attributes).
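The support_agent example above can be expressed as a tiny ABAC predicate (the attribute keys `support_access` and `dataClass` are illustrative, mirroring the policy text):

```python
def can_read_logs(roles: list, tenant_attrs: dict, resource_attrs: dict) -> bool:
    """Service-side ABAC check: support_agent may read logs only when
    tenant.support_access is true and the data class is not PII."""
    if "support_agent" not in roles:
        return False  # RBAC gate first (coarse-grained)
    return (tenant_attrs.get("support_access") is True
            and resource_attrs.get("dataClass") != "PII")
```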

Scope Catalog (illustrative)

| Scope | Audience | Description | Typical Roles |
|---|---|---|---|
| tenant.read | Tenant API | Read tenant profile & settings | tenant_admin, support_agent |
| tenant.manage | Tenant API | Create/Update/Delete tenant resources | tenant_admin |
| billing.read | Billing API | Read subscriptions/invoices | billing_admin, tenant_admin |
| billing.manage | Billing API | Modify plans, payment methods | billing_admin |
| config.read | Config API | Read flags and settings | member, tenant_admin |
| config.manage | Config API | Create/Update flags, overrides | tenant_admin |
| usage.read | Usage API | Read metering/quota | tenant_admin, support_agent |
| audit.read | Audit API | Query audit logs | tenant_admin, operator |
| notify.send | Notify API | Send messages (templatized) | tenant_admin |
| ai.orchestrate | AI API | Invoke agentic flows/tools | tenant_admin, engineer |

Scopes are additive; deprecations follow a sunset policy. Services validate both scope and role where applicable.


Login & Token Issuance (sequence)

sequenceDiagram
  participant B as Browser/App
  participant G as Gateway
  participant I as IdP (OpenIddict/AAD)
  participant M as Metadata (Products/Entitlements)

  B->>G: /authorize
  G->>I: OIDC Auth Code + PKCE
  I-->>B: auth code
  B->>G: code exchange
  G->>I: token request
  I->>M: enrich claims (tenant, edition, entitlements)
  M-->>I: entitlement snapshot
  I-->>G: id_token + access_token + refresh_token
  G-->>B: session established (SPA stores tokens securely)

Notes

  • Enrichment pulls current edition/entitlements at issuance; services must still consult Config for real-time flag evaluation.
  • PKCE & MFA recommended for all first-party SPAs and public clients.

Tenant Resolution & Federation

  • Resolution precedence: x-tenant-id header → subdomain → token claim. Gateway rejects ambiguous requests.
  • Federation: Enterprise tenants may authenticate via external IdPs; federation ACL translates external groups/claims into internal roles/scopes.
  • SCIM (Enterprise): Automates user/role provisioning; deprovision triggers session revocation.

Workload Identity (service-to-service)

  • Managed Identity (AKS/ACA) binds workloads to identities; outbound calls signed at transport (mTLS) and application (JWT).
  • No static secrets in services; Key Vault for exceptional credentials (e.g., third-party webhooks).
  • Downstream identity propagation: services forward correlation and minimal identity context; avoid token forwarding unless necessary.

Security & Privacy Controls

  • Zero Trust: deny-by-default, least privilege, explicit allow-lists for anonymous routes.
  • mTLS: gateway↔service and service↔service; certificate pinning where feasible.
  • PEP/PDP separation: gateway acts as Policy Enforcement Point; services host Policy Decision logic for resource-level checks.
  • PII safety: never write raw PII to logs; redaction at sinks; audit every elevation (admin actions).
  • Consent & Terms: first-class records per tenant; tracked in Audit.

Observability Signals

  • Auth signals: token issuance latency, failed validations, JWKS fetch errors.
  • Access signals: authz denials by route/scope/role, edition-policy mismatches.
  • Federation: IdP health, SCIM drift (orphaned accounts), SSO error rates.
  • Secrets/Certs: rotation age, expiring keys/certs, failed rotations (SEV-1).

Failure Modes & Mitigations

  • IdP outage: cached signing keys and grace tokens for short read-only windows; degrade non-critical flows.
  • Clock skew: NTP enforcement; leeway on nbf/exp validation (≤ 2 minutes).
  • Stale entitlements: tokens carry snapshots; Config is source of truth for runtime decisions; short access-token lifetimes reduce drift.
  • Compromised refresh token: rotate on every use; maintain reuse detection; revoke sessions on suspicion.

Solution Architect Notes

  • Prefer OpenIddict for first-party control and rapid feature iteration; use federation to honor enterprise SSO requirements without coupling domain models to external IdPs.
  • Keep tokens small and short-lived; push dynamic decisions to Config and policy engines.
  • Enforce workload identity + mTLS ubiquitously; treat any secret-based fallbacks as temporary waivers with expiry.
  • Model authorization outside the UI; all decisions must be verifiable at API boundaries and auditable.

Multi-Tenancy Strategy

Purpose

Define how the platform identifies, isolates, and governs tenants across the stack. This section standardizes tenant resolution, isolation levels (pooled/schema/database), configuration/flags enforcement, and onboarding & migration flows so products can scale safely from trials to large enterprises without redesign.


Tenancy Model Overview

  • Tenant as first-class identity: every request, job, event, and data row is associated with exactly one tenantId (or an allowed system actor).
  • Edition-aware policies: quotas, features, and SLO overlays are resolved at runtime per tenant.
  • Security & observability invariants: tenant context is mandatory at ingress, persisted with data, and present on all telemetry.

Tenant Resolution

Resolution precedence (strict):

  1. Header: x-tenant-id (authoritative in service-to-service calls)
  2. Host: subdomain.example.com → tenantId mapping
  3. Path: /t/{tenantId}/... (supported for specific APIs)
  4. Token: JWT claim tenantId (validated but not preferred for multi-tenant APIs)

If the gateway detects ambiguity or mismatch (e.g., header vs host disagree), the request is rejected with a 400 including problem details and a correlation ID.

Resolver flow (edge):

flowchart LR
  A[Request Arrives] --> B{Has x-tenant-id?}
  B -- Yes --> C[Validate & Normalize Id]
  B -- No --> D{Subdomain present?}
  D -- Yes --> E[Lookup mapping -> tenantId]
  D -- No --> F{Path /t/{id}?}
  F -- Yes --> C
  F -- No --> G{Token has tenantId?}
  G -- Yes --> C
  G -- No --> X[Reject 400: tenant_ambiguous]
  C --> H{Tenant active & region allowed?}
  H -- Yes --> I[Attach tenant to context, continue]
  H -- No --> X
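The resolver's strict precedence and ambiguity rejection can be sketched as follows (hypothetical function and parameter names; a real resolver would also check tenant status and region, as in the flow above):

```python
def resolve_tenant(header, subdomain_map, host, path_tenant, token_tenant):
    """Resolve tenantId with strict precedence:
    header -> host subdomain -> path -> token claim.
    Raises when no source is present or when sources disagree
    (the gateway maps this to a 400 with problem details)."""
    sub = host.split(".")[0] if "." in host else None
    candidates = [c for c in (header, subdomain_map.get(sub),
                              path_tenant, token_tenant) if c]
    if not candidates:
        raise ValueError("tenant_ambiguous: no tenant context")
    if len(set(candidates)) > 1:
        raise ValueError("tenant_ambiguous: conflicting tenant sources")
    return candidates[0]
```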

Isolation Levels

| Isolation | Description | When to Use | Data Guarding | Strengths | Trade-offs |
|---|---|---|---|---|---|
| Pooled | Shared schema + tables, tenantId column on all rows | Trials, SMB, moderate scale | Repo guards + RLS (Row-Level Security) | Highest density, lowest cost | Hot-tenant contention; noisy neighbor risk |
| Schema-per-Tenant | Dedicated schema per tenant in same DB | Mid-market, heavier customizations | Schema scoping + connection factory | Easier per-tenant backup/restore; reduced contention | Higher catalog bloat; ops overhead |
| Database-per-Tenant | Dedicated database/server per tenant | Enterprise, regulatory isolation | Network isolation + DB-level IAM | Strongest isolation; independent lifecycle | Highest cost; cross-tenant reporting complexity |

Promotion path: pooled → schema → database, triggered by tenant size, SLO breach risk, or regulatory needs. Promotions are online using CDC-based sync and dual-writes during cutover (see Migration).


Tenancy Enforcement (defense in depth)

| Layer | Enforcement Mechanism | Mandatory Checks |
|---|---|---|
| Gateway | Resolver → inject x-tenant-id; deny ambiguous; edition-aware rate limit | Token validation, tenant status (active/suspended), region allow-list |
| Service API | Policy filters & guards | Require tenant in context; cross-tenant IDs rejected |
| Domain Logic | Tenant-scoped commands/queries | Invariants include tenantId; never accept client-provided cross-tenant references |
| Repository/DAL | RLS or tenant filters; parameterized queries | WHERE tenant_id = @tenantId always; no string-concatenated SQL |
| Messaging | Envelope headers (tenantId, traceId, edition); scoped consumers | Consumers reject missing/foreign tenant headers; per-tenant DLQ segregation |
| Cache | Tenant-scoped keys | cache:{tenantId}:{key}; no shared mutable data |
| Storage/Blob | Tenant prefix & ACLs | tenants/{tenantId}/...; private containers; tenant KMS policies (optional) |
| Observability | Required attributes on spans/logs/metrics | tenantId, edition, traceId present; queries default to tenant scope |
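The Repository/DAL rule (always parameterized, always tenant-filtered) in a runnable miniature — SQLite and the `items` table are stand-ins for the real store:

```python
import sqlite3

def fetch_items(conn: sqlite3.Connection, tenant_id: str) -> list:
    """Repository-level tenancy guard: every query is parameterized and
    always filtered by tenant_id (the `WHERE tenant_id = @tenantId` rule).
    No string-concatenated SQL."""
    return conn.execute(
        "SELECT id, name FROM items WHERE tenant_id = ?", (tenant_id,)
    ).fetchall()

# Minimal demo fixture (illustrative schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT, tenant_id TEXT, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                 [("1", "t-123", "alpha"), ("2", "t-999", "beta")])
```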

Configuration, Flags & Entitlements

  • Resolution order: platform defaults → product defaults → edition pack → tenant overrides → (optional) user context.
  • Flag evaluation: low-latency via cache (Redis) with consistent hashing; cache entries are tenant-scoped and short-lived.
  • Entitlements in tokens: treated as snapshots; definitive decision uses Config at request time for drift-free enforcement.
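The resolution order can be sketched as a last-writer-wins layering (the layer dicts are illustrative stand-ins for Config stores):

```python
def resolve_flag(name, platform, product, edition, tenant, user=None):
    """Layered flag resolution: platform defaults -> product defaults ->
    edition pack -> tenant overrides -> optional user context.
    The last layer that defines the flag wins."""
    value, found = None, False
    for layer in (platform, product, edition, tenant, user or {}):
        if name in layer:
            value, found = layer[name], True
    if not found:
        raise KeyError(f"undefined flag: {name}")
    return value
```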

Data Residency & Regional Routing

  • Residency attribute on tenant (e.g., EU-WEST, US-EAST) selected at onboarding or via enterprise contract.
  • Routing at gateway directs requests to the region’s workload plane; cross-region access is denied unless an explicit policy allows it.
  • Data stores are regionally isolated with Private Link; cross-region replication follows DR policy (RPO/RTO).

Onboarding & Lifecycle

States: requested → provisioning → active → suspended → deleted

sequenceDiagram
  participant U as Tenant Admin
  participant GW as Gateway
  participant TEN as Tenant Mgmt
  participant META as SaaS Core Metadata
  participant CONF as Config/Flags
  participant BILL as Billing
  participant IDP as Identity

  U->>GW: Sign up / create tenant
  GW->>TEN: create_tenant(request)
  TEN->>META: seed_product_edition(entitlements)
  TEN->>CONF: seed_default_flags(tenantId, edition)
  TEN->>BILL: create_subscription(plan)
  BILL-->>TEN: subscription.pending
  TEN->>IDP: provision_realm(tenant claims, roles)
  TEN-->>GW: provisioning_complete
  Note over TEN: State = active
  GW-->>U: Activation success + admin invite

Suspend/Resume/Delete

  • Suspend → revoke sessions; freeze subscription; block writes (read-only mode optional).
  • Delete → two-phase: soft-delete with grace → hard-delete (after retention/erasure workflows).

Migration & Promotion Flows

Use cases

  1. Hot tenant promotion from pooled → schema → database.
  2. Region move for residency or latency.
  3. Edition-driven data shape change (e.g., enabling advanced features).

Approach (pooled → schema/db):

  • Prepare: Create target schema/DB; provision IAM and RLS.
  • Sync: Enable CDC or change feed; backfill historical data; start dual-writes.
  • Cutover: Drain inflight ops; flip tenant connection mapping at resolver; verify read/write health.
  • Finalize: Disable dual-writes; decommission old partition after retention window.

Zero-downtime guardrails

  • Idempotent writes; natural keys stable across partitions.
  • All services obtain tenant-specific connection info via Tenant Directory cache (with TTL and fast invalidation).
  • Feature flag tenant.migration.read_only toggles to protect critical sections.

Observability & SLOs (tenancy-centric)

| Signal | Target | Notes |
|---|---|---|
| Tenant onboarding p95 | ≤ 60s | create → active |
| Resolver failure rate | < 0.01% | ambiguous/missing tenant |
| Cross-tenant access violations | 0 | treated as SEV-1 |
| Promotion cutover duration | ≤ 60s | dual-write window bounded |
| Flag evaluation latency p95 | ≤ 5 ms | local/Redis-backed |

Dashboards include per-tenant views for latency, error rates, quota consumption, and migration progress.


Security & Privacy Notes

  • Tenant authority lives in Tenant Management; all other contexts validate inbound tenantId against their read model.
  • No cross-tenant joins in read models unless explicitly marked “multi-tenant analytics” and routed through safe aggregation pipelines.
  • Erasure support: tenant-owned PII deletions orchestrated via workflow; audit remains immutable with tokenized references.

Failure Modes & Mitigations

  • Ambiguous resolution: 400 with diagnostics; require explicit header; emit audit event.
  • Noisy neighbor: edition-aware throttling at edge; per-tenant queue partitioning; promote isolation level.
  • Stale connection mapping: short TTL + cache bust on migration; fallback to directory lookup.
  • Cross-tenant leak bug: automated tenant-fence tests in CI; runtime guard that verifies returned rows belong to request tenant (sample-based).
  • Region outage: failover only for tenants whose contracts permit cross-region DR; others remain isolated per residency policy.

Solution Architect Notes

  • Start all tenants pooled; automate promotion paths and keep them routine—not exceptional.
  • Prefer RLS where supported; otherwise enforce repository guards and property-based testing to prove scoping.
  • Keep the Tenant Directory authoritative for connection/partition info; never embed static routing in code.
  • Treat tenant context as non-optional telemetry—it’s the first dimension for debugging, scaling, and support.

Event-Driven Backbone & Contracts

Purpose

Establish a canonical, versioned event backbone that connects bounded contexts with loose coupling and reliable delivery. Standardize the event envelope, headers, topics/queues, outbox/inbox patterns, idempotency, and DLQ handling, so teams can ship independently while maintaining a stable integration surface.


Principles

  • Event-first collaboration: Services publish domain facts; consumers react and build local read models.
  • Canonical envelope: All events carry the same required headers; payloads are versioned and backwards-compatible.
  • At-least-once + idempotency: Producers use outbox; consumers use inbox + idempotency keys.
  • Tenant isolation: Events are tenant-scoped by default; cross-tenant payloads are prohibited unless flagged as aggregate analytics.
  • Observable by design: Every event includes telemetry context and is trace-linked to causative actions.

Canonical Envelope (CloudEvents-aligned)

Headers (required)

  • type — semantic event name with version suffix, e.g., tenant.created.v1
  • id — globally unique event id (ULID/GUID)
  • source — service/bounded-context name, e.g., tenant-svc
  • specversion — 1.0
  • time — RFC3339 timestamp
  • traceId — W3C traceparent correlation id
  • tenantId — authoritative tenant identity
  • edition — edition at the time of emission (snapshot)
  • schemaVersion — semantic version of the data payload (e.g., 1.0.0)
  • partitionKey — default tenantId (for ordering at consumer/queue level)
  • key — idempotency key for the business entity / sequence (e.g., subscription id)

Payload (data)

  • Domain-specific fields (no PII unless absolutely required; prefer references and lookups).
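A consumer-side envelope check might look like the following sketch (header names follow this section; this is the kind of envelope validator referenced later under Failure Modes):

```python
REQUIRED_HEADERS = {"type", "id", "source", "specversion", "time", "traceId",
                    "tenantId", "edition", "schemaVersion", "partitionKey", "key"}

def validate_envelope(event: dict, expected_tenant=None) -> list:
    """Return a list of envelope problems (empty = OK). When the consumer is
    tenant-scoped, a mismatched tenantId is flagged (cross-tenant risk)."""
    problems = [f"missing header: {h}"
                for h in sorted(REQUIRED_HEADERS - event.keys())]
    if expected_tenant and event.get("tenantId") not in (None, expected_tenant):
        problems.append("tenant mismatch")
    return problems
```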

Event Bus Topology

flowchart LR
  subgraph Producers
    TEN[Tenant Svc] -->|Outbox| EB(Service Bus Topics)
    BILL[Billing Svc] -->|Outbox| EB
    CONF[Config Svc] -->|Outbox| EB
    USE[Usage Svc] -->|Outbox| EB
    IDP[Identity Svc] -->|Outbox| EB
    NOTIF[Notifications] -->|Outbox| EB
    AUD[Audit Svc] -->|Outbox| EB
  end

  EB -->|Subscriptions| TEN_SUB[Tenancy Subscriptions]
  EB --> BILL_SUB[Billing Subscriptions]
  EB --> CONF_SUB[Config Subscriptions]
  EB --> USE_SUB[Usage Subscriptions]
  EB --> AUD_SUB[Audit Archive]
  EB --> NOTIF_SUB[Delivery Workers]

  classDef svc fill:#0b6,stroke:#094,color:#fff;
  classDef bus fill:#234,stroke:#123,color:#fff;
  classDef sub fill:#357,stroke:#234,color:#fff;

Conventions

  • Topic-per-domain (e.g., tenant-events, billing-events, config-events, usage-events, identity-events, notifications-events, audit-events).
  • Subscription-per-consumer with optional filters (SQL filters on type, tenantId, edition).
  • DLQ per subscription; DLQ contents are immutable and auditable.

Versioning Strategy

  • Event type includes a major payload version (.v1, .v2).
  • Additive changes (new fields) do not bump major; consumers must be tolerant readers.
  • Breaking changes create a new type (tenant.created.v2). Old and new may coexist during migration.
  • Deprecation window announced in contracts; observability verifies consumer adoption.
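Routing on the major version embedded in the type name can be sketched as (`parse_event_type` is a hypothetical helper):

```python
import re

def parse_event_type(event_type: str):
    """Split 'tenant.created.v1' into ('tenant.created', 1). Consumers route
    on the major version; unknown extra payload fields are simply ignored
    (tolerant reader)."""
    m = re.fullmatch(r"(?P<name>[a-z0-9_.]+)\.v(?P<major>\d+)", event_type)
    if not m:
        raise ValueError(f"invalid event type: {event_type}")
    return m.group("name"), int(m.group("major"))
```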

Outbox / Inbox / Idempotency

Outbox (producer)

  • Transactionally stores pending events with business state changes.
  • Background dispatcher publishes to bus with retry/backoff and exactly-once handoff semantics to the bus (effectively at-least-once end-to-end).

Inbox (consumer)

  • Stores processed event ids/keys to de-duplicate.
  • Idempotency key chosen per aggregate (e.g., subscriptionId, userId, flagName@version).

Reentrancy rules

  • Handlers must be idempotent and side-effect-safe.
  • Use sagas for long-running processes; each step commits with an idempotency boundary.
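Inbox de-duplication in miniature (the in-memory set stands in for a durable inbox table that a real consumer would update in the same transaction as its state change):

```python
processed = set()  # stand-in for a durable inbox table

def handle_once(event: dict, handler) -> bool:
    """At-least-once delivery made safe: skip events whose (type, key) pair
    was already processed. If the handler raises, the key is not recorded,
    so redelivery retries the work."""
    inbox_key = f"{event['type']}:{event['key']}"
    if inbox_key in processed:
        return False  # duplicate delivery, safely ignored
    handler(event)
    processed.add(inbox_key)
    return True
```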

DLQ & Replay

  • DLQ contracts: poison messages are never modified; metadata records the failures and handler stack.
  • Replay tools: Operator-driven jobs pull DLQ batches → run through isolation workers with circuit breakers and quarantine on repeated failure.
  • Observability: DLQ depth, age, and replay success rate are first-class metrics.
  • Retention: DLQs retained ≥ 30 days (configurable), Audit retains summary references.

Sample Events (JSON)

1) Tenant Created

{
  "type": "tenant.created.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3C4D",
  "source": "tenant-svc",
  "specversion": "1.0",
  "time": "2025-09-29T10:15:30Z",
  "traceId": "00-7e0d...-01",
  "tenantId": "t-123",
  "edition": "Standard",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "t-123",
  "data": {
    "name": "Acme Ltd",
    "region": "EU-WEST",
    "ownerUserId": "u-789"
  }
}

2) Tenant Updated

{
  "type": "tenant.updated.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3C5E",
  "source": "tenant-svc",
  "specversion": "1.0",
  "time": "2025-09-29T10:17:00Z",
  "traceId": "00-7e0d...-02",
  "tenantId": "t-123",
  "edition": "Standard",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "t-123",
  "data": {
    "changes": {
      "edition": { "old": "Free", "new": "Standard" }
    }
  }
}

3) Subscription Activated

{
  "type": "billing.subscription.activated.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3C6F",
  "source": "billing-svc",
  "specversion": "1.0",
  "time": "2025-09-29T11:00:00Z",
  "traceId": "00-9a1c...-01",
  "tenantId": "t-123",
  "edition": "Enterprise",
  "schemaVersion": "1.1.0",
  "partitionKey": "t-123",
  "key": "sub-5566",
  "data": {
    "subscriptionId": "sub-5566",
    "plan": "Enterprise",
    "startDate": "2025-10-01"
  }
}

4) Usage Meter Recorded

{
  "type": "usage.meter.recorded.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3C7G",
  "source": "usage-svc",
  "specversion": "1.0",
  "time": "2025-09-29T11:05:12Z",
  "traceId": "00-bb2d...-03",
  "tenantId": "t-123",
  "edition": "Enterprise",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "meter:api_calls:2025-09-29T11:05:00Z",
  "data": {
    "meter": "api_calls",
    "amount": 37,
    "windowStart": "2025-09-29T11:05:00Z",
    "windowSizeSec": 60
  }
}

5) Config Flag Updated

{
  "type": "config.flag.updated.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3C8H",
  "source": "config-svc",
  "specversion": "1.0",
  "time": "2025-09-29T11:10:00Z",
  "traceId": "00-cc3e...-01",
  "tenantId": "t-123",
  "edition": "Enterprise",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "flag:betaFeature",
  "data": {
    "flag": "betaFeature",
    "value": true,
    "actor": "u-42"
  }
}

6) User Invited

{
  "type": "identity.user.invited.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3C9I",
  "source": "identity-svc",
  "specversion": "1.0",
  "time": "2025-09-29T11:12:34Z",
  "traceId": "00-11aa...-01",
  "tenantId": "t-123",
  "edition": "Standard",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "u-42",
  "data": {
    "userId": "u-42",
    "email": "ada@example.com",
    "roles": ["member"]
  }
}

7) Notification Delivered

{
  "type": "notify.message.delivered.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3D0J",
  "source": "notifications-svc",
  "specversion": "1.0",
  "time": "2025-09-29T11:20:00Z",
  "traceId": "00-22bb...-05",
  "tenantId": "t-123",
  "edition": "Standard",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "msg-9012",
  "data": {
    "messageId": "msg-9012",
    "channel": "email",
    "template": "welcome",
    "status": "delivered"
  }
}

8) Audit Action Logged

{
  "type": "audit.action.logged.v1",
  "id": "01HZXZ0N4Q6T3V3Y1W1A2B3D1K",
  "source": "audit-svc",
  "specversion": "1.0",
  "time": "2025-09-29T11:22:10Z",
  "traceId": "00-33cc...-07",
  "tenantId": "t-123",
  "edition": "Standard",
  "schemaVersion": "1.0.0",
  "partitionKey": "t-123",
  "key": "audit:2025-09-29:01",
  "data": {
    "actor": "u-42",
    "action": "TENANT_UPDATE",
    "resource": "Tenant/t-123",
    "result": "success"
  }
}

Contract Governance

  • Source of Truth: contracts/events/*.json with schema definitions and examples.
  • Review Process: Producer PR must include schema update + changelog; consumer teams subscribe to contract watch alerts.
  • Linting: CI validates envelope headers, allowed field names, and version policies.
  • Deprecation: Old types remain for a defined window; producers publish both old/new until consumers confirm adoption.

Observability & SLOs

| Signal | Target |
|---|---|
| Event publishing success | ≥ 99.99% |
| End-to-end lag p95 | ≤ 60s (producer commit → consumer handle) |
| Replay success rate | ≥ 99% |
| DLQ age p95 | ≤ 15m |
| Duplicate handling incidents | 0 (idempotent consumers) |

Spans include producer.service, consumer.service, type, tenantId, key. Metrics cover topic depth, subscription lag, DLQ size/age, and handler error rates.


Failure Modes & Mitigations

  • Duplicate deliveries: Inbox + idempotent handlers; use entity key to ignore repeats.
  • Schema drift: Tolerant readers; canary consumers validated in pre-prod; contract tests in CI.
  • Bus outage/backlog: Producer backpressure, KEDA scale-out of consumers, DLQ thresholds with alerting.
  • Poison messages: Quarantine to DLQ on max attempts; root-cause analysis required before replay.
  • Cross-tenant leakage: Envelope validator rejects events with a missing or mismatched tenantId; every violation is audited as SEV-1.

Solution Architect Notes

  • Prefer topic-per-domain plus subscription-per-consumer to keep ownership clear.
  • Treat events as write-optimized facts; resist synchronous request/response coupling.
  • Keep payloads lean and stable; link to resources rather than embedding large objects.
  • Make replay safe by ensuring handlers are pure functions over input + idempotent side effects.

Service Taxonomy & Interfaces

Purpose

Define the platform service catalog aligned with bounded contexts, including responsibilities, exposed interfaces (REST/events/webhooks), storage choices, SLO posture, and cross-cutting constraints (multi-tenancy, security, observability). Provide component-level sketches for representative services to guide implementation.


Catalog by Bounded Context

| Context | Service | Core Responsibility | Interfaces | Persistence (default) | Notes |
|---|---|---|---|---|---|
| SaaS Core Metadata | Metadata API | Products, editions, features, entitlements; pack composition | REST (admin), Events | Azure SQL | Upstream for Config, Billing, Identity |
| Identity | IdP (OpenIddict) | OIDC/OAuth2, roles/scopes, federation | OIDC, REST, Events | Azure SQL | SCIM (Enterprise) |
| Tenant Management | Tenant API, Tenant Worker | Tenant lifecycle, residency, directory, promotions | REST, Events, Jobs | Azure SQL | Authoritative tenantId |
| Billing | Billing API, Billing Saga | Plans, subscriptions, invoices, payment provider ACL | REST, Events, Webhooks | Azure SQL | Sagas orchestrate payments |
| Config & Feature Flags | Config API, Flag Evaluator | Flags, edition overrides, kill switches | REST, Events | Azure SQL + Redis | Low-latency evaluation |
| Usage & Metering | Usage Ingest, Usage Query | Meter capture, quota checks, aggregates | Events (in), REST (read) | Azure SQL (+ cold storage) | Emits usage.meter.recorded |
| Notifications | Notify API, Delivery Worker | Email/SMS/Webhooks, templates, branding | REST, Events, Webhooks | MongoDB + Queue | Provider ACLs |
| Audit & Compliance | Audit Append, Audit Query | Immutable append-only log, exports | Events (append), REST (read) | Azure SQL | Long retention |
| AI Orchestration | AI Orchestrator, Agent Workers | Agentic flows, scaffolding, test/doc generation | REST, Jobs, Events | Blob/Queue | Guardrails & audit |

Interface conventions

  • REST: external/administrative commands and strongly consistent queries.
  • Events: domain facts, outbox/inbox; topic-per-domain.
  • Webhooks: signed outbound notifications for tenant systems (Billing/Notify).
  • Jobs: KEDA/Hangfire for scheduled tasks and DLQ replayers.

Cross-Cutting Constraints

  • Multi-tenancy: All operations scoped by tenantId. Repository layer enforces RLS/guards.
  • Security: OIDC at edge, mTLS inside, least privilege, no static secrets (workload identity).
  • Observability: OTel spans/logs/metrics with tenantId, edition, traceId; dashboards per service.
  • SLO posture (baseline): 99.9% availability; p95 read ≤ 200 ms / write ≤ 350 ms; event lag p95 ≤ 60 s.

Exemplar 1 — Metadata API (SaaS Core Metadata)

Responsibilities

  • Manage Products, Editions, Features, Entitlements, Quotas, and Pack composition.
  • Act as source of truth for billing plans, entitlement catalogs, and edition overlays.
  • Publish canonical events when definitions change (e.g., metadata.product.created.v1, metadata.edition.created.v1).

Exposed Interfaces

  • REST (admin/product owner):
    • POST /api/metadata/products (create product)
    • POST /api/metadata/editions (define edition)
    • POST /api/metadata/features (define feature/entitlement)
    • GET /api/metadata/products/{id}
  • Events (producer):
    • metadata.product.created.v1
    • metadata.edition.created.v1
    • metadata.feature.updated.v1

Storage

  • Azure SQL: relational model for products, editions, features, entitlements.
  • Immutable history tables for auditability.

Component Sketch

flowchart LR
  API[Metadata API] --> APP[App Layer / Validation]
  APP --> REPO[Metadata Repository]
  REPO --> SQL[(Azure SQL)]
  APP --> OUTBOX[Outbox Dispatcher]
  OUTBOX --> BUS[[Service Bus]]
  style API fill:#7d5dfc,stroke:#4b33b3,color:#fff

Security & Tenancy

  • Administrative operations require metadata.manage scope, reserved for platform/product owners.
  • Queries may be global (cross-tenant), but tenant-facing tokens only receive read-scoped access to permitted metadata.

Observability

  • Spans: metadata.createProduct, metadata.updateEdition.
  • Metrics: product definition latency, cache hit ratio for entitlement lookups.

Exemplar 2 — Tenant Service (Tenant Management)

Responsibilities

  • Create/activate/suspend/delete tenants.
  • Manage residency, directory, isolation level (pooled/schema/db).
  • Seed defaults (edition, flags) and emit tenant.* events.

Exposed Interfaces

  • REST (admin):
    • POST /api/tenants (create)
    • PATCH /api/tenants/{id} (suspend/resume)
    • POST /api/tenants/{id}:promote (upgrade isolation level)
  • Events: tenant.created.v1, tenant.updated.v1, tenant.promoted.v1

Storage

  • Azure SQL (authoritative tenant table, residency, isolation level).
  • Redis (Tenant Directory cache).

Component Sketch

flowchart LR
  API[Tenant API] --> APP[Policies / Directory]
  APP --> REPO[Tenant Repository]
  REPO --> SQL[(Azure SQL)]
  APP --> OUTBOX[Outbox -> Bus]
  APP --> JOBS[Promotion Worker]
  JOBS --> SQL
  style API fill:#1f6,stroke:#0b5,color:#fff

Exemplar 3 — Billing Service (with Saga Orchestrator)

Responsibilities

  • Manage plans/editions, subscriptions, invoicing.
  • Integrate with payment providers via Payment ACL.
  • Coordinate long-running billing flows (activation, retries, dunning) via Saga.

Exposed Interfaces

  • REST (admin/tenant):
    • POST /api/billing/subscriptions (create/upgrade/downgrade)
    • GET /api/billing/subscriptions/{id}
    • POST /api/billing/invoices/{id}:pay
  • Events (producer): billing.subscription.activated.v1, .suspended.v1, .invoiced.v1, .payment.received.v1
  • Events (consumer): usage.meter.recorded.v1, tenant.created.v1
  • Webhooks (outbound): signed invoice.created, payment.failed

Storage

  • Azure SQL (subscriptions, invoices, ledger)
  • Optional blob for invoice PDFs

Component Sketch (Saga)

flowchart TB
  CMD[Billing API] --> SM[Subscription Saga]
  SM --> ACL[Payment Provider ACL]
  SM --> OUTBOX[Outbox]
  SM --> SUBREPO[Subscription Repo]
  ACL --> PAYEXT[(Payment Gateway)]
  SUBREPO --> SQL[(Azure SQL)]
  OUTBOX --> BUS[[Service Bus]]
  BUS -->|usage.meter.recorded| SM
  style SM fill:#f8a,stroke:#c06,color:#222
  style BUS fill:#357,stroke:#234,color:#fff

Saga Flow (high-level)

  1. Receive subscription.create.
  2. Reserve plan; request payment via ACL.
  3. On success: persist, publish subscription.activated.
  4. On failure: retry (exponential backoff) → dunning → subscription.suspended.

Security & Tenancy

  • All commands must include tenantId; ACL enforces signature verification with provider.
  • Monetary amounts validated server-side; no client-trusted totals.

Observability

  • Spans: billing.saga.step.*, payment.acl.request.
  • Metrics: authorization approval rate, dunning success rate, event lag.

Exemplar 4 — Config & Feature Flags Service

Responsibilities

  • Store and evaluate feature flags, edition overrides, kill switches.
  • Provide low-latency decision APIs for UI/services.
  • Broadcast changes as events for cache invalidation.

Exposed Interfaces

  • REST:
    • GET /api/config/flags/{key} (evaluate with context)
    • POST /api/config/flags (create/update)
    • POST /api/config/overrides (tenant/edition-specific)
  • Events (producer): config.flag.updated.v1, config.override.updated.v1
  • Events (consumer): billing.subscription.activated.v1, tenant.created.v1

Storage

  • Azure SQL (authoritative flag definitions, overrides)
  • Redis (evaluation caches with tenant scoping)

Component Sketch

flowchart LR
  API[Config API] --> APP[Evaluator/Policy Engine]
  APP --> CACHE[(Redis)]
  APP --> REPO[Config Repo]
  REPO --> SQL[(Azure SQL)]
  APP --> OUTBOX[Outbox -> Service Bus]
  style API fill:#19a974,stroke:#0e7a55,color:#fff

Security & Tenancy

  • Mutations require config.manage + tenant_admin role.
  • Evaluation requires authenticated context; anonymous evaluation disabled except for public flags.

Observability

  • Spans: config.evaluate, config.invalidate.
  • Metrics: p95 evaluation latency ≤ 5 ms, cache hit%, invalidation fanout time.

Interface Outlines (selected contracts)

Tenant API

  • POST /api/tenants → 201 Created + tenant.created.v1
  • PATCH /api/tenants/{id} → 200 OK + tenant.updated.v1
  • POST /api/tenants/{id}:promote → 202 Accepted (async job)

Billing API

  • POST /api/billing/subscriptions → 202 Accepted (saga) + events
  • GET /api/billing/subscriptions/{id} → 200 OK

Config API

  • GET /api/config/flags/{key}?tenantId=... → 200 OK { "value": true, "reason": "tenant-override" }
  • POST /api/config/flags → 201 Created + config.flag.updated.v1
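The "tenant-override" reason in the evaluation response implies a precedence order when resolving flags. A minimal sketch, assuming most-specific-wins (tenant override, then edition override, then global default); the function signature and store shapes are illustrative:

```python
def evaluate_flag(flag_key, defaults, edition_overrides, tenant_overrides,
                  tenant_id, edition):
    """Resolve a flag with assumed most-specific-wins precedence:
    tenant override > edition override > global default."""
    if (tenant_id, flag_key) in tenant_overrides:
        return {"value": tenant_overrides[(tenant_id, flag_key)],
                "reason": "tenant-override"}
    if (edition, flag_key) in edition_overrides:
        return {"value": edition_overrides[(edition, flag_key)],
                "reason": "edition-override"}
    return {"value": defaults.get(flag_key, False), "reason": "default"}
```

The returned reason string mirrors the response shape shown above and is useful for debugging why a tenant sees a given value.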

Solution Architect Notes

  • Metadata API acts as upstream catalog: Billing, Config, and Identity enrichments must not hardcode editions or features.
  • Always seed tenant entitlements from Metadata events → Config → Identity claims.
  • Keep Metadata auditable and append-only where possible; edition/feature definitions should be traceable across time.
  • Treat Metadata changes as high-risk operations; enforce RBAC + approval workflows.

Data & Storage Architecture

Purpose

Define a portable, Azure-first data architecture that supports multi-tenancy, high throughput, and auditability without locking products into one storage technology. Standardize relational vs document choices, partitioning strategies, read models, caching/search integration, and retention/archival so generated solutions can scale from trials to large enterprises with predictable cost and reliability.


Store Selection Principles

| Concern | Default Choice | Alternatives | Rationale |
|---|---|---|---|
| System-of-record, strong consistency | Azure SQL (PostgreSQL compatible alternative) | Managed Postgres/MySQL | ACID, schema control, transactional outbox |
| Large/variable payloads (templates, messages) | MongoDB (optional) | Cosmos DB (Mongo API), Azure Blob | Flexible schema, document access |
| Caching / fast eval (flags, sessions) | Redis | In-memory per pod (with eviction) | Low-latency, TTL, pub/sub invalidation |
| Search / discovery | Azure AI Search (opt-in) | Elastic/OpenSearch | Full-text, facets, suggesters |
| Analytics / cold storage | Blob Storage (Parquet) | ADLS Gen2 | Cheap, durable, columnar query via Spark/SQL |
| Event backbone | Azure Service Bus | Kafka | Durable pub/sub, topics, DLQs |

Rule of thumb: Write models → relational, large/optional payloads → document/blob, query flexibility → search/read models, analytics/retention → blob.


Logical Data Model (core entities)

erDiagram
  TENANT ||--o{ USER : has
  TENANT ||--o{ SUBSCRIPTION : owns
  TENANT ||--o{ CONFIG_FLAG : configures
  TENANT ||--o{ USAGE_RECORD : generates
  TENANT ||--o{ AUDIT_EVENT : emits

  PRODUCT ||--o{ EDITION : offers
  EDITION ||--o{ ENTITLEMENT : contains
  SUBSCRIPTION }o--|| EDITION : references
  SUBSCRIPTION }o--|| TENANT : belongs_to
  ENTITLEMENT }o--|| FEATURE : grants

  USER {
    string userId PK
    string tenantId FK
    string email
    string roleSet
  }
  TENANT {
    string tenantId PK
    string name
    string region
    string isolationLevel "pooled|schema|database"
    string status "active|suspended|deleted"
  }
  PRODUCT {
    string productId PK
    string name
  }
  EDITION {
    string editionId PK
    string productId FK
    string name "Free|Standard|Enterprise|Custom"
  }
  FEATURE {
    string featureId PK
    string name
  }
  ENTITLEMENT {
    string entitlementId PK
    string editionId FK
    string featureId FK
    json   limits
  }
  SUBSCRIPTION {
    string subscriptionId PK
    string tenantId FK
    string editionId FK
    datetime startDate
    datetime endDate
    string status "active|suspended|canceled"
  }
  CONFIG_FLAG {
    string flagKey PK
    string tenantId FK
    json   value
    string scope "global|edition|tenant|user"
    datetime updatedAt
  }
  USAGE_RECORD {
    string usageId PK
    string tenantId FK
    string meter
    bigint amount
    datetime windowStart
    int windowSec
  }
  AUDIT_EVENT {
    string auditId PK
    string tenantId FK
    string actor
    string action
    string resource
    datetime occurredAt
    json   details
  }

The Metadata (Product/Edition/Feature/Entitlement) drives Subscription and Config decisions; Usage feeds Billing; Audit is append-only.


Physical Partitioning & Isolation

Per-Context baseline

| Context | Physical Store | Partition Key | Secondary Partition | Notes |
|---|---|---|---|---|
| Identity / Tenant / Billing / Config | Azure SQL | tenantId | id per table | RLS/tenant guards or DAL filters |
| Notifications (payloads/templates) | MongoDB | tenantId | templateId | Large blob-like docs, TTL indices |
| Usage (raw) | Azure SQL (hot) + Blob (cold) | tenantId + time | meter | Hot window (≤ 90 days), compaction |
| Audit | Azure SQL (append-only) | tenantId + time | actor | Immutable; export to blob for eDiscovery |

Isolation levels

  • Pooled (default): all tenants share schema, enforced by RLS/guards; indexed on (tenantId, <business key>).
  • Schema-per-tenant: separate schema for hot tenants; connection factory selects schema based on Tenant Directory.
  • Database-per-tenant: separate DB + network isolation for enterprise/regulatory cases.
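The three isolation levels can be routed by a small connection factory keyed off the Tenant Directory entry. A sketch under assumed names (the DSN strings and schema prefix are illustrative, not platform constants):

```python
# Assumed shared pool DSN for pooled and schema-per-tenant isolation.
POOLED_DSN = "Server=pool-sql;Database=platform"

def resolve_connection(tenant: dict) -> dict:
    """Pick connection target and schema from the tenant's isolation level."""
    level = tenant["isolationLevel"]  # pooled | schema | database
    if level == "pooled":
        # Shared schema; tenant scoping enforced by RLS/guards.
        return {"dsn": POOLED_DSN, "schema": "dbo"}
    if level == "schema":
        # Dedicated schema inside the shared database.
        return {"dsn": POOLED_DSN, "schema": f"t_{tenant['tenantId']}"}
    if level == "database":
        # Dedicated database, typically with network isolation.
        return {"dsn": f"Server=sql-{tenant['tenantId']};Database=tenant",
                "schema": "dbo"}
    raise ValueError(f"unknown isolation level: {level}")
```

The point of the factory is that application code never branches on isolation level itself; promotion only updates the directory entry.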

Promotion triggers

  • Hot partition (p99 latency, lock contention), data volume thresholds, contractual isolation, or regulatory residency.

Read Models & CQRS

  • Write models: normalized, transactional tables per bounded context.
  • Read models: denormalized projections built from events for UX queries and support dashboards (e.g., “tenant overview,” “subscription health,” “usage summary”).
  • Projections: idempotent consumers with inbox and checkpointing; rebuild on demand.
  • Search adapters: project to Azure AI Search indices (e.g., tenants, invoices, audit summaries) with indexer jobs and soft deletes.
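A projector with inbox-based idempotency and a checkpoint, as described above, can be sketched like this (in-memory structures stand in for real tables; the class shape is illustrative):

```python
class Projector:
    """Idempotent projection consumer with an inbox and a checkpoint."""

    def __init__(self):
        self.seen = set()        # inbox: ids of already-applied events
        self.checkpoint = None   # last applied event id
        self.read_model = {}     # denormalized projection, keyed by entity

    def handle(self, event: dict) -> bool:
        """Apply one event; return False for duplicate deliveries."""
        if event["id"] in self.seen:
            return False                           # duplicate: ignore safely
        self.read_model[event["key"]] = event["data"]  # apply projection
        self.seen.add(event["id"])
        self.checkpoint = event["id"]
        return True

    def rebuild(self, events):
        """Replay the full stream; idempotent handling makes this safe."""
        self.seen.clear()
        self.read_model.clear()
        for e in events:
            self.handle(e)
```

Rebuild-on-demand then reduces to clearing state and replaying the topic from the beginning.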

Indexing & Query Patterns

  • Composite indices: (tenantId, naturalKey) as leading index across write tables.
  • Time-series: USAGE_RECORD (tenantId, windowStart DESC) for sliding windows; partitioned aggregation tables for hourly/daily rollups.
  • Audit: (tenantId, occurredAt DESC) covering actor, action, resource.
  • Avoid cross-tenant joins; analytics should use aggregations in a separate pipeline.

Caching Strategy

| Cache | Scope & Key | Invalidation | TTL |
|---|---|---|---|
| Tenant Directory | tenant:{tenantId}:dir | on tenant.updated/promoted | 5–30 s |
| Flag Evaluation | flag:{tenantId}:{flagKey} | config.flag.updated | 30–120 s |
| Entitlement Snapshot | ent:{tenantId}:{editionId} | billing.subscription.* / metadata.* | 5 m |
| OIDC JWKs & Metadata | idp:jwks | rotation events | 15 m |
  • Prefer cache-aside; never cache PII without encryption.
  • Use Redis hash keys for compact, multi-field storage and partial invalidation.
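Cache-aside with a jittered TTL (the same jitter that mitigates cache stampedes) can be sketched as follows; the API and the default values are illustrative assumptions:

```python
import random
import time

class CacheAside:
    """Cache-aside with jittered TTLs to avoid synchronized expiry."""

    def __init__(self, loader, ttl_seconds=30, jitter_seconds=5):
        self.loader = loader          # read-through source on cache miss
        self.ttl = ttl_seconds
        self.jitter = jitter_seconds
        self.store = {}               # key -> (value, expires_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.store.get(key)
        if entry and entry[1] > now:
            return entry[0]                            # cache hit
        value = self.loader(key)                       # miss: load from source
        expires = now + self.ttl + random.uniform(0, self.jitter)
        self.store[key] = (value, expires)             # jittered expiry
        return value
```

The injected `now` parameter exists only to make the sketch testable; a production cache would rely on Redis TTLs instead.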

Data Retention & Archival

| Data Class | Hot Retention | Cold Retention | Storage | Notes |
|---|---|---|---|---|
| Usage (raw) | 90 days | 2 years (Parquet) | Blob | Aggregated hourly/daily kept hot |
| Audit | 1 year | 7 years | SQL + Blob | Immutable, exportable bundles |
| Notifications payloads | 30 days | N/A | Mongo | Store minimal PII; tokenize where possible |
| Invoices/PDFs | 1 year | 7 years | SQL + Blob | Legal/compliance governed |
| Config history | 180 days | N/A | SQL | Versioned changes for debugging |
| Tokens/session | 24 h | N/A | Redis | No PII; rotate frequently |

Retention windows are edition- and region-configurable; legal holds suspend deletion jobs.


Backup, DR & Consistency

  • SQL: PITR enabled; geo-redundant backups; weekly full + daily diff + 5-min log backups (policy baseline).
  • Mongo: point-in-time snapshots; validate TTL indexes.
  • Blob: versioning and soft-delete enabled; lifecycle rules to move to cool/archive tiers.
  • RPO/RTO: platform baseline RPO ≤ 15 min, RTO ≤ 4 h (overrides for Enterprise).
  • Consistency: outbox ensures atomic write + event; consumers are at-least-once, idempotent.

Security & Privacy

  • Encryption at rest: TDE for SQL, SSE for Blob, disk encryption for Mongo; CMEK where required.
  • Encryption in transit: TLS 1.2/1.3; mTLS inside cluster; Private Link for data stores.
  • PII minimization: store only necessary attributes; logs are redacted at sink.
  • Row-Level Security (RLS): preferred; otherwise enforce DAL guards and property-based tests.
  • Secrets: no inline secrets in tables; use Key Vault references; rotate keys ≤ 90 days.

Data Lifecycle & Governance

  • Schemas as code: migrations via EF Core/NHibernate + migration approvals (gated in CI).
  • CDC: used for online migrations, projections rebuilds, and promotion (pooled→schema/db).
  • Data quality checks: constraints + lightweight DQ jobs (nullability, referential integrity, outliers).
  • Change review: ADR required for breaking schema changes; contract tests validate read models.
  • Right to erasure: orchestrated delete with tombstones; audit holds tokenized references.

Example Physical Topology (simplified)

flowchart LR
  subgraph Hot Path
    SQL[(Azure SQL\nWrite Models)]
    REDIS[(Redis Cache)]
    BUS[[Service Bus]]
  end

  subgraph Projections
    CONSUMER["Projectors (Inbox/Idempotent)"]
    READDB[(SQL Read Models)]
    SEARCH[(Azure AI Search)]
  end

  subgraph Cold Path
    BLOB[(Blob Storage\nParquet/Exports)]
    ANALYTICS[(Spark/SQL)]
  end

  BUS --> CONSUMER
  CONSUMER --> READDB
  CONSUMER --> SEARCH
  CONSUMER --> BLOB
  SQL <-- cache-aside --> REDIS

Observability & SLOs

  • Signals: DB p95 latency, lock wait time, failed migrations, cache hit %, projection lag, index health, DLQ depth for projectors.
  • SLOs:
    • Write p95 ≤ 350 ms; Read p95 ≤ 200 ms
    • Projection lag p95 ≤ 60 s
    • Cache hit ≥ 85% for flag evaluation
    • Backup success 100%; restore drill quarterly

Failure Modes & Mitigations

| Failure | Impact | Mitigation |
|---|---|---|
| Hot partition / noisy neighbor | Latency spikes | Promote tenant to schema/DB; shard by tenant; add covering indexes |
| Long-running transactions | Lock contention | Break writes into smaller batches; use optimistic concurrency |
| Projection backlog | Stale read models | KEDA-scale projectors; partial rebuild by tenant; prioritize critical topics |
| Cache stampede | Thundering herd | Request coalescing; jittered TTL; background refresh |
| Schema drift | Consumer breaks | Contract tests; additive changes; deprecation windows |
| Data corruption | Incident/rollback | PITR restore to side DB; compare via checksums; rehydrate projections |

Solution Architect Notes

  • Start with pooled SQL and event-driven projections; add document/search only for proven needs.
  • Keep natural keys stable to enable safe migration/rebuilds.
  • Make retention a product setting—not hardcoded—so legal/compliance overlays can adjust.
  • Prefer append-only (Audit/Usage) plus compaction for analytics; avoid destructive changes in hot paths.

Messaging & Integration Patterns

Purpose

Standardize asynchronous communication across the platform using MassTransit with Azure Service Bus (ASB). Define topologies, routing conventions, retry/backoff/jitter, saga orchestration vs choreography, and compensation so services remain loosely coupled, reliable, and observable under load.


Topology & Conventions

Domain-first topology

  • Topic-per-domain: tenant-events, billing-events, config-events, usage-events, identity-events, notifications-events, audit-events.
  • Subscription-per-consumer: one subscription per logical consumer service (optionally per tenant segment or feature).
  • DLQ-per-subscription: automatic dead-letter queues with invariant retention.

Naming

  • Exchange/Topic: <domain>-events
  • Subscription: <consumer-svc>.<purpose> (e.g., billing-svc.rating, config-svc.invalidate)
  • Queue (commands): <svc>-cmd (optional; we favor events over commands across contexts)
  • Saga state store tables: <svc>_saga_<name>

Message envelopes (recap)

  • Required headers: type, id, time, traceId, tenantId, edition, schemaVersion, partitionKey, key (idempotency).
  • Partitioning: tenantId as default partitionKey to increase locality and ordering per tenant.
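Enforcing the required headers and tenant scoping at consume time can be sketched as a small guard; the header list mirrors the recap above, while the exception types and function name are an illustrative choice:

```python
# Required envelope headers, as recapped above.
REQUIRED = ("type", "id", "time", "traceId", "tenantId", "edition",
            "schemaVersion", "partitionKey", "key")

def validate_envelope(msg: dict, expected_tenant: str) -> None:
    """Reject messages missing required headers or carrying a mismatched
    tenantId (the cross-tenant SEV-1 case). Real checks would live in
    consumer middleware, not per-handler code."""
    missing = [h for h in REQUIRED if h not in msg]
    if missing:
        raise ValueError(f"envelope missing headers: {missing}")
    if msg["tenantId"] != expected_tenant:
        raise PermissionError(
            f"tenant mismatch: {msg['tenantId']} != {expected_tenant}")
```

A rejected message would be dead-lettered rather than retried, since the failure is non-transient.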

MassTransit Setup Patterns (C# excerpts)

Bus configuration (ASB + outbox)

services.AddMassTransit(x =>
{
    x.SetKebabCaseEndpointNameFormatter();

    x.AddEntityFrameworkOutbox<AppDbContext>(o =>
    {
        o.QueryDelay = TimeSpan.FromSeconds(1);
        o.DuplicateDetectionWindow = TimeSpan.FromMinutes(10);
        o.UseBusOutbox();
    });

    // Consumers, Sagas, Activities
    x.AddConsumersFromNamespaceContaining<TenantCreatedConsumer>();
    x.AddSagaStateMachine<SubscriptionSaga, SubscriptionState>()
        .EntityFrameworkRepository(r => r.ConcurrencyMode = ConcurrencyMode.Optimistic);

    x.UsingAzureServiceBus((context, cfg) =>
    {
        cfg.Host(builder.Configuration["ServiceBus:ConnectionString"]);
        cfg.MessageTopology.SetEntityNameFormatter(new DomainTopicFormatter());

        cfg.UseMessageRetry(r =>
            r.Exponential(5, TimeSpan.FromMilliseconds(200), TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(50)));

        cfg.UseInMemoryOutbox(); // consumer-side dedupe window
        cfg.UseConcurrencyLimit(64);
        cfg.ConfigureEndpoints(context);
    });
});

Consumer template (idempotent + inbox)

public class TenantCreatedConsumer : IConsumer<TenantCreated>
{
    private readonly Inbox _inbox;
    public TenantCreatedConsumer(Inbox inbox) => _inbox = inbox;

    public async Task Consume(ConsumeContext<TenantCreated> ctx)
    {
        if (!await _inbox.TryBeginAsync(ctx.Message.Id)) return; // dedupe
        try
        {
            // side-effect safe work
            await HandleAsync(ctx.Message);
            await _inbox.CompleteAsync(ctx.Message.Id);
        }
        catch (Exception ex)
        {
            await _inbox.FailAsync(ctx.Message.Id, ex);
            throw; // allow retry policy to engage
        }
    }
}

Retry, Backoff & Jitter

| Scenario | Policy | Max Attempts | Initial Delay | Max Delay | Notes |
|---|---|---|---|---|---|
| Transient network | Exponential + jitter | 5 | 200 ms | 10 s | Default consumer policy |
| Rate-limited upstream | Decorrelated jitter | 6 | 500 ms | 30 s | Honor Retry-After |
| Idempotent publish | Linear | 3 | 2 s | 6 s | Outbox ensures once-per-change |
| External webhooks | Exponential + cap | 8 | 1 s | 5 min | Move to DLQ after cap |
| Payment ACL | Saga step specific | 4 | 1 s | 60 s | Backoff grows per failure stage |

Rules

  • Retry only idempotent operations; non-idempotent steps must be guarded by saga state.
  • Add small random jitter to prevent thundering herds.
  • After retries exhausted → DLQ with full context and last exception chain.
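The decorrelated-jitter policy from the table can be sketched as a delay generator; the base and cap defaults mirror the rate-limited-upstream row, and the formula follows the widely used "decorrelated jitter" scheme (each delay is drawn from a range that grows with the previous delay):

```python
import random

def decorrelated_jitter(base=0.5, cap=30.0):
    """Yield retry delays in seconds: sleep = min(cap, uniform(base, 3*prev))."""
    sleep = base
    while True:
        sleep = min(cap, random.uniform(base, sleep * 3))
        yield sleep
```

A caller takes one value per attempt and gives up (dead-lettering the message) once the attempt budget is exhausted.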

Orchestration vs Choreography

| Pattern | When to use | Mechanism | Pros | Cons |
|---|---|---|---|---|
| Choreography | Independent reactions to a fact (e.g., tenant.created) | Events only | Simple, scalable, low coupling | Harder to visualize global flow |
| Orchestration | Multi-step, long-running business process (e.g., subscription activation) | Saga coordinates steps | Centralized state, compensations explicit | Orchestrator coupling; needs strong tests |

Guideline: Prefer choreography for enrichment and projections. Use sagas only for business-critical, multi-step flows with compensations (payments, migrations).


Saga Orchestration (Billing Example)

State machine outline

stateDiagram-v2
  [*] --> Pending
  Pending --> Authorizing : command.received
  Authorizing --> Active : payment.captured
  Authorizing --> Dunning : payment.failed(retry_exhausted)
  Dunning --> Suspended : dunning.failed
  Active --> Suspended : subscription.payment.overdue
  Suspended --> Active : payment.captured
  Active --> [*]

Key design

  • Idempotency: Correlate by subscriptionId (saga key). Each event mutates state exactly once.
  • Compensation: If invoice created but payment fails → issue credit note, revert entitlements, emit billing.subscription.suspended.
  • Timeouts: Each step has a receive timeout; when exceeded, move to next compensating step (e.g., dunning).
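The state machine above can also be expressed as a transition table, which keeps handlers deterministic and makes duplicate events harmless no-ops. The event names below are flattened renderings of the diagram labels (illustrative, not the wire-level type strings):

```python
# (current state, event) -> next state; unknown pairs leave state unchanged.
TRANSITIONS = {
    ("Pending", "command.received"): "Authorizing",
    ("Authorizing", "payment.captured"): "Active",
    ("Authorizing", "payment.failed.retry_exhausted"): "Dunning",
    ("Dunning", "dunning.failed"): "Suspended",
    ("Active", "subscription.payment.overdue"): "Suspended",
    ("Suspended", "payment.captured"): "Active",
}

def apply(state: str, event: str) -> str:
    """Apply one event to the saga state; ignoring unknown pairs makes
    redelivered events safe."""
    return TRANSITIONS.get((state, event), state)
```

Because `apply` is a pure function, replaying the event stream always reproduces the same terminal state.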

Activities (MassTransit)

  • ReservePlanActivity → RequestPaymentActivity → ActivateEntitlementsActivity
  • On failure path: IssueCreditActivity → SuspendSubscriptionActivity

Compensation Patterns

| Failure | Compensation | Notes |
|---|---|---|
| Payment captured but entitlements not activated | Refund/credit note, revoke token grants | Ensure idempotent credit issuance |
| Tenant promoted but mapping not switched | Roll back mapping; keep dual-writes; retry cutover | Feature flag tenant.readonly protects |
| Email sent to wrong template | Send corrective message, mark original as superseded | Immutable log kept in Audit |
| Usage over-reported | Emit usage.adjustment event; recompute invoice | Maintain adjustment ledger |

Technique

  • Compensations are first-class commands/events with their own audit entries.
  • No “delete-and-forget”; always append corrective facts.

Error Handling & DLQ Strategy

Handler contract

  • Validate envelope invariants (tenant, trace, type) first; reject missing or mismatched context (SEV-1 if produced internally).
  • Side effects must be wrapped with transaction boundaries; record idempotency outcome.

Dead-lettering

  • Criteria: max delivery count exceeded, non-transient exceptions (validation, authorization), poison messages (schema mismatch).
  • DLQ payload: original message + headers + last exception + handler name + attempt count.
  • Replay: operator-driven tool with safe-mode (dry-run), rate limiters, circuit breakers, and quarantine on re-poisoning.
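The operator-driven replay tool can be sketched as follows; the dry-run default, the rate cap, and the result shape are illustrative assumptions, not the platform's actual tooling:

```python
def replay_dlq(messages, handler, dry_run=True, rate_limit=100):
    """Replay dead-lettered messages in safe mode.

    dry_run: report what would be replayed without invoking handlers.
    rate_limit: cap the batch size so replays cannot flood consumers.
    Messages that fail again are re-quarantined, not retried in a loop.
    """
    replayed, quarantined = [], []
    for msg in messages[:rate_limit]:
        if dry_run:
            replayed.append(msg["id"])        # report only
            continue
        try:
            handler(msg)
            replayed.append(msg["id"])
        except Exception:
            quarantined.append(msg["id"])     # re-poisoned: keep out of flow
    return {"replayed": replayed, "quarantined": quarantined}
```

An operator would first run with `dry_run=True`, inspect the report, then replay for real with a conservative rate limit.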

Monitoring

  • Metrics: subscription lag, handler error rate, DLQ depth/age, saga timeout count.
  • Alerts: threshold breaches trigger runbooks (scale consumers, pause producers, enable backpressure at gateway).

Integration Patterns (edge & third parties)

  • Webhooks (outbound): Signed (HMAC), retry with backoff up to 24h, idempotency via Event-Id, age limit (drop after TTL).
  • Inbound third-party callbacks: Terminate at Gateway; validate signature & age; enqueue to inbox queue for processing.
  • Payment ACL: Isolate providers’ SDKs; map transient vs permanent failures; unify errors to domain codes.
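Outbound webhook signing with HMAC, with idempotency keyed on the event id, can be sketched like this; the canonicalization (event id + "." + body) is an illustrative choice, not a fixed platform contract:

```python
import hashlib
import hmac

def sign_webhook(secret: bytes, event_id: str, body: bytes) -> str:
    """HMAC-SHA256 signature over event id and body (assumed canonical form)."""
    payload = event_id.encode() + b"." + body
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, event_id: str, body: bytes,
                   signature: str) -> bool:
    """Recompute and compare in constant time to resist timing attacks."""
    expected = sign_webhook(secret, event_id, body)
    return hmac.compare_digest(expected, signature)
```

The receiving side would also check the event age against the TTL and use the event id to drop duplicates.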

Observability

  • Spans: publish, consume, saga.step, saga.compensate, webhook.request, webhook.retry.
  • Attributes: tenantId, type, key, sagaId, deliveryAttempt, queue, subscription.
  • Logs: structured with exception chains; no PII. Include producer.service, consumer.service.
  • Metrics: end-to-end event lag p95, publish success rate, handler retries, DLQ age p95, saga step durations.

Performance & Scalability

  • KEDA triggers on ASB metrics (queue length, lag); scale consumers horizontally.
  • Use prefetch and concurrency limits tuned per handler (e.g., heavy CPU vs I/O bound).
  • For hot tenants, prefer per-tenant subscriptions/queues to isolate and prioritize critical customers.

Security & Tenancy

  • Tenant scoping: reject events missing tenantId; never emit cross-tenant payloads unless flagged as analytics and routed separately.
  • mTLS inside the cluster; ASB credentials via workload identity.
  • Least privilege SAS/RBAC roles per consumer/producer; rotate keys ≤ 90 days.
  • Data minimization: events should reference entities, not embed sensitive data.

Solution Architect Notes

  • Use outbox everywhere—it’s the linchpin of dependable messaging.
  • Keep sagas lean and deterministic; external calls go through activities/ACLs with clear retry/timeout semantics.
  • Focus on idempotency: it’s cheaper to ensure than to diagnose duplicates in production.
  • Make DLQ a first-class workflow with replay tooling and tight observability—assume it will be used.

AI-First & Agentic Orchestration

Purpose

Embed safe, deterministic AI assistance into the SaaS Factory to accelerate planning, scaffolding, documentation, tests, and operational hygiene—without bypassing security, change control, or human judgment. Agents propose and scaffold; humans own approval and deployment. All agent actions are audited, observable, and reversible.


Agent Roles (factory-internal)

| Agent | Primary Outcomes | Typical Triggers | Key Outputs |
|---|---|---|---|
| Product Blueprint Agent | Turn a product idea/recipe into an initial blueprint aligned to platform patterns | New product request; edition/pack change | HLD skeleton, context map updates, ADR drafts |
| Service Scaffolder Agent | Create service projects from templates (API/Worker/Saga), wiring tenancy, OTel, health, outbox | "Add service" request; new bounded context | Repo branches/PRs with solution scaffold, CI pipeline YAML |
| Contract & SDK Agent | Generate/validate OpenAPI & event schemas, produce language SDKs | New/changed endpoints or events | contracts/*.yaml/json, SDK packages, contract tests |
| Test Generator Agent | Propose unit/contract/E2E tests; synth checks for SLOs | New feature PRs; failing SLOs | Test projects, synthetic monitors |
| Docs & Runbook Agent | Produce developer and operator docs; incident runbooks | New service or ADR; post-incident tasks | docs/*, ops/runbooks/* |
| Operability Agent | Create dashboards/alerts, SLOs, chaos experiments | New service onboard; SLO drift | Grafana dashboards, alert rules, chaos experiments |
| Security & Compliance Agent | Enforce guardrails (SBOM, SAST/SCA, secret scans), suggest remediations | PR validation; dependency changes | Policy reports, license notices, PR comments |
Optional, tenant-facing assistants (e.g., support Q&A) are separate products behind strong data-isolation and are not assumed by the factory baseline.


Skills & Tooling (curated, least-privilege)

| Skill/Tool | Scope (allow-listed) | Notes |
|---|---|---|
| Template Engine | Read /factory/templates/**; write to feature branch | No direct writes to main |
| Repo API | Create branches/PRs; comment on PR; no force-push | PR labels must indicate "ai-generated" |
| Contract Linter | Validate OpenAPI/event schemas, versioning | Fails on breaking changes without ADR |
| MassTransit/Outbox Scaffolder | Wire messaging boilerplate | Enforces outbox/inbox and OTel |
| IaC Generator (Bicep/Pulumi) | Produce env-scoped stacks | Read-only cloud; deploy only via pipeline |
| Policy Gate Runner | SAST/SCA, license, SBOM, secret scan | Blocks PR if violations found |
| Observability Pack | OTel wiring, dashboard JSON, alert rules | Requires SLO metadata |
| Doc/Runbook Composer | Create/update Markdown and Mermaid | Must include change rationale and rollbacks |

All tools are invoked through Semantic Kernel with capabilities/RBAC matching the agent role. No tool exposes secrets or production tokens to agents.


Determinism & Safety

  • Model & prompting discipline: temperature ≈ 0, pinned models, structured prompts with explicit acceptance criteria.
  • Deterministic artifacts: agents produce diff-minimal changes; every artifact carries an x-origin: ai/<agent>/<hash> footer.
  • Reproducibility: inputs (recipe, prompts, params) logged; artifacts hashed; “re-run with same seed” supported.
  • Human-in-the-loop: agents cannot merge; required human review + green policy gates.
  • Data minimization: agents read only non-PII source; no tenant data; redaction in logs.
  • No live mutations: agents never call production APIs; all changes flow via PR → CI → deploy.
  • Denylist/Allowlist: explicit denied actions (e.g., dropping DB tables); tools must guard server-side.
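The x-origin footer noted above can be derived deterministically from the artifact bytes, so re-runs with identical inputs yield identical footers. Truncating a SHA-256 digest to 12 hex characters is an illustrative choice:

```python
import hashlib

def origin_footer(agent: str, content: bytes) -> str:
    """Build an x-origin footer for an AI-generated artifact: agent name plus
    a content hash, so reproducibility can be checked byte-for-byte."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    return f"x-origin: ai/{agent}/{digest}"
```

Review tooling can then recompute the hash to verify the artifact was not edited after generation.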

Orchestration Flow (Semantic Kernel)

```mermaid
sequenceDiagram
  participant PO as Product Owner
  participant Orchestrator as SK Orchestrator
  participant Blueprint as Blueprint Agent
  participant Scaffolder as Service Scaffolder
  participant Contracts as Contract & SDK Agent
  participant Tests as Test Generator Agent
  participant Ops as Operability Agent
  participant Repo as Git/PR
  participant CI as CI/CD Pipeline

  PO->>Orchestrator: "New product/service recipe"
  Orchestrator->>Blueprint: Plan HLD/ADRs from recipe
  Blueprint-->>Orchestrator: HLD diff & ADR drafts
  Orchestrator->>Scaffolder: Generate service skeleton(s)
  Scaffolder->>Repo: Open PR with code + pipelines
  Orchestrator->>Contracts: Generate/validate contracts
  Contracts->>Repo: Commit schemas + contract tests
  Orchestrator->>Tests: Add unit/contract/E2E tests
  Tests->>Repo: Commit tests
  Orchestrator->>Ops: OTel wiring, dashboards, alerts
  Ops->>Repo: Commit observability pack
  Repo->>CI: PR checks (SAST/SCA, SBOM, tests, policies)
  CI-->>Repo: Status (pass/fail)
  Note over Repo,PO: Human reviews; merge if green
```

Guardrails & Policies

  • Identity & RBAC: agents authenticate as service principals with minimal scopes; actions are auditable and reversible.
  • Policy gates: PRs must pass security scans, contract tests, SLO linters, and governance checks (ADR present for major changes).
  • Content safety: prompt-injection filters, tool-call allowlists, and output sanitizers (no secrets, no PII).
  • Rate & cost controls: per-agent quotas; budget alerts; offline mode fallback.
  • Rollbacks: every change includes a generated rollback playbook and revert.sh script when applicable.

Observability & SLOs (AI operations)

| Signal | Target | Rationale |
|---|---|---|
| PR acceptance rate (ai-generated) | ≥ 80% | Indicates useful, review-ready output |
| Policy gate pass rate | ≥ 95% | Low violation rate |
| Mean time to scaffold service | ≤ 10 min | Responsiveness |
| Post-merge incident rate attributable to AI changes | 0 | Safety baseline |
| Reproducibility check (hash match) | 100% | Determinism |

Traces include agentId, tool, operation, repo, branch, artifactHash. Logs exclude secrets and PII.


Failure Modes & Mitigations

| Failure | Symptom | Mitigation |
|---|---|---|
| Hallucinated API/contract | PR fails contract lint | Contract Agent uses ground-truth schemas; block merge |
| Unsafe infra change | Policy gate fails | IaC policies (OPA/Conftest/Azure Policy) block; generate safer alternative |
| Non-deterministic output | Hash mismatch | Pin model/version; lower temperature; freeze template version |
| Over-scaffolding (bloat) | Large diff, unclear value | Orchestrator trims plan; human prompt to narrow scope |
| Tool misuse | Unauthorized API calls | Tool RBAC + server-side enforcement; audit & revoke token |

Solution Architect Notes

  • Treat the Orchestrator as a planner, not a do-everything agent; delegate to small, single-purpose agents with narrow tools.
  • Never let agents bypass PR review or production deployment pipelines.
  • Prefer small, composable PRs: one agent outcome per PR for easier review and rollback.
  • Make agent outputs teach-able: include rationale, caveats, and links to standards so humans learn and trust the system.

Observability by Design

Purpose

Establish a uniform, always-on telemetry pipeline built on OpenTelemetry for traces, metrics, and logs. Every service, job, and gateway must emit consistent signals enriched with multi-tenant context to enable SLO monitoring, fast incident response, and data-driven improvements. Observability is a non-removable guardrail.


Telemetry Architecture

Pipeline (high level)

  • Instrumentation: .NET OTel SDK in every service (HTTP, gRPC, MassTransit, SQL, Redis, custom spans).
  • Export: OTLP → Collector (agent/sidecar/daemonset) →
    • Traces: Tempo/Jaeger (or Azure Monitor/OpenTelemetry Distro)
    • Metrics: Prometheus (scraped from Collector or services) → Grafana
    • Logs: Structured JSON → Loki/Elastic/Azure Monitor Logs
  • Dashboards & Alerts: Grafana + Alertmanager (or Azure Monitor Alerts).
  • Correlation: W3C Trace Context (traceparent, tracestate) propagated end-to-end (gateway ↔ services ↔ jobs).
```mermaid
flowchart LR
  APP["Apps & Jobs (.NET + OTel)"] --> COL[OTel Collector]
  GATE[Edge Gateway] --> COL
  BUS[Service Bus Consumers] --> COL
  COL --> TRC[(Traces)]
  COL --> MET[(Metrics)]
  COL --> LOGS[(Logs)]
  TRC --> GRAF[Grafana/Tempo]
  MET --> GRAF
  LOGS --> GRAF
```

Required Attributes (span/log/metric labels)

| Key | Source | Purpose |
|---|---|---|
| traceId, spanId | OTel | Correlation |
| tenantId | Gateway/Service | Multi-tenant scoping |
| edition | Gateway/Config | Entitlement context |
| routeId / operation | Gateway/Service | API and domain op naming |
| service.name, service.version | OTel resource | Ownership & rollout correlation |
| messaging.system, message.type, message.key | MassTransit | EDA correlation & idempotency checks |
| db.system, db.statement (redacted) | ADO/EF | Hot query detection |
| http.method, http.route, http.status_code | ASP.NET Core | API SLOs |
| job.name, job.idempotencyKey | Jobs | Job tracing |
| agentId (AI) | AI Orchestration | Agent provenance |

PII is never recorded in attributes or logs. Use hashed or tokenized identifiers when needed for joins.


Span & Metric Conventions

Span naming

  • HTTP: http <VERB> <route> (e.g., http GET /api/tenants/{id})
  • Domain: <context>.<usecase> (e.g., billing.rate-usage)
  • Messaging: consume <type> / publish <type>
  • Jobs: job <name> (e.g., job dlq-replay)

Key metrics (Prometheus/OpenTelemetry Metrics)

  • API: http_server_duration_seconds (histogram), http_requests_total, http_errors_total
  • Messaging: consumer_lag_seconds, messages_processed_total, consumer_retry_total, dlq_depth
  • DB: db_client_duration_seconds, db_connections_in_use, deadlocks_total
  • Cache: cache_hit_ratio, cache_latency_seconds
  • Jobs: job_duration_seconds, job_failures_total, job_retries_total
  • SLO helpers: slo_availability_ratio, slo_latency_budget_burn, slo_error_budget_remaining

Logging

  • Structured JSON with timestamp, level, message, traceId, tenantId, edition, service, operation, exception.type, exception.stack(hash or summarized), fields{}.
  • Log levels: Info for state transitions, Warn for transient/backoff, Error for failed business operations, Fatal for process crash.

SLO Monitoring & Error Budgets

Golden paths (examples)

  • Auth token issuance: p95 ≤ 150 ms, availability ≥ 99.95%
  • Tenant onboarding (create → active): p95 ≤ 60 s, success rate ≥ 99%
  • Config evaluation: p95 ≤ 5 ms, availability ≥ 99.99%
  • Event ingestion to handling lag: p95 ≤ 60 s
  • Billing subscription activation (saga): p95 ≤ 2 min, failure < 0.5%

Error budget policy

  • If monthly SLO breaches consume >50% of budget: freeze non-urgent releases, run reliability epics.
  • Hard guardrails (no budget): Security, Telemetry integrity (missing tenantId/traceId), PII leakage.
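The ">50% of budget" release-freeze trigger can be computed directly. A sketch, assuming the monthly budget is simply (1 − SLO) × total requests for the window; the helper name is illustrative:

```python
def error_budget_remaining(slo: float, total: int, errors: int) -> float:
    """Fraction of the error budget still unspent for a window.

    slo: availability target, e.g. 0.9995 for 99.95%.
    total/errors: observed request counts for the same window.
    """
    allowed = (1 - slo) * total  # errors the budget permits
    if allowed == 0:
        return 0.0
    return max(0.0, 1 - errors / allowed)

# With a 99.95% SLO over 1,000,000 requests the budget is 500 errors;
# 250 observed errors leaves 50% of the budget, triggering the policy.
```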

Example Dashboards (sections)

  1. Edge/Gateway
    • Requests/sec by routeId, p95/99 latency, 4xx/5xx rates, rate-limit hits, canary weight, ejected backends.
  2. Service Health (per context)
    • API duration histogram, dependency latency (DB/Redis/HTTP), error rate, CPU/mem, pod restarts, rolling version mix.
  3. Messaging
    • Topic depth, subscription lag, consumer throughput, retry counts, DLQ depth/age, replay outcomes.
  4. Jobs
    • Success/failure, duration percentiles, retries, next runs, DLQ replayer status.
  5. Tenancy Overview
    • Top tenants by RPS, hottest partitions, promotion candidates, edition distribution, per-tenant error rates.
  6. AI Orchestration
    • PR acceptance %, gate pass rate, artifact hash reproducibility, cost usage.

Alerting (examples)

| Alert | Condition | Severity | Playbook |
|---|---|---|---|
| API latency SLO breach | p95 http_server_duration_seconds > SLO for 5m | P1 | Scale out, check DB latency, roll back canary |
| Availability dip (service) | 5xx rate > 2% for 10m | P1 | Trigger incident, flip traffic to stable, examine error budget |
| Consumer backlog | consumer_lag_seconds > 120s or dlq_depth increasing | P1 | Scale consumers, inspect DLQ samples, enable backpressure |
| Missing tenant context | % spans without tenantId > 0.01% | P0 | Block deployment; fix middleware; postmortem required |
| Cache miss storm | cache_hit_ratio < 70% for 10m | P2 | Warm cache, check invalidation loop |
| Key rotation nearing expiry | cert/key days_to_expiry < 14 | P2 | Rotate keys, verify JWKS/certs in all environments |

Sampling & Cost Controls

  • Traces: start with 10–20% head sampling; tail-based sampling at the Collector retains 100% of error and slow traces.
  • Logs: Info logs are rate-limited with burst controls; DEBUG logging is disabled in production (enable via a scoped feature flag for time-boxed windows).
  • Metrics: prefer histograms over raw timings; align bucket bounds with SLOs.
  • Cardinality hygiene: bound label sets (e.g., truncate user/IDs, never include raw emails).
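The head-plus-tail policy above amounts to a per-trace keep/drop decision. A sketch; the threshold and rate values are illustrative parameters, not platform defaults:

```python
import random

def keep_trace(is_error: bool, duration_ms: float,
               slow_threshold_ms: float = 1000.0,
               head_rate: float = 0.15) -> bool:
    """Sampling decision: always keep error and slow traces (tail
    criteria), otherwise keep a head-sampled fraction (10-20%)."""
    if is_error or duration_ms >= slow_threshold_ms:
        return True
    return random.random() < head_rate
```

In practice this decision runs in the Collector's tail-sampling processor, which buffers spans per trace before deciding; the sketch only shows the policy shape.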

Instrumentation Checklist (service template defaults)

  • .NET OTel AspNetCore, HttpClient, SqlClient, MassTransit instrumentations enabled.
  • Correlation middleware ensures tenantId, edition, traceId on all spans/logs.
  • Health endpoints: /healthz (liveness), /readyz (readiness); export otel.instrumentation.version.
  • Startup failure telemetry: if app crashes before OTel init, fall back to minimal bootstrap logger writing to stderr with correlation IDs.
  • Synthetic checks tagged as client=synthetic to avoid skewing user metrics.

Data Safety & Privacy

  • Redaction at source: PII scrubbers for logs; SQL statement text parameter values redacted.
  • Audit alignment: link audit IDs in spans for admin actions; audit remains immutable.
  • Access control: Observability backends require SSO + RBAC; tenant-scoped views for support, platform-wide for operators.

Failure Modes & Mitigations

| Failure | Symptom | Mitigation |
|---|---|---|
| Missing tenantId in spans/logs | Hard to triage multi-tenant issues | Block deployment (CI check); runtime guard drops context-less requests on non-public routes |
| Trace flood / high cost | Collector pressure, storage bills | Tail-based sampling; dynamic sampling policies; drop noisy internal spans |
| High-cardinality labels | Prometheus OOM, slow queries | Static label allowlist; bounds on IDs; drop/rename labels |
| Collector outage | Telemetry gaps | Local buffering with retry; secondary collector; alert on exporter failures |
| Log PII leakage | Compliance risk | PII scanners in CI + runtime; auto-redaction; SEV-0 response |

Solution Architect Notes

  • Define SLOs before code; dashboards and alerts are part of the service template and PR checked.
  • Favor tail-based sampling to keep costs predictable while capturing the right traces.
  • Tie deployment safety to observability: no public endpoint until OTel signals and dashboards are verified.
  • Make tenant and edition the first-class dimensions in every query and board—support depends on it.

Security & Threat Modeling

Purpose

Embed security-first principles into the SaaS platform design. Apply STRIDE threat modeling across all trust boundaries, enforce Zero Trust at edge and service-to-service, manage secrets with rotation, and ensure supply-chain integrity through SBOM and artifact signing. Security is non-negotiable and integrated into CI/CD and runtime.


Threat Model (STRIDE per Boundary)

| Boundary | Spoofing | Tampering | Repudiation | Information Disclosure | Denial of Service | Elevation of Privilege |
|---|---|---|---|---|---|---|
| Edge / API Gateway | Token forgery, session hijack | Request manipulation | Missing request logs | Sensitive data leakage via headers | Flooding / DDoS | Path traversal, privilege escalation |
| Service-to-Service | Forged service identity | Malicious message injection | Missing correlation | Overexposed events (tenant mix) | Queue flooding | Overbroad service scopes |
| Data Stores | Stolen credentials | SQL injection, blob tampering | No audit trails | Unencrypted data at rest | Hot partition overload | Misconfigured RLS, schema promotion abuse |
| CI/CD & Supply Chain | Build agent spoofing | Artifact tampering | Build logs altered | Secrets leakage in logs | Malicious PR floods pipeline | Malicious dependency injection |
| Tenant-Facing UIs | Phishing via iframe injection | DOM/XSS | Lack of audit for admin actions | Misconfigured CORS | Brute-force auth | Unscoped RBAC flaws |
| AI Agents | Prompt injection | Malicious PR diffs | No AI provenance logs | Leakage of sensitive configs | Resource abuse (cost spike) | Agent bypassing guardrails |

Zero Trust Principles

  • Authenticate everything: OAuth2/OIDC at edge; mTLS for inter-service; workload identities instead of static secrets.
  • Authorize explicitly: RBAC/ABAC checks enforced per operation; deny by default.
  • Audit everywhere: Immutable logs, linked to traceId, tenantId, actor.
  • Segment aggressively: per-service network policies, per-tenant data isolation (RLS/DB).
  • Assume breach: Red team simulation, chaos-security drills, SEV-0 for detected cross-tenant leakage.

Secrets & Key Management

  • Azure Key Vault (default) for secrets, keys, certs.
  • Rotation policies:
    • Tokens/keys ≤ 90 days
    • Certificates ≤ 1 year (auto-rotate with Key Vault Certificates)
  • Workload identity: Replace connection strings with AAD Managed Identity or Kubernetes Workload Identity.
  • Zero secrets in repo: Pre-commit hooks + CI scanners enforce.

Mutual TLS & Service Mesh Posture

  • Service-to-service: All gRPC/HTTP calls secured with mTLS via service mesh (e.g., Linkerd, Istio) or YARP with TLS termination.
  • Certificates: Issued and rotated automatically by mesh/Key Vault integration.
  • Trust store: Centralized CA; only platform-issued certs accepted.
  • Fallback: If mesh unavailable, services must still enforce TLS 1.2/1.3 with pinned certs.

Input Validation & Hardening

  • Gateway: global request validation (size, schema, rate).
  • Services: contract validation (OpenAPI, JSON Schema), input sanitization.
  • Databases: use parameterized queries (NHibernate/EF Core), enforce RLS.
  • Containers: minimal base images, read-only root FS, drop Linux capabilities, seccomp/AppArmor.
  • Kubernetes: pod security baseline; deny privileged containers.

Supply Chain Security

  • SBOM: generated per build (CycloneDX/Syft).
  • Dependency scanning: SCA in CI (Dependabot/Renovate + OSS Review Toolkit).
  • Artifact signing: cosign for container images; verify at deploy.
  • Provenance: SLSA Level 3 baseline: attestations for build, source, and dependencies.
  • Policy gates: block deploy if unsigned or vulnerable artifact.

Security Profile per Component

| Component | AuthN/AuthZ | Data Security | Audit | Hardening |
|---|---|---|---|---|
| API Gateway | OAuth2/OIDC, JWT validation, DPoP (optional) | TLS termination | Full request/response logs | WAF rules, DoS protection |
| Identity Service | OIDC/OAuth2, MFA for admin | SQL TDE, hashed passwords (Argon2id) | Token issuance logs | SCIM support, federation |
| Tenant Service | Scoped to tenantId | RLS enforced | Tenant lifecycle logs | Residency enforcement |
| Billing | Signed callbacks, PCI DSS zone | Ledger immutability, encryption | Invoice/payment audit | Saga compensations logged |
| Config Service | RBAC (admin-only mutations) | Encrypt secrets/flags | Config change history | Kill switch validation |
| Usage Service | Signed ingestion events | Partitioned by tenant | Usage adjustment ledger | Throttled ingestion |
| Notifications | Signed webhook delivery | Encrypt templates (at rest) | Delivery logs | Provider ACL |
| Audit Service | Append-only | Immutable schema, retention policy | Non-repudiation | Export controls |
| AI Orchestration | RBAC + agent identity | Redaction in prompts | AI provenance logs | Prompt-injection filters |

CI/CD Security Gates

  • Static Analysis (SAST): Roslyn analyzers, SonarQube.
  • Secrets scanning: GitLeaks/TruffleHog in pipeline.
  • Dependency scanning (SCA): alerts + fail on critical.
  • Container scan: Anchore/Grype/Trivy in CI.
  • Infra as Code scan: Checkov/OPA on Bicep/Pulumi.
  • PR checks: ADR presence, SBOM generated, cosign verification.

Observability & SLOs (Security)

  • Signals: token issuance latency, failed login rate, RBAC policy evaluation, key rotation lag, SBOM freshness.
  • SLO targets:
    • Token issuance success ≥ 99.95%
    • Cross-tenant access violations = 0
    • Secrets exposure in repo = 0
    • Key/cert expiry incidents = 0
    • Critical CVEs unpatched ≤ 7 days

Failure Modes & Mitigations

| Failure | Symptom | Mitigation |
|---|---|---|
| Tenant data leakage | Cross-tenant records in query/event | Block at DAL/middleware; automated test harness; SEV-0 incident response |
| Expired certs | Service call failures | Auto-rotation with Key Vault; monitor expiry; staged rotation tests |
| Supply chain attack | Dependency injection of malware | SBOM + provenance; strict registry mirror; signed artifacts only |
| Prompt injection (AI) | Malicious PR diffs or secret exfiltration attempts | Input sanitizers; denylist filters; tool-call allowlist |
| Stolen secrets | Credential replay | Rotate keys; enforce workload identity; block static credentials |

Solution Architect Notes

  • Treat security violations as SEV-0—no tolerance for cross-tenant leakage or unsigned artifacts.
  • Bake security gates into templates so generated services are secure-by-default.
  • Prefer short-lived credentials + workload identity over all else.
  • Keep attack surface minimal: fewer protocols, minimal images, tight RBAC.
  • Run quarterly threat model reviews and refresh STRIDE table as system evolves.

Resilience & Reliability Policies

Purpose

Establish a resilience-first posture to ensure services remain predictable, recoverable, and observable under failures. Standardize timeouts, retries, circuit breakers, bulkheads, and fallbacks across all services, supported by chaos experiments and steady-state SLO validation.


Core Reliability Patterns

| Pattern | Application | Notes |
|---|---|---|
| Timeouts | Every external call (HTTP/gRPC/DB/cache/message broker) | Default 2–5 s; domain-specific overrides; no unbounded waits |
| Retries | Transient failures (network, 429, 5xx) | Exponential backoff + jitter; max attempts ≤ 5; idempotent operations only |
| Circuit Breakers | Downstream repeated failures | Half-open after cooldown; reject early to prevent cascades |
| Bulkheads | Resource partitioning | Thread pool isolation per dependency; partition tenants if noisy |
| Fallbacks | Non-critical paths (config, cache, search) | Return defaults/stale data; never bypass auth/billing/audit |
| Graceful Degradation | Feature toggles | Disable non-essential modules to preserve core flows |
| Idempotency | API + event handlers | Safe retries, deduplication; enforced by outbox/inbox |
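The retry pattern above (exponential backoff with jitter and a capped attempt count) can be sketched as a delay schedule. The base/cap values mirror the gateway→service policy ("3x exponential, 100ms → 2s") but are otherwise illustrative:

```python
import random

def backoff_delays(base_ms: float = 100, cap_ms: float = 2000,
                   attempts: int = 3) -> list:
    """Exponential backoff with full jitter: each attempt waits a
    random delay in [0, min(cap, base * 2^attempt)] milliseconds."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_ms, base_ms * 2 ** attempt)
        delays.append(random.uniform(0, ceiling))  # full jitter
    return delays
```

Full jitter is what prevents the "thundering herd" failure mode listed later: clients that fail together do not retry together.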

Policy Matrix

| Dependency | Timeout | Retry Policy | Circuit Breaker | Bulkhead | Fallback |
|---|---|---|---|---|---|
| API Gateway → Service | 3s | 3x exponential (100ms → 2s) | Open after 5 failures / 30s | Route pool partitioning | Serve cached config/error envelope |
| Service → DB (SQL) | 5s | 2x linear (1s) | Open after 3 failures / 10s | Connection pool isolation per tenant | None; fail-fast |
| Service → Cache (Redis) | 2s | 3x exponential (50ms → 1s) | Open after 5 failures / 15s | Separate connection pools | Stale read (optional) |
| Service → Service Bus | 5s | 5x exponential (100ms → 5s) | Open after 10 failures / 60s | Consumer concurrency limits | Store in outbox for retry |
| Service → External Provider (Payments, Email) | 10s | 4x exponential (500ms → 30s) | Open after 5 failures / 60s | Thread pool partition | Retry + DLQ; tenant notified |
| Jobs (background) | Job-level SLA (e.g., 60s) | Retry up to 3x | Abort if circuit open | Queue partitioning | Reschedule; quarantine tenant batch |
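The "open after N failures / cooldown" circuit-breaker entries above can be sketched as a minimal failure-count breaker. The in-memory state and injectable clock are sketch conveniences, not the production design (real implementations would use Polly or similar in .NET):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; after
    `cooldown_s`, calls pass again (half-open) and the next
    failure re-opens while a success closes the circuit."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True  # closed: traffic flows
        return now - self.opened_at >= self.cooldown_s  # half-open probe

    def record(self, success: bool, now: float = None) -> None:
        now = time.monotonic() if now is None else now
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now
```

Rejecting early while open is what prevents a failing dependency from consuming the caller's bulkhead capacity and cascading.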

Chaos Engineering & Failure Injection

Goals

  • Validate that resilience patterns protect SLOs under fault injection.
  • Prove steady-state system remains within error budget even under chaos.

Scenarios

  1. Network latency: inject 500 ms–2 s delays between services.
  2. Dependency crash: kill DB/read replica; service bus outage.
  3. Message flood: burst 10× normal tenant traffic.
  4. Certificate expiry: simulate expired TLS/mTLS certs.
  5. Cache poisoning: inject stale config flags.
  6. AI agent misuse: agent suggests invalid scaffolding PRs.

Execution

  • Use chaos mesh/litmus in AKS, or Azure Chaos Studio.
  • Run during off-peak; abort if critical SLO breach > 15 min.
  • Record metrics: error rate, p95 latency, recovery time.

Resilience Testing & Steady-State SLOs

Steady-State Hypothesis: “The system continues to meet its defined SLOs under injected failures.”

Test Harness

  • Synthetic checks (login, tenant onboarding, subscription activation, config flag evaluation).
  • Run continuously; inject chaos in staging and periodically in prod (with guardrails).

SLO Verification

  • Auth token issuance: ≥ 99.95% success under chaos.
  • Tenant onboarding: p95 ≤ 60s with one DB node offline.
  • Config evaluation: p95 ≤ 10ms even if Redis down (fallback applies).
  • Billing saga: compensates gracefully within 2 min if provider unreachable.

Observability Integration

  • OTel spans mark retries, fallback paths, circuit breaker open/half-open.
  • Metrics: retry_attempts_total, circuit_open_total, fallback_requests_total, bulkhead_rejections_total.
  • Dashboards: reliability view per service, error budget burn rate, chaos experiment outcomes.
  • Alerts:
    • Circuit breaker open rate > 5% (P1)
    • Retry storm > 1000/min (P1)
    • Error budget burn > 20% in 24h (P0 freeze new releases)

Failure Modes & Mitigations

| Failure | Symptom | Mitigation |
|---|---|---|
| Retry storm → overload | High CPU, cascading failure | Add jitter; exponential backoff; max retry cap |
| Circuit breakers too aggressive | False positives; degraded UX | Tune thresholds; log open/close events |
| Bulkhead misconfig | Resource starvation across tenants | Separate thread pools; enforce quotas |
| Fallback leakage | Sensitive data exposed in defaults | Redact PII; fallback only to safe defaults |
| Chaos experiment causes outage | Real users impacted | Run in staging; prod chaos with guardrails and kill switch |

Solution Architect Notes

  • Bake resilience policies into templates—engineers shouldn’t reinvent timeouts or retries.
  • Ensure compensations are explicit, observable, and auditable.
  • Treat chaos engineering as validation of design assumptions, not an afterthought.
  • Tie resilience metrics into error budget policies to guide release decisions.

Jobs & Scheduling

Purpose

Standardize background processing across the platform to handle recurring, delayed, or one-off workloads that do not fit into synchronous request/response or event-driven flows. Ensure jobs are idempotent, observable, and auditable, with UTC-based scheduling and clear operational runbooks for reliability.


Job Frameworks

| Type | Default Framework | Notes |
|---|---|---|
| Short-lived, ad hoc background work | KEDA + Service Bus Queue Consumers | Elastic scaling; event-driven jobs |
| Recurring & delayed jobs | Hangfire (SQL/Redis backend) | Cron-based recurring jobs; dashboards |
| Heavy/batch jobs | KEDA scaling with containerized workers | Scale based on queue depth/metrics |
| Maintenance jobs | Kubernetes CronJobs | Low-frequency (e.g., nightly cleanup, backups) |

Principle: Jobs are treated as first-class services with the same tenancy, security, and observability constraints as APIs.


Scheduling Strategy

  • Time Standard: All schedules defined in UTC (no local timezones, to avoid drift).
  • Recurrence:
    • Hangfire CRON expressions stored under jobs/schedules/ config.
    • Critical jobs documented with frequency, duration SLA, and recovery playbook.
  • One-off jobs: Triggered via API or CLI, persisted in the job store with jobId, tenantId, and status.
  • Drift prevention: Time sync enforced across nodes (NTP).

Idempotency & Safety

  • Idempotency key: every job carries a unique jobKey = <jobName>:<scope>:<timestamp> (e.g., invoice:tenant123:2025-10-01).
  • Retries: capped exponential backoff (max attempts configurable); retries logged with correlation IDs.
  • Poison jobs: moved to DLQ table with full error context and manual replay tooling.
  • Concurrency guard: distributed locks (e.g., SQL/Redis) to prevent double-execution of same job key.
  • Cancellation: jobs respond to cancellation tokens; long-running jobs checkpoint progress.
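The jobKey format and the concurrency guard can be sketched together. `InMemoryLock` is a stand-in for the SQL/Redis distributed lock named above, only to illustrate the acquire-or-skip semantics:

```python
from datetime import datetime, timezone

def job_key(job_name: str, scope: str, when: datetime) -> str:
    """Build the jobKey <jobName>:<scope>:<timestamp>,
    e.g. invoice:tenant123:2025-10-01 (date granularity assumed)."""
    return f"{job_name}:{scope}:{when.date().isoformat()}"

class InMemoryLock:
    """Illustrative stand-in for a SQL/Redis distributed lock:
    a worker that cannot acquire the key skips the execution."""

    def __init__(self):
        self._held = set()

    def try_acquire(self, key: str) -> bool:
        if key in self._held:
            return False  # another worker already runs this job key
        self._held.add(key)
        return True

    def release(self, key: str) -> None:
        self._held.discard(key)
```

Because the key encodes name, scope, and timestamp, a retried or duplicated trigger of the same logical run resolves to the same key and is either deduplicated or blocked by the lock.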

Job Categories (examples)

| Category | Examples | Notes |
|---|---|---|
| Operational | Tenant promotion, DB migrations, cache warm-up | High-priority, operator-triggered |
| Business | Invoicing, subscription renewal, quota resets | Time-critical, audited |
| Data | Projections rebuild, DLQ replay, report generation | Idempotent; resumable |
| Notifications | Email/SMS campaigns, retries | Staggered sends to avoid spikes |
| Maintenance | Cleanup expired sessions, rotate keys, purge logs | Non-urgent; must not impact SLOs |

Observability & Telemetry

Traces

  • job.schedule (scheduled at time X)
  • job.execute (span per execution, includes jobId, jobKey, tenantId)
  • job.retry and job.fail events

Metrics

  • job_duration_seconds (histogram per job type)
  • job_success_total, job_failure_total
  • job_retries_total
  • job_scheduled_next{jobName} (gauge, Prometheus style)
  • dlq_jobs_total, dlq_jobs_age_seconds

Dashboards

  • Job health view: success/failure rates, retry counts, p95 execution times.
  • DLQ board: job type, failure reason distribution, replay outcomes.

Operational Runbooks (baseline)

  • Job failed repeatedly
    1. Inspect the DLQ table entry.
    2. Review trace/log context.
    3. Fix the underlying cause (data, downstream).
    4. Trigger replay via the job API (POST /api/jobs/{id}:replay).
  • Job backlog growing
    1. Check KEDA scaling triggers.
    2. Scale out workers (manual override if needed).
    3. Verify no partition hot-spot (tenant/queue).
    4. Clear the DLQ separately.
  • Misconfigured CRON
    1. Validate against the jobs/schedules/ config.
    2. Ensure UTC alignment.
    3. Correct the CRON expression; redeploy.
  • Stuck job
    1. Cancel the job via API.
    2. Mark as failed with checkpoint.
    3. Restart from the last checkpoint.

Failure Modes & Mitigations

| Failure | Symptom | Mitigation |
|---|---|---|
| Duplicate execution | Two workers run the same job | Use distributed lock; idempotency key |
| DLQ growth | Too many poison jobs | Replay tooling; operator alerts |
| Clock drift | Jobs run at wrong time | Enforce UTC; NTP sync across nodes |
| Long-running jobs hang | High latency, blocking resources | Cancellation tokens; checkpoints; watchdog alerts |
| Thundering herd on retries | Spikes after outage | Retry with jitter/backoff; stagger scheduling |

Solution Architect Notes

  • No hidden jobs: every job must be defined in config and visible on dashboards.
  • Always idempotent: jobs must tolerate retries and safe re-execution.
  • Runbooks mandatory: each recurring job must include operational steps in /ops/runbooks/jobs/.
  • Chaos test jobs: periodically inject job failures to validate retries and DLQ handling.

Configuration, Feature Flags & Edition Overrides

Purpose

Standardize configuration, feature flagging, and edition-specific overrides across the SaaS platform using external configuration/flag microservices (not embedded in the platform itself). Ensure safe rollouts, per-edition capability gating, kill-switches, and integration with Azure App Configuration and Key Vault for secrets.


Principles

  • Externalization: all runtime configuration, flags, and edition overlays are retrieved from Config/Feature microservices, not embedded in code.
  • Separation of concerns:
    • Secrets → Key Vault
    • Application settings → Azure AppConfig (or external Config service)
    • Feature flags → Feature Flag Service (edition- and tenant-aware)
  • Progressive rollout: gradual exposure of new features via percentage, tenant, or edition filters.
  • Kill-switches: rapid disablement of features at runtime for stability or security.
  • Edition overlays: baseline features defined in Metadata/Entitlements, overridden dynamically by edition or tenant flags.

Integration Pattern

Service startup flow

  1. Service authenticates via workload identity.
  2. Fetches configuration values (non-secret) from AppConfig or Config microservice.
  3. Fetches secrets from Key Vault (connection strings, API keys).
  4. Subscribes to config change notifications (events).
  5. Evaluates feature flags via Flag Evaluation API with context: { tenantId, edition, userId, environment }.
  6. Uses cached flag values with TTL; invalidates on config.flag.updated events.
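Step 6 (cached flag values with TTL, invalidated on config.flag.updated events) can be sketched as below. The flagKey:tenantId cache-key scheme and the injectable clock are assumptions for illustration:

```python
import time

class FlagCache:
    """TTL cache for evaluated flags, invalidated when a
    config.flag.updated event arrives for a flag key."""

    def __init__(self, ttl_s: float = 30.0):
        self.ttl_s = ttl_s
        self._entries = {}  # key -> (value, stored_at)

    def get(self, key: str, now: float = None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(key)
        if entry is None or now - entry[1] >= self.ttl_s:
            return None  # miss or expired: re-evaluate via the Flag API
        return entry[0]

    def put(self, key: str, value, now: float = None) -> None:
        now = time.monotonic() if now is None else now
        self._entries[key] = (value, now)

    def on_flag_updated(self, flag_key: str) -> None:
        # config.flag.updated: drop every cached entry for that flag
        self._entries = {k: v for k, v in self._entries.items()
                         if not k.startswith(flag_key + ":")}
```

The event-driven invalidation is what keeps kill-switch flips fast even with a generous TTL, as noted in the failure-modes table later.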

Feature Flag Evaluation

Flag structure

```json
{
  "flagKey": "betaFeatureX",
  "default": false,
  "rules": [
    { "target": "edition:Enterprise", "value": true },
    { "target": "tenant:t-123", "value": true },
    { "target": "percentage:10", "value": true }
  ],
  "killSwitch": false,
  "updatedAt": "2025-09-29T11:00:00Z"
}
```

Evaluation precedence

  1. Kill-switch (if true → disabled globally).
  2. Tenant-specific override.
  3. Edition-specific override.
  4. Progressive rollout filters (e.g., percentage, region).
  5. Default value.
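The precedence order can be encoded directly against the flag structure shown earlier. A sketch; `rollout_bucket` is assumed to be the tenant's stable 0–99 rollout bucket, and the returned reason strings mirror the Config Service example response:

```python
def evaluate_flag(flag: dict, tenant_id: str, edition: str,
                  rollout_bucket: int):
    """Apply precedence: kill-switch, tenant override, edition
    override, percentage rollout, then default. Returns (value, reason)."""
    if flag.get("killSwitch"):
        return False, "kill-switch"
    rules = {r["target"]: r["value"] for r in flag.get("rules", [])}
    if f"tenant:{tenant_id}" in rules:
        return rules[f"tenant:{tenant_id}"], "tenant-override"
    if f"edition:{edition}" in rules:
        return rules[f"edition:{edition}"], "edition-override"
    for target, value in rules.items():
        if target.startswith("percentage:"):
            pct = int(target.split(":", 1)[1])
            if rollout_bucket < pct:
                return value, "percentage-rollout"
    return flag.get("default", False), "default"
```

With the betaFeatureX example flag, tenant t-123 gets the tenant-override result regardless of edition, matching the sample API response below.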

Example API (Config Service external)

```
GET /api/flags/{flagKey}?tenantId=t-123&edition=Standard
200 OK
{
  "flagKey": "betaFeatureX",
  "value": true,
  "reason": "tenant-override"
}
```

Edition Overlays

  • Baseline: Metadata API defines products, editions, and entitlements.
  • Overlay: Config/Flag service applies dynamic rules on top (enable/disable features at runtime).
  • Example:
    • Edition: Standard includes Feature A.
    • Config overlay disables Feature A for tenant:t-999 (due to contractual restriction).
    • Config overlay enables Feature B early for tenant:t-123 as part of beta.

Kill-Switches

  • Definition: global flags (kill:<featureKey>) that force-disable functionality.
  • Usage: emergency disablement of faulty/new features.
  • Enforcement: evaluated in gateway and service middleware; takes precedence over tenant/edition overrides.
  • Auditability: all kill-switch flips logged in Audit service (audit.action.logged.v1).

Rollout Flows

Progressive rollout (percentage-based)

  1. Flag created as default=false.
  2. Rule applied: percentage=5% for target feature.
  3. Gradually increased to 25%, 50%, 100%.
  4. Rollout metrics tracked: error rates, adoption, SLO impact.
  5. Kill-switch available at every stage.
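For the percentage stages to compose (a tenant enabled at 5% stays enabled at 25%, 50%, 100%), the bucketing must be stable per tenant. A sketch using a hash of flagKey:tenantId; the exact bucketing function is an implementation choice, not specified by this document:

```python
import hashlib

def rollout_bucket(flag_key: str, tenant_id: str) -> int:
    """Stable 0-99 bucket per (flag, tenant) pair, so growing the
    percentage only ever adds tenants, never flips existing ones."""
    h = hashlib.sha256(f"{flag_key}:{tenant_id}".encode()).hexdigest()
    return int(h, 16) % 100

def in_rollout(flag_key: str, tenant_id: str, percentage: int) -> bool:
    return rollout_bucket(flag_key, tenant_id) < percentage
```

Salting the hash with the flag key means different flags sample different tenant cohorts, which avoids the same 5% of tenants absorbing every beta.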

Targeted rollout (tenant or edition-based)

  • Enterprise tenants get early access to new feature.
  • Beta tenants opt-in by contract.
  • Free/Standard remain unaffected until GA.

Operational Runbook

  • Document rollout strategy in /ops/runbooks/flags/<flagKey>.md.
  • Define: target tenants, monitoring metrics, rollback steps.

Observability

  • Spans: config.fetch, config.evaluate, flag.evaluate.
  • Metrics:
    • config_fetch_latency_seconds (p95 ≤ 50ms)
    • flag_eval_latency_seconds (p95 ≤ 5ms, Redis cached)
    • config_update_events_total
    • flag_killswitch_activated_total
  • Dashboards: flag evaluation success rate, cache hit ratio, rollout coverage %.
  • Alerts: kill-switch activation triggers P1 audit/notification; flag evaluation failures > 0.1% → P1 incident.

Security & Tenancy

  • Secrets: only pulled from Key Vault with workload identity; rotated automatically.
  • Config/Flags: always evaluated with tenantId; anonymous calls disallowed.
  • Multi-tenant enforcement:
    • RLS at config service.
    • Tenant isolation in flag definitions.
    • Edition scoping enforced at query layer.

Failure Modes & Mitigations

| Failure | Symptom | Mitigation |
|---|---|---|
| Config service outage | Features unavailable / mis-evaluated | Local cache with TTL; stale-while-revalidate |
| Kill-switch delay | Feature not disabled fast enough | Push invalidation events; force cache bust |
| Overlapping overrides | Confusing feature state | Precedence rules enforced; audit trail required |
| Flag drift between envs | Inconsistent tenant experience | Config sync job; environment drift monitor |
| Secrets leakage | PII in config | CI scans; Key Vault only; deny secrets in AppConfig |

Solution Architect Notes

  • Treat Config & Feature Flags as external microservices—our SaaS platform integrates with them, but does not own them.
  • AppConfig + Key Vault are integration points for base values/secrets, but dynamic logic lives in Config Service.
  • Bake flag evaluation middleware into service templates so every new service is “flag-aware.”
  • Document rollout/rollback strategies per flag—never release without a defined kill-switch.

API Design Standards & Versioning Policy

Purpose

Provide factory-wide API conventions so every generated product exposes a predictable, secure, and evolvable interface. External APIs are REST-first; internal service-to-service can additionally use gRPC where efficiency or streaming is beneficial. Contracts are spec-first (OpenAPI/Protobuf) with strong governance, multi-tenant invariants, and clear deprecation windows.


Protocol Posture

Audience Protocol Usage
External (public) REST/HTTP CRUD and task-oriented endpoints; JSON over HTTPS; HATEOAS not required; Problem+JSON.
Internal (services) gRPC (opt) High-throughput, low-latency calls; streaming; strictly inside the mesh with mTLS.
Async Events/Webhooks Event-driven facts via Service Bus; signed webhooks for tenant integrations.

Rule: prefer async events over synchronous cross-context calls. Use gRPC only when you need tight request/response between trusted services.


Versioning Strategy

Resource/API versions

  • URI versioning for REST (public): /v1/..., /v2/....
  • Content negotiation (optional): Accept: application/vnd.connectsoft.v1+json.
  • gRPC: package versioning (package billing.v1;) and additive field numbering.

Compatibility rules

  • Minor, additive changes (new fields) do not bump major.
  • Breaking changes → new major (v2) and new route set.
  • Sunset policy: minimum 12 months support for the previous major after v{N+1} GA (Enterprise can contract longer).
  • Deprecation headers: Deprecation, Sunset, Link: <changelog>; rel="deprecation" emitted by Gateway for deprecated versions.

Resource Modeling & Naming

  • Plural nouns for collections: /tenants, /subscriptions, /flags.
  • Hierarchy only where ownership is strict: /tenants/{tenantId}/subscriptions/{id}.
  • Actions-as-subresources (no verbs in paths):
    • POST /subscriptions/{id}:cancel
    • POST /tenants/{id}:promote
  • Idempotency for unsafe operations: clients send Idempotency-Key (UUID). Server stores outcome for at least 24h.

Pagination, Filtering, Sorting

Pagination

  • Cursor-based by default: ?cursor=<opaque>&limit=50
  • Response:
{
  "items": [ /*...*/ ],
  "pageInfo": { "nextCursor": "eyJhIjoi...\"", "hasNextPage": true, "count": 50 }
}
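As a sketch of how an opaque cursor might be produced and consumed — the encoding here (URL-safe base64 over a small JSON state) is an illustrative assumption, not the factory's actual format:

```python
import base64
import json

def encode_cursor(last_id: str, sort_key: str) -> str:
    """Encode pagination state as an opaque, URL-safe token."""
    raw = json.dumps({"id": last_id, "k": sort_key}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")

def decode_cursor(cursor: str) -> dict:
    """Decode an opaque cursor; base64 padding is restored first."""
    padded = cursor + "=" * (-len(cursor) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

cursor = encode_cursor("sub-00123", "2025-09-29T09:30:00Z")
state = decode_cursor(cursor)
# state == {"id": "sub-00123", "k": "2025-09-29T09:30:00Z"}
```

Because the token is opaque, the server can change the internal state shape (e.g., add a shard hint) without breaking clients.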

Filtering

  • ?filter=field1:eq:value1,field2:lt:10 (simple RSQL-like), or discrete params for common fields.
  • Disallow filtering on sensitive/PII fields.
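A minimal parser for the RSQL-like syntax above, including the PII allow-list gate. The field and operator sets are illustrative assumptions; the spec only shows `eq` and `lt`:

```python
ALLOWED_FIELDS = {"status", "createdAt", "edition"}   # assumed example allow-list
OPS = {"eq", "ne", "lt", "lte", "gt", "gte"}          # assumed operator set

def parse_filter(expr: str) -> list:
    """Parse 'field:op:value' clauses; unknown fields or operators are rejected."""
    clauses = []
    for part in expr.split(","):
        field, op, value = part.split(":", 2)
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"filtering on '{field}' is not allowed")
        if op not in OPS:
            raise ValueError(f"unsupported operator '{op}'")
        clauses.append((field, op, value))
    return clauses

print(parse_filter("status:eq:active,createdAt:lt:2025-01-01"))
# [('status', 'eq', 'active'), ('createdAt', 'lt', '2025-01-01')]
```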

Sorting

  • ?sort=+createdAt,-name (stable secondary sort by id).

Total counts

  • Avoid for hot paths; expose separate HEAD or /stats if needed.

Standard Request & Response Conventions

Headers (required/standardized)

  • Inbound: Authorization: Bearer <JWT>, X-Tenant-Id (edge injects if not provided), X-Request-Id/Traceparent.
  • Rate limits: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
  • Idempotency: Idempotency-Key (unsafe methods).
  • Caching: ETag, If-None-Match, Cache-Control for GETs where safe.

Error envelope (Problem+JSON)

{
  "type": "https://docs.connectsoft.cloud/problems/tenant-ambiguous",
  "title": "Tenant context is ambiguous",
  "status": 400,
  "traceId": "00-7e0d...-01",
  "tenantId": "t-123",
  "detail": "X-Tenant-Id header conflicts with token claim",
  "errors": [ { "code": "TENANT_MISMATCH", "path": "$.headers.x-tenant-id" } ]
}
  • Always include traceId, and when present, tenantId.
  • Avoid PII in detail.

Validation

  • Use JSON Schema derived from OpenAPI components; reject unknown fields when strict=true flag is enabled (default for internal/gRPC).

Tenancy & Security Invariants

  • Every authenticated request must resolve a single tenant; ambiguous → 400.
  • Authorization: scope + role check at edge; resource-level ABAC in service.
  • Data access is tenant-scoped; cross-tenant resources are never addressable by ID alone.
  • PII hygiene: redact in logs; never echo secrets.
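The single-tenant-resolution invariant can be sketched as a pure function — header vs. token claim, with a conflict mapping to the `TENANT_MISMATCH` Problem+JSON shown earlier (function and exception names are illustrative):

```python
from typing import Optional

class TenantResolutionError(Exception):
    """Maps to HTTP 400 Problem+JSON with code TENANT_MISMATCH."""

def resolve_tenant(header_tenant: Optional[str], claim_tenant: Optional[str]) -> str:
    """Resolve exactly one tenant; ambiguity or absence is a 400, never a guess."""
    if header_tenant and claim_tenant and header_tenant != claim_tenant:
        raise TenantResolutionError("X-Tenant-Id header conflicts with token claim")
    tenant = header_tenant or claim_tenant
    if tenant is None:
        raise TenantResolutionError("no tenant context in request")
    return tenant
```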

REST Resource Examples (concise)

Tenants

POST   /v1/tenants
GET    /v1/tenants/{tenantId}
PATCH  /v1/tenants/{tenantId}
POST   /v1/tenants/{tenantId}:promote        // isolation level upgrade

Billing

POST   /v1/subscriptions                     // Idempotent; requires Idempotency-Key
GET    /v1/subscriptions/{id}
POST   /v1/invoices/{id}:pay
GET    /v1/subscriptions?cursor=&limit=&sort=

Config & Flags

GET    /v1/flags/{key}?context=user:...      // evaluation; no anonymous
POST   /v1/flags                              // admin only
POST   /v1/overrides                          // tenant/edition overrides

gRPC Internal Standards (optional)

  • Service boundaries mirror REST resources, but optimized for chatty or streaming ops (e.g., usage ingest):
    • Package: usage.v1, Service: Ingest with rpc Append(stream Record) returns (AppendAck)
  • Auth: mTLS + per-RPC auth via authorization metadata; include x-tenant-id.
  • Backwards compatibility: fields never renumbered; only add new optional fields.

Caching & Conditional Requests

  • Safe GETs return ETag; clients may send If-None-Match and receive 304 Not Modified for unchanged resources.
  • TTL hints via Cache-Control: private, max-age=30 for non-sensitive tenant reads.
  • Do not cache mutable or sensitive resources (billing actions, identity).

Rate Limiting & Quotas

  • Enforced at gateway; mirrored by services to prevent bypass.
  • Headers communicate quota state (see above).
  • 429 includes Retry-After and X-RateLimit-Policy (edition/plan identifier).

Idempotency & Concurrency

  • Unsafe methods (POST, PATCH, DELETE) accept Idempotency-Key; server guarantees exactly-once effect within 24h window.
  • Optimistic concurrency: If-Match: <ETag> required for updates; stale tag → 412 Precondition Failed.
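A minimal in-memory sketch of the Idempotency-Key outcome store (a production implementation would use a durable, shared store; the class and method names here are illustrative):

```python
import time

class IdempotencyStore:
    """Caches outcomes of unsafe requests keyed by Idempotency-Key (24h TTL)."""

    def __init__(self, ttl_seconds: int = 24 * 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (stored_at, outcome)

    def execute(self, key: str, operation):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry and now - entry[0] < self.ttl:
            return entry[1]          # replay stored outcome; no side effects
        result = operation()         # first (or expired) call: run for real
        self._store[key] = (now, result)
        return result

store = IdempotencyStore()
calls = []
result1 = store.execute("key-1", lambda: calls.append(1) or "created")
result2 = store.execute("key-1", lambda: calls.append(1) or "created")
# result1 == result2 == "created"; the operation ran only once
```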

OpenAPI & SDKs

  • Source of truth: contracts/openapi/<service>-v{N}.yaml.
  • Linting rules:
    • Must document auth requirements, tenant context, error responses (4xx, 5xx).
    • Pagination schema reusable (PageInfo).
    • Examples for each operation with tenantId and traceId.
  • SDK generation: TypeScript, .NET, and Python as needed; versioned per API major.

Deprecation & Sunset Workflow

  1. Mark endpoints/fields as deprecated in OpenAPI (deprecated: true) and docs.
  2. Notify via tenant/admin changelog events and status page.
  3. Emit deprecation headers from edge for affected routes.
  4. Observe usage via logs/metrics per version.
  5. Remove only after window closes and usage < threshold (e.g., <1% over 30 days).

Observability & SLOs

Traces

  • Span names: http <VERB> <route>; attributes: tenantId, edition, routeId, version, idempotencyKey (when present).

Metrics (per route & version)

  • http_requests_total, http_server_duration_seconds (histogram), http_4xx/5xx_total, rate_limit_hits_total.

Baselines

  • p95 read ≤ 200 ms, write ≤ 350 ms; success ≥ 99.9%; schema/contract lint pass 100% in CI.

Risks & Trade-offs

Risk/Decision Trade-off Mitigation
URI versioning proliferation More routes to maintain Contract governance, codegen, and routing templates
Cursor pagination complexity Harder for simple clients Provide helper SDKs and compatibility with offset
Strict idempotency semantics Server storage for keys/outcomes TTL window; compact storage; GC jobs
Long deprecation windows Slower platform evolution Telemetry-driven removal; enterprise opt-in support
Dual REST/gRPC posture Two stacks to maintain internally Use gRPC selectively; concentrate shared interceptors

Solution Architect Notes

  • Treat contracts as code: PRs that change APIs must update OpenAPI/Protobuf, examples, and consumer contract tests.
  • Keep tenant context explicit; never infer from payloads alone.
  • Prefer event notifications (webhooks/events) to polling—then design REST for discovery & control, not data streaming.
  • Document SLOs and error budgets at the operation level for critical routes (auth, onboarding, billing).

Webhooks & Extensibility

Purpose

Provide a generic, reusable webhook capability for all SaaS solutions produced by the factory. Webhooks extend the platform to tenant systems and partner integrations without tight coupling, using signed deliveries, durable retries with ageing, idempotency, replay, and a self-service management API. The design is multi-tenant, edition-aware, and can be embedded as a shared microservice in other platforms.


Architecture Overview

Components

  • Webhook Manager (API): Tenant-scoped CRUD for subscriptions, secrets, filters, delivery policies, and replay requests.
  • Webhook Dispatcher (Worker): Consumes domain events, applies routing rules, signs payloads, and performs resilient delivery with backoff and DLQ.
  • Subscription Store: Tenant-scoped definitions (endpoint URL, secret, filters, version, status).
  • Delivery Store: Durable log of attempts/outcomes for replay, auditing, and idempotency.
  • Schema Catalog: Canonical schemas (by event type/version) exposed for consumer validation/generation.

Event Flow

  1. Domain services publish events to the event bus.
  2. Dispatcher projects them against active subscriptions (per tenant) and enqueues delivery jobs.
  3. Delivery attempts use HMAC-signed HTTP calls; failures retry with exponential backoff + jitter until age limit reached.
  4. After max age/attempts, jobs are moved to DLQ; operators or tenants can replay.
flowchart LR
  E["Domain Events (Service Bus)"] --> D[Webhook Dispatcher]
  D -->|match filters| Q[Delivery Queue]
  Q -->|http POST| R[Receiver Endpoint]
  R -->|2xx| DS[Delivery Store]
  R -->|>=400| Q
  D -->|exhausted| DLQ[Dead-Letter Queue]
  M[Webhook Manager API] --> S[Subscription Store]
  M --> DS
  M -->|replay| Q
Hold "Alt" / "Option" to enable pan & zoom

Delivery Contract

HTTP Method: POST
Content-Type: application/json (UTF-8)
Headers (required):

  • Webhook-Id: stable UUID for the subscription (not the tenant)
  • Webhook-Event-Id: unique delivery id (UUID)
  • Webhook-Event-Type: canonical type (e.g., tenant.created.v1)
  • Webhook-Event-Time: RFC3339 timestamp
  • Webhook-Signature: sha256=<hex> (HMAC over {timestamp}.{body})
  • Webhook-Timestamp: UNIX seconds used in signature
  • Webhook-Retry-Count: attempt number
  • Webhook-Tenant-Id: source tenant
  • Traceparent: W3C trace context for end-to-end correlation

Payload (envelope):

{
  "id": "9f4a2d64-7c8a-4a25-9a0a-0f0e9b5dcb31",
  "type": "tenant.created.v1",
  "occurredAt": "2025-09-29T09:30:00Z",
  "tenantId": "t-123",
  "schemaVersion": "1",
  "data": {
    "name": "Acme",
    "edition": "Standard"
  }
}

Security

  • Signature secret is per subscription and rotates without downtime (see Rotation).
  • Receivers must validate timestamp freshness (e.g., ≤ 5 minutes drift) and recompute HMAC.
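A receiver-side sketch of the signature scheme above — HMAC-SHA256 over `{timestamp}.{body}`, with freshness checked before the constant-time digest comparison (default drift window taken from the ≤ 5 minutes guidance):

```python
import hashlib
import hmac
import time

def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Compute the Webhook-Signature value: sha256=<hex> over '{timestamp}.{body}'."""
    msg = f"{timestamp}.".encode("utf-8") + body
    return "sha256=" + hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify(secret: bytes, timestamp: int, body: bytes,
           signature: str, max_drift: int = 300) -> bool:
    """Reject stale timestamps, then compare digests in constant time."""
    if abs(time.time() - timestamp) > max_drift:
        return False
    return hmac.compare_digest(sign(secret, timestamp, body), signature)
```

The body must be the raw bytes as delivered — re-serializing parsed JSON before signing is a common cause of the "signature mismatch" failure mode listed later.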

Subscription Model (multi-tenant)

Field Description
subscriptionId Stable UUID
tenantId Owner tenant (scopes visibility & quota)
endpointUrl HTTPS only; optional mTLS
secret HMAC secret (active)
nextSecret Staged secret for rotation
enabled true/false
eventTypes Allow-list (supports wildcards: billing.*)
filters Attribute predicates (e.g., edition in ['Enterprise'])
deliveryPolicy Retries/backoff, timeouts, concurrency, ageing window
rateLimit Per-subscription throttle (req/min)
version Schema version preference (default: latest compatible)

Delivery Semantics

  • At-least-once. Receivers must be idempotent.
  • Idempotency: Webhook-Event-Id serves as the idempotency key; consumers should persist outcomes keyed by it.
  • Retries: Exponential backoff with jitter (e.g., 1s, 4s, 16s, 64s, … up to policy cap).
  • Ageing: Stop retrying when age limit reached (default 24h); event moved to DLQ.
  • Replay: Tenants/operators can request targeted replays by eventId, time window, or filter (respects age & legal constraints).
  • Ordering: Not guaranteed across topics; best-effort per subscription if the receiver returns promptly. For strict order, recommend event-sourced consumer pattern or per-aggregate subscriptions.
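The retry policy above can be sketched as a delay generator. This uses full jitter over the growing backoff (an assumption — the spec only fixes the base schedule, cap, and 24h ageing window):

```python
import random

def retry_schedule(base: float = 1.0, factor: float = 4.0,
                   cap: float = 3600.0, max_age: float = 24 * 3600):
    """Yield jittered delays (1s, 4s, 16s, ... capped) until the age limit is reached."""
    age, delay = 0.0, base
    while age < max_age:
        jittered = random.uniform(0, min(delay, cap))  # full jitter smooths retry storms
        age += jittered
        yield jittered
        delay *= factor
```

When the generator is exhausted the delivery job moves to the DLQ, matching step 4 of the event flow.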

Webhook Management API (examples)

Create subscription

POST /v1/webhooks/subscriptions
{
  "endpointUrl": "https://acme.example.com/hooks",
  "eventTypes": ["tenant.created.v1", "billing.subscription.*"],
  "filters": ["edition in ['Enterprise','Standard']"],
  "deliveryPolicy": { "timeoutMs": 5000, "maxAttempts": 10, "maxAge": "24h" },
  "rateLimit": { "rpm": 600 }
}

Rotate secret (staged)

POST /v1/webhooks/subscriptions/{id}:rotate
// Generates nextSecret; both secrets valid for N hours

Replay

POST /v1/webhooks/deliveries:replay
{
  "subscriptionId": "…",
  "from": "2025-09-28T00:00:00Z",
  "to":   "2025-09-28T23:59:59Z",
  "filters": ["type like 'billing.%'"]
}

List deliveries

GET /v1/webhooks/deliveries?subscriptionId=...&cursor=&limit=

Security Model

  • HTTPS required, optional mTLS for high-trust partners.
  • HMAC-SHA256 signatures; body canonicalized as raw bytes.
  • Secret Rotation: dual-secret window (active + next). Webhook-Signature includes key id when dual-valid to disambiguate.
  • IP allow-lists (optional) and DNS pinning (resolve at delivery) to reduce SSRF risks.
  • Payload hygiene: No PII unless explicitly enabled by tenant policy; sensitive fields can be tokenized.

Edition & Quota Integration

  • Free: limited subscriptions (e.g., 1), low RPM, short retention.
  • Standard: moderate RPM, standard retention, replay allowed.
  • Enterprise: higher RPM/concurrency, longer retention, mTLS support, dedicated IPs.
  • Custom: configurable caps per product/contract.

Quota enforcement happens in Webhook Manager and Dispatcher; gateway surfaces headers:

  • X-WebhookRate-Limit, X-WebhookRate-Remaining, X-WebhookRate-Reset.

Observability

  • Tracing: Each delivery attempt creates a span: webhook.deliver; attributes: tenantId, subscriptionId, endpointHost, attempt, statusCode, latencyMs.
  • Metrics:
    • webhook_deliveries_total{status=2xx/4xx/5xx}
    • webhook_retry_attempts_total
    • webhook_latency_ms (histogram)
    • webhook_signature_failures_total
    • webhook_replay_requests_total
  • Logs: Structured, with redacted URLs/secrets, and correlation to source domain event via sourceEventId.

Failure Modes & Playbooks

Failure Symptom Action
Signature mismatch 401/403 at receiver Verify timestamp skew, secret, canonicalization; rotate secret if suspected leak
Persistent 5xx DLQ growth Backoff increase, notify tenant, open incident with receiver owner
Slow receiver Timeouts, high latency Apply per-subscription timeout; recommend receiver queueing
Endpoint drift NXDOMAIN/SSL error Suspend subscription; alert admin; require endpoint verification
Replay storm Burst load on receiver Rate limit replays; allow windowed replay; communicate schedule

Consumer Guidance (best practices)

  • Verify signatures and timestamp freshness before processing.
  • Idempotent handlers using Webhook-Event-Id; store processed IDs for TTL ≥ ageing window.
  • Respond fast (≤ 2s) and offload heavy work to your own queue.
  • Return 2xx only when processing is safely enqueued; otherwise 4xx/5xx to trigger retry.
  • Keep endpoints stable; use staged rotation flows for secrets and URL changes.
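A minimal consumer handler illustrating the dedupe-then-enqueue pattern (the in-memory set and list stand in for a durable ID store and the consumer's own queue; all names are illustrative):

```python
processed = set()   # production: durable store with TTL >= the ageing window
work_queue = []     # production: the consumer's own queue

def handle_webhook(event_id: str, payload: dict) -> int:
    """Return an HTTP status: 200 only once work is safely enqueued."""
    if event_id in processed:
        return 200                 # duplicate delivery: acknowledge, do nothing
    work_queue.append(payload)     # offload heavy work; respond fast
    processed.add(event_id)
    return 200
```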

Extensibility & Reusability

  • The service is product-agnostic: event routing is driven by type, filters, and tenant policy, not by hardcoded domains.
  • Supports multiple schema versions per event type; consumers pick via subscription preference.
  • Pluggable signing providers (HMAC default; optional asymmetric signing for advanced partners).
  • Integrates easily into other platforms: expose Manager API, Dispatcher wired to their bus, and reuse Schema Catalog.

Solution Architect Notes

  • Keep the Dispatcher stateless; rely on Delivery Store for idempotency and state.
  • Treat webhooks as untrusted egress: validate destinations, cap concurrency, and isolate network egress where possible.
  • Always surface tenant and subscription context in traces and logs to debug quickly.
  • Encourage partners to adopt schema validation using the exposed Schema Catalog, and provide SDKs where adoption is strategic.

Admin Console & Tenant Portal

Purpose

Offer a reusable, multi-tenant UI that any generated SaaS product can embed or extend. The Admin Console and Tenant Portal provide self-service administration, usage/metering visibility, feature toggle management, audit exploration, and embedded observability. They enforce RBAC/ABAC, respect edition entitlements, and present a consistent UX across products.


Audience & UX Principles

  • Tenant Administrator: configure identity/SSO, manage users/roles, set flags/overrides, view invoices and audit logs.
  • Billing & Finance Operator: review subscriptions, invoices, usage reports.
  • Support Agent: read-only diagnostics, tenant health, and audit traces (PII-redacted).
  • End User: minimal profile/settings; app-specific modules only when permitted.
  • Platform Operator (SRE): operator console views (environment health, tenant routing, migration status).

UX principles: least privilege, edition-aware UI (hide/disable), zero-trust by default (no local admin bypass), explainability (why a control is disabled), fast path to observability (one click to traces/logs filtered by tenant).


UI Architecture & Embedding

Composition approach

  • Shell + Modules architecture (microfrontends optional). Shared shell provides auth, navigation, theming, telemetry, and locale.
  • Module types:

  • Core Modules (always present): Tenant Management, Users & Roles, Usage & Billing, Feature Flags & Overrides, Audit Explorer, Webhooks, Security & SSO, Observability.

  • Product Modules (pluggable): domain-specific pages contributed by each bounded context (e.g., Projects, Catalog).

Embedding patterns

  • Standalone (hosted by the platform) or Embedded (iframe/Web Component) inside product portals.
  • RBAC-aware Navigation: menu items and routes emitted by module manifests and filtered by role/entitlements.
flowchart LR
  Shell[Console Shell] --> Auth[OIDC Auth]
  Shell --> Nav[RBAC Navigation]
  Shell --> Telemetry[OTel Web SDK]
  Shell --> Core[Core Modules]
  Shell --> Product[Product Modules]

  Core --> Tenants[Tenant Management]
  Core --> Users[Users & Roles]
  Core --> Flags[Feature Flags & Overrides]
  Core --> Billing[Usage & Billing]
  Core --> Audit[Audit Explorer]
  Core --> Hooks[Webhooks]
  Core --> SSO[Security & SSO]
  Core --> Obs[Embedded Observability]
Hold "Alt" / "Option" to enable pan & zoom

RBAC Mapping (illustrative)

Area / Page tenant_admin billing_admin support_agent member operator
Tenant Profile & Residency ✓ (read)
Users & Roles ✓ (read)
Feature Flags & Overrides ✓ (read)
Usage & Billing ✓ (read) ✓ (read) ✓ (read)
Audit Explorer ✓ (PII-masked) ✓ (PII-masked)
Webhooks ✓ (read)
Security & SSO ✓ (read)
Embedded Observability ✓ (read)
Product Modules (domain) read if granted read if granted ✓ (read)

ABAC examples: region constraint for PII views (tenant.region == user.region), data-class gates for audit payloads.


Key Modules & Responsibilities

Tenant Management

  • View/update tenant profile, residency, isolation level (with guardrails).
  • Lifecycle actions: suspend/resume, export, scheduled deletion workflow.
  • Connection/partition info (read-only), migration status.

Users & Roles

  • Invite users, assign roles, JIT provisioning visibility (SCIM for Enterprise).
  • Session overview, forced logout, MFA settings (when supported).

Feature Flags & Overrides

  • Browse flags by context; evaluate flag in a what-if panel (tenant/user/edition).
  • Create tenant overrides with audit reason; kill-switches surfaced prominently.
  • Safe rollout helpers (percentage/targeting where applicable).

Usage & Billing

  • Usage charts (requests, storage, compute) with cursor pagination for raw events.
  • Subscription plan, invoice history, payment method (link-out to provider portal via ACL).
  • Quotas, rate limits; “close to limit” alerts and upgrade CTAs (edition-aware).

Audit Explorer

  • Search by actor, action, resource, traceId, and date window.
  • PII-aware rendering (tokenized values; reveal requires extra privilege).
  • Export signed bundles (JSON/CSV) with chain of custody metadata.

Webhooks

  • CRUD subscriptions, rotate secrets (dual-key), replay deliveries (windowed).
  • Delivery attempts table, filter by status code, trace to source event.

Security & SSO

  • Configure SSO (OIDC/SAML), SCIM endpoints, conditional access toggles (where provided via federation).
  • API keys/PATs (personal access tokens, if allowed), with strict scopes and expirations.

Embedded Observability

  • Embed pre-filtered dashboards (tenantId, edition).
  • Trace search (“Follow a request”) and log tail with PII redaction.
  • Error budget widgets aligned with tenant edition.

Data & Integration Flows

sequenceDiagram
  participant UI as Admin Console
  participant GW as API Gateway
  participant TEN as Tenant Service
  participant CONF as Config/Flags
  participant BILL as Billing
  participant AUD as Audit
  participant OBS as Observability

  UI->>GW: GET /v1/tenants/{id}
  GW->>TEN: AuthZ + fetch tenant
  TEN-->>GW: tenant profile (+residency, isolation)
  GW-->>UI: profile

  UI->>GW: GET /v1/flags?tenantId=...
  GW->>CONF: list flags + overrides
  CONF-->>GW: flags
  GW-->>UI: flags

  UI->>GW: GET /v1/usage?window=30d
  GW->>BILL: aggregate usage
  BILL-->>GW: usage series
  GW-->>UI: charts

  UI->>GW: GET /v1/audit?traceId=...
  GW->>AUD: query
  AUD-->>GW: results (PII-masked)
  GW-->>UI: render

  UI->>OBS: iframe embed with signed view token (tenant filter)
Hold "Alt" / "Option" to enable pan & zoom

Tenancy, Security & Privacy

  • Tenant context mandatory: every UI call carries X-Tenant-Id; ambiguous → blocked at edge.
  • Edition awareness: modules/features hidden or disabled when not entitled; tooltips explain required edition.
  • PII hygiene: redaction by default; View Sensitive Data requires elevated role + just-in-time approval with audit.
  • Audit everything: admin actions, flag changes, replay requests, SSO changes—all emit audit events with actor and reason.
  • Session security: short-lived access tokens with silent refresh; session fixation and clickjacking protections; CSP locked to allowed origins.

Observability & SLOs

  • Front-end telemetry: OTel Web SDK emits ui.action and ui.view spans; correlates with backend via traceparent.
  • User-centric SLOs
    • Console availability ≥ 99.9%
    • p95 nav-to-data render ≤ 500 ms (cached) / 1500 ms (cold)
    • Feature toggle apply-to-effective latency ≤ 10 s
  • Dashboards: per-tenant/module latency, error rates, UI Web Vitals (LCP/CLS), and admin risky-action counters.

Accessibility, Internationalization & Theming

  • Accessibility: WCAG 2.1 AA baseline; keyboard-only navigation; screen-reader labels on critical controls.
  • i18n: message catalogs; locale from user profile; number/date formatting per locale.
  • Theming: light/dark + tenant branding (logo/accent); ensure contrast ratios remain compliant.

Performance & Resilience

  • Code splitting per module; prefetch on likely nav paths.
  • Client-side caching with ETags; optimistic updates for non-critical settings.
  • Graceful degradation: if observability embedding is down, show cached summaries with a banner.

Extension Points

  • Module Manifest: JSON that declares routes, required scopes, and nav entries; the shell loads and gates them.
  • Action Hooks: before/after save for flags, webhooks, and SSO to inject product-specific validation.
  • Deep Links: shareable URLs encoding tenant context and filters (/audit?tenant=t-123&trace=...).
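A sketch of how the shell might gate routes against a module manifest — the manifest field names (`routes`, `requiredScopes`, `nav`) are assumptions for illustration:

```python
manifest = {   # illustrative module manifest; field names are assumed
    "module": "audit-explorer",
    "routes": [{"path": "/audit", "requiredScopes": ["audit.read"]}],
    "nav": [{"label": "Audit Explorer", "route": "/audit"}],
}

def visible_routes(manifest: dict, user_scopes: set) -> list:
    """The shell loads manifests and exposes only routes the caller is scoped for."""
    return [r["path"] for r in manifest["routes"]
            if set(r["requiredScopes"]) <= user_scopes]
```

This keeps the UI purely reflective: the API still enforces the same scopes, so hiding a route is UX, not security.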

Solution Architect Notes

  • Keep policy enforcement at API boundaries; the UI only reflects decisions, it doesn’t make them.
  • Treat Feature Flags as UX surface for controlled rollout; every toggle change must be observable and reversible.
  • Make embedded observability tenant-safe: scoped tokens, short TTL, and server-side filtering.

Data Lifecycle, Privacy & Compliance

Purpose

Define how data is classified, minimized, protected, retained, and erased across the platform. Establish privacy-by-design controls, tenant-aware retention/TTL, encryption standards, data-subject request (DSR) flows, and auditability. Provide compliance overlays (e.g., GDPR, HIPAA, SOC 2) that products can enable per tenant/edition or per industry pack—without code changes.


Data Classification

Class Examples Handling Rules Storage Defaults Access
Public Docs, status messages Cacheable; no PII Blob (public-read via signed URLs if needed) Unrestricted
Internal Feature metadata, templates No external exposure; redact in logs SQL/Blob Staff with role gating
Confidential Tenant config, usage metrics TLS in transit, TDE at rest; masked in logs SQL + Redis cache (TTL) Tenant-scoped; RBAC/ABAC
PII Name, email, IP, user identifiers Minimize, pseudonymize where possible; strict audit SQL (column-level protection) Least privilege, purpose-bound
Sensitive PII SSN, passport, health data Tokenize or encrypt at field level; strong purpose limits SQL with field-level encryption Restricted roles; extra approvals
Secrets API keys, webhook secrets Never in logs; KV only; rotate ≤ 90d Key Vault, no app DB Break-glass only
Audit/Legal Audit events, access logs Append-only; WORM retention; exportable SQL/Blob (immutability options) Operators/legal with controls

Principles: collect the minimum, retain the minimum, process for stated purposes only, never log raw PII or secrets.


Data Minimization & Privacy Patterns

  • Pseudonymization: replace direct identifiers with stable tokens; keep mapping in a protected table or vault.
  • Selective collection: edition/pack toggles gate collection of optional attributes (e.g., precise location).
  • PII scrubbing in telemetry: centralized log processors reject events containing PII markers; drop or tokenize.
  • Scoped queries: APIs expose aggregated/filtered views by default; raw exports require elevated role + audit reason.
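A sketch of the pseudonymization pattern: a keyed HMAC yields a stable, non-reversible token for a direct identifier (key length and token prefix are illustrative; in production the key lives in Key Vault and reversal goes through the protected mapping table):

```python
import hashlib
import hmac

def pseudonymize(identifier: str, key: bytes) -> str:
    """Derive a stable token for a direct identifier; same input, same token."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"pseud_{digest[:16]}"
```

Using a keyed HMAC rather than a plain hash prevents dictionary attacks on low-entropy identifiers such as email addresses.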

Retention & TTL Policies (defaults)

Dataset Default TTL/Retention Disposal Method Notes
Tenant profile/config Active + 90 days after deletion Hard-delete after grace Soft-delete window for recovery
Usage raw events 90 days Roll-up + delete raw Keep aggregates 13 months
Audit log 7 years (configurable) WORM archival then purge Legal hold overrides TTL
Webhook deliveries 30 days Purge Enterprise can extend (e.g., 90 days)
Job histories 30 days Purge Metrics retained separately
Backups 35 days rolling Expire snapshots Encrypted backups only

Retention can be overridden by tenant contract (e.g., healthcare pack) and by regulatory overlays. All overrides are traceable.
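The defaults in the table can be expressed as a simple policy lookup; a disposal sweep then checks TTL expiry and legal holds (dataset keys and the function shape are illustrative):

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = {            # defaults from the table above
    "usage_raw": 90,
    "webhook_deliveries": 30,
    "job_histories": 30,
    "backups": 35,
}

def disposal_due(dataset: str, created_at: datetime,
                 now: datetime, legal_hold: bool = False) -> bool:
    """A record is purgeable once its TTL has elapsed, unless a legal hold applies."""
    if legal_hold:
        return False          # legal hold always overrides TTL
    ttl = timedelta(days=RETENTION_DAYS[dataset])
    return now - created_at >= ttl
```

Tenant-contract overrides would replace `RETENTION_DAYS` with a per-tenant policy lookup, with every override traceable as the text requires.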


Encryption & Key Management

  • In transit: TLS 1.3 everywhere; mTLS inside the cluster/mesh.
  • At rest: TDE for SQL/Blob; CMEK (customer-managed) optional per tenant/region.
  • Field-level: Sensitive PII encrypted with per-field keys or envelope encryption; keys in Key Vault.
  • Rotation: keys/secrets rotated ≤ 90d; certs tracked with expiry alerts; dual-key windows during rotation.
  • No plaintext secrets in app configs or logs; secrets read at runtime via workload identity.

DSR (Data Subject Requests) & Erasure Flows

Supported requests: access/export, rectification, deletion (erasure), restriction (hold), portability.

sequenceDiagram
  participant R as Requestor (Tenant Admin/User)
  participant AC as Admin Console
  participant GW as Gateway
  participant DSR as DSR Orchestrator
  participant SVC as Domain Services
  participant AUD as Audit

  R->>AC: Submit DSR (access/erasure)
  AC->>GW: POST /v1/dsr
  GW->>DSR: create_case(tenantId, subjectId, type)
  DSR->>SVC: fanout(workflow tasks per context)
  SVC-->>DSR: status/proofs (export bundle, erasure markers)
  DSR-->>AC: progress events
  DSR->>AUD: record actions (who/what/when/why)
  AC-->>R: Completion & evidence package
Hold "Alt" / "Option" to enable pan & zoom

Erasure rules

  • Soft-delete first, then scheduled hard-delete after grace window unless legal hold.
  • Propagate to caches, search indexes, and derived stores; rebuild affected aggregates.
  • Audit trail remains (immutable), but direct identifiers are tokenized or replaced with non-reversible references.

Compliance Overlays (enable per tenant/product)

Overlay Scope Key Controls
GDPR EEA/EU residents DSR flows, consent records, data minimization, cross-border transfer safeguards
HIPAA PHI in US healthcare BAAs, audit controls, access logs, encryption, breach notifications
SOC 2 Trust Services Criteria Change management, access control, incident response, monitoring
PCI DSS Payments No PAN storage (tokenize via provider); provider ACL; quarterly scans
Data Residency Contractual Region-locked storage and processing, DR within region pair only

Overlays are policy bundles (configs + CI checks + runtime gates) and can be attached by edition/industry pack.


Policy Enforcement Points

Layer Enforcement
Edge/Gateway Consent headers, tenant residency checks, geo-fencing, edition policy hints
Service APIs RBAC/ABAC checks for data classes; PII-safe serializers; explicit export endpoints
Domain Logic Purpose-bound processing; blockers for collecting unnecessary attributes
Repositories RLS/tenant guards; column protection; query allow-lists
Pipelines/CI Static scans for PII in code/tests; schema linters; policy-as-code
Observability PII detectors in logs; trace attributes marking data class; audit correlations

Auditability & Evidence

  • Immutable audit: append-only events for access, changes, exports, erasures, consent updates (actor, reason, traceId).
  • Evidence bundles: machine-readable (JSON) + human-readable (PDF/CSV) export for DSR/compliance audits.
  • Chain of custody: signed exports with hash manifest; storage with retention & legal hold.

Consent & Transparency

  • Consent records: versioned, tenant-scoped; changes audited with reason.
  • Notices: data use and tracking disclosures in product UIs; link to privacy policy versions.
  • Telemetry consent: opt-in/opt-out flags where legally required; defaults by region.

Observability & SLOs (privacy-centric)

Signal Target Notes
PII-in-logs detections 0 Fail CI and alert in prod
DSR completion time (access/erase) ≤ 30 days (legal max), target ≤ 7 days Per tenant
Key/cert rotation on time 100% within policy Alerts at 15/7/3 days pre-expiry
Residency violations 0 Block at edge; SEV-1 if detected
Audit write success 99.99% Buffer & backpressure during spikes

Failure Modes & Mitigations

  • Leak of PII in logs → Real-time detector trips; block sink; open incident; purge/redact; root-cause and tests added.
  • Erasure partial → Re-run workflow on failed contexts; maintain compensation tasks for reindexing/caches.
  • Residency drift (cross-region write) → Gate at edge; quarantine records; migrate back and audit.
  • Key rotation failure → Fall back to previous valid key in dual window; open incident; rotate manually.

Solution Architect Notes

  • Treat classification tags as part of the domain schema; they travel with events and telemetry.
  • Default all products to minimal collection; require explicit ADR for any expansion in data scope.
  • Keep privacy enforcement in code (not just policy docs): dedicated serializers, DTOs without PII by default, and safe logging wrappers.
  • Build DSR orchestration once and reuse across all contexts; success depends on consistent IDs and comprehensive lineage of derived data.

Quality Strategy: Testing & Verification

Purpose

Ensure that every SaaS solution generated by the factory meets a consistent quality bar before release. Define a layered testing strategy (unit, integration, contract, end-to-end, testing-in-prod), establish standards for fixtures and synthetic checks, and enforce quality gates in CI/CD pipelines. Quality must be measurable, automated, and edition/tenant-aware.


Testing Layers

Unit Tests

  • Scope: individual classes, functions, helpers.
  • Isolation: no I/O; mocks/stubs for dependencies.
  • Coverage: critical business logic, edge cases, input validation.
  • Tools: xUnit/MSTest + mocking libraries.

Integration Tests

  • Scope: service + real dependencies (DB, message bus, cache).
  • Run in ephemeral test environment or with containerized dependencies (Docker Compose).
  • Verify data persistence, event publishing, cache consistency.
  • CI: run after unit tests, on every PR.

Contract Tests

  • Consumer-Driven Contracts (CDC) for REST/gRPC/event contracts.
  • Validate request/response schema, headers (tenantId, traceId, edition), and error envelopes.
  • Ensure backward compatibility when evolving APIs.
  • Tools: Pact or custom CDC harness for events.
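
The header and compatibility rules above can be sketched as a minimal CDC-style check. This is an illustrative harness, not Pact itself; the envelope shape and field names are assumptions based on the conventions in this section.

```python
# Illustrative consumer-driven contract check for an event envelope.
# Required headers (tenantId, traceId, edition) follow this section;
# the payload schema is a hypothetical example.

REQUIRED_HEADERS = {"tenantId", "traceId", "edition"}

def verify_event_contract(envelope: dict, consumer_fields: set[str]) -> list[str]:
    """Return a list of contract violations (empty list = compatible)."""
    violations = []
    missing_headers = REQUIRED_HEADERS - envelope.get("headers", {}).keys()
    if missing_headers:
        violations.append(f"missing headers: {sorted(missing_headers)}")
    # Backward compatibility: every field the consumer relies on must still exist.
    payload_fields = set(envelope.get("payload", {}).keys())
    removed = consumer_fields - payload_fields
    if removed:
        violations.append(f"fields removed from payload: {sorted(removed)}")
    return violations

envelope = {
    "headers": {"tenantId": "t-42", "traceId": "abc", "edition": "Standard"},
    "payload": {"subscriptionId": "s-1", "plan": "Standard"},
}
assert verify_event_contract(envelope, {"subscriptionId"}) == []
assert verify_event_contract(envelope, {"subscriptionId", "price"}) != []
```

A real CDC suite would load consumer expectations from published pact files rather than inline sets.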

End-to-End (E2E) Tests

  • Full user flows (sign-up → tenant onboarding → subscription → flag evaluation).
  • Run against preview environments per PR and nightly in staging.
  • Browser/UI tests for tenant portal & admin console (Playwright/Selenium).
  • Must verify multi-tenancy invariants (no cross-tenant leaks).

Testing in Production (TiP)

  • Synthetic checks: simulate user sign-in, onboarding, and critical APIs continuously from multiple regions.
  • Canary releases verified by synthetic flows before general rollout.
  • Guardrails: PII-free test tenants with synthetic data only.

Synthetic Checks & Monitoring

  • Synthetic tenants: seeded with representative data (usage, billing, flags, notifications).
  • Continuous checks:
    • Auth & token issuance
    • Tenant onboarding latency
    • Flag evaluation correctness
    • Billing usage metering
    • Webhook delivery (to test sink)
  • Alerting: failures trigger incident; linked to error budgets.
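
The continuous-check loop can be reduced to a simple shape: run each probe for a synthetic tenant, collect failures, and hand them to the incident path. The harness below is a hypothetical sketch; check names mirror the list above.

```python
# Hypothetical synthetic-check harness: each check is a callable returning
# True/False; failed names would open an incident (here just collected).

from typing import Callable

def run_synthetic_checks(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run all checks for a synthetic tenant; return names of failed checks."""
    return [name for name, check in checks.items() if not check()]

checks = {
    "auth_token_issuance": lambda: True,
    "flag_evaluation": lambda: True,
    "webhook_delivery": lambda: False,  # simulated failure -> incident
}
assert run_synthetic_checks(checks) == ["webhook_delivery"]
```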

Test Data & Fixtures

  • Fixture library with tenant archetypes (SMB, Mid-market, Enterprise).
  • Default seeds: product editions (Free/Standard/Enterprise), feature flags, subscriptions.
  • Synthetic PII: generated datasets for GDPR-safe testing (never real customer data).
  • Replayable scenarios: standard JSON/SQL seeds for onboarding, billing events, AI orchestration.
  • Secrets: fake keys/tokens for tests; real secrets only in prod via Key Vault.
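
A fixture factory for the tenant archetypes above might look like the following sketch. Seed values, helper names, and the `synthetic-*` ID convention are illustrative assumptions.

```python
# Sketch of a fixture factory for tenant archetypes (SMB/Mid-market/Enterprise).
# Edition names mirror this section; user counts and flags are hypothetical.

ARCHETYPES = {
    "SMB":        {"edition": "Free",       "users": 5,    "flags": {"beta": False}},
    "Mid-market": {"edition": "Standard",   "users": 100,  "flags": {"beta": False}},
    "Enterprise": {"edition": "Enterprise", "users": 2000, "flags": {"beta": True}},
}

def seed_tenant(archetype: str, tenant_id: str) -> dict:
    """Return a replayable, PII-free tenant seed for tests."""
    base = ARCHETYPES[archetype]
    return {
        "tenantId": tenant_id,  # synthetic-* IDs keep prod filtering easy
        "edition": base["edition"],
        "users": [f"user-{i}@example.test" for i in range(base["users"])],
        "featureFlags": dict(base["flags"]),
        "secrets": {"apiKey": "fake-key-for-tests-only"},
    }

seed = seed_tenant("SMB", "synthetic-smb-001")
assert seed["edition"] == "Free" and len(seed["users"]) == 5
```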

Quality Gates in CI/CD

  • Unit test coverage: ≥ 80% for critical domains, reported in PR.
  • Integration/contract tests: must pass on PR merge; CDC verified against consumers.
  • Static analysis: code quality, security scans (PII detection, secret scanning).
  • E2E smoke tests: run on preview env per PR; block merge if failures.
  • Performance regression checks: latency budgets per context enforced in load tests.
  • Observability smoke tests: ensure traces/logs/metrics present in preview env.

Observability of Quality

  • Dashboards show test pass rate trends, time-to-fix broken builds, and coverage by bounded context.
  • Trace synthetic flows with special tenant IDs (synthetic-*) for easy filtering.
  • Alerts if coverage drops below baseline or if repeated flaky tests exceed threshold.

Risks & Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Flaky E2E tests | Erode trust, block CI | Quarantine flaky tests; retry-once policy; root-cause analysis required |
| Synthetic checks masking real issues | Blind spots | Expand test tenant archetypes; rotate scenarios regularly |
| Contract drift | Breaking downstream consumers | Enforce CDC tests in CI; require ADR for breaking changes |
| PII in test data | Compliance risk | Always use synthetic/fake data; automated scanners in CI |

Solution Architect Notes

  • Build one reusable test harness per layer (unit, integration, CDC, E2E) and apply it across all generated services.
  • Treat synthetic tenants as production monitoring assets: they validate real infrastructure and SLOs continuously.
  • Quality is enforced by automation, not heroics; broken builds must be fixed before merge.
  • Integrate quality dashboards into the Admin Console for operator visibility (test status, SLO compliance, error budgets).

CI/CD & Release Strategy

Purpose

Define a standardized, reusable CI/CD blueprint for all SaaS products generated by the factory. The strategy ensures consistent build → test → promote → release flows with policy gates, preview environments, artifact signing, and progressive delivery patterns. It provides ready-to-use pipeline templates in Azure DevOps, extendable to GitHub Actions or other CI/CD platforms.


Branching & Source Strategy

  • Main Branch (main): Always releasable; mirrors the production environment.
  • Feature Branches (feature/*): Used for development; PRs required for merge; auto-deploy a preview environment.
  • Release Branches (release/*): Optional; for staged rollouts, LTS, or regulated products.
  • Hotfix Branches (hotfix/*): Patch critical issues in production; merge back to main and any active release branches.

Policy: All merges to main require passing quality gates (tests, security scans, coverage, lint).


Pipeline Topology

Stages

  1. Build

    • Restore dependencies, compile, lint, unit test, coverage report.
    • Build container images (multi-arch if needed).
    • Sign artifacts and produce SBOM (Software Bill of Materials).
  2. Test

    • Integration + contract tests against containerized dependencies (SQL, Service Bus, Redis).
    • Security scans (SAST, dependency CVE checks, IaC validation).
    • Quality gates: block on failing tests, PII detectors, or critical vulnerabilities.
  3. Package & Publish

    • Push signed images to Azure Container Registry (ACR).
    • Publish Helm charts, IaC bundles (Bicep/Pulumi), and API contracts (OpenAPI/Protobuf).
  4. Deploy (Preview/Stage/Prod)

    • Deploy via Helm to ACA/AKS depending on target.
    • Preview environments per PR (auto-provisioned; ephemeral).
    • Stage mirrors production topology for final validation.
  5. Release

    • Progressive delivery patterns (blue/green, canary).
    • Feature flag toggles for dark launches.
    • Automatic rollback if error budget breached.
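
The five stages above map naturally onto a multi-stage pipeline definition. The YAML below is an illustrative Azure DevOps skeleton; stage names, variable names, and task contents are placeholders, not the factory's actual template.

```yaml
# Illustrative multi-stage skeleton; details are hypothetical placeholders.
stages:
  - stage: Build
    jobs:
      - job: BuildAndUnitTest
        steps:
          - script: dotnet build && dotnet test --collect:"XPlat Code Coverage"
  - stage: Test
    dependsOn: Build
    jobs:
      - job: IntegrationAndContract
        steps:
          - script: docker compose up -d && dotnet test ./tests/Integration
  - stage: PackagePublish
    dependsOn: Test
    jobs:
      - job: PushSignedImage
        steps:
          - script: docker push $(acrName).azurecr.io/app:$(Build.SourceVersion)
  - stage: DeployStage
    dependsOn: PackagePublish
    jobs:
      - deployment: DeployToStage
        environment: stage
        strategy:
          runOnce:
            deploy:
              steps:
                - script: helm upgrade --install app ./charts/app
  - stage: Release
    dependsOn: DeployStage
    jobs:
      - job: ProgressiveRelease
        steps:
          - script: echo "canary / blue-green promotion via rollout controller"
```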

Artifact Flow

```mermaid
flowchart LR
  Code[Source Repo] -->|PR| Build
  Build --> Test
  Test --> Package
  Package --> ACR[(Azure Container Registry)]
  Package --> Helm[Helm Charts Repo]
  ACR --> StageEnv[Staging Env]
  Helm --> StageEnv
  StageEnv --> ProdEnv[Production Env]
  ProdEnv --> Users[Tenants]
```
  • Immutable artifacts: all builds produce signed, versioned images (app:1.2.3+commit.sha).
  • SBOM & provenance: stored in artifact registry; verifiable against policy gates.

Preview Environments

  • Every PR spins up an ephemeral namespace with isolated ingress (pr-123.factory.example.com).
  • Uses feature branch images + seeded synthetic tenants.
  • Destroyed automatically on PR close/merge.
  • Includes OTel telemetry, synthetic flows, and seeded test data.

Policy Gates

Enforced in pipelines and at merge time

  • Linting & Static Analysis: must pass with no critical issues.
  • Unit/Integration Coverage: ≥ 80% for critical services.
  • SAST & Dependency Scans: block on CVEs above CVSS 7.
  • Contract Tests: must pass against consumer suites.
  • Compliance Checks: PII/secret scanning, license compliance.
  • Artifact Signing: unsigned artifacts rejected downstream.
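
Gate evaluation boils down to a pure function over build metadata. The sketch below encodes three of the gates above (coverage ≥ 80%, block CVSS > 7, reject unsigned artifacts); the build-record shape is an assumption.

```python
# Simplified policy-gate evaluation; thresholds mirror this section.
# A real pipeline would read these from scanner and coverage reports.

def evaluate_gates(build: dict) -> list[str]:
    """Return gate failures for a build; an empty list means promotable."""
    failures = []
    if build["coverage"] < 0.80:
        failures.append("coverage below 80%")
    if any(cve["cvss"] > 7.0 for cve in build["cves"]):
        failures.append("critical CVE present")
    if not build["signed"]:
        failures.append("artifact unsigned")
    return failures

good = {"coverage": 0.85, "cves": [{"cvss": 4.2}], "signed": True}
bad  = {"coverage": 0.85, "cves": [{"cvss": 9.8}], "signed": False}
assert evaluate_gates(good) == []
assert evaluate_gates(bad) == ["critical CVE present", "artifact unsigned"]
```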

Progressive Release Patterns

  • Blue/Green: Deploy the new version alongside the old; switch traffic via a Gateway route. Rollback = switch back.
  • Canary: Route a fraction of traffic (e.g., 5%/10%/25%) to the new version; auto-promote or roll back based on SLOs.
  • Dark Launch: Release behind a feature flag; only test tenants see the feature until the flag is flipped.
  • Ring-Based: Roll out by tenant tier (internal → Free tenants → Standard → Enterprise).
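
The canary pattern's promote-or-rollback decision can be sketched as a small controller step. Step fractions follow the 5%/10%/25% example above; the function shape is a hypothetical simplification.

```python
# Sketch of canary promotion logic: advance through traffic fractions while
# SLOs hold, roll back otherwise.

CANARY_STEPS = [0.05, 0.10, 0.25, 1.00]

def next_canary_action(current_step: int, slo_healthy: bool) -> tuple[str, float]:
    """Return (action, traffic_fraction) for the rollout controller."""
    if not slo_healthy:
        return ("rollback", 0.0)
    if current_step + 1 < len(CANARY_STEPS):
        return ("promote", CANARY_STEPS[current_step + 1])
    return ("complete", 1.0)

assert next_canary_action(0, True) == ("promote", 0.10)
assert next_canary_action(1, False) == ("rollback", 0.0)
assert next_canary_action(3, True) == ("complete", 1.0)
```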

Environment Topology

  • Dev: local + CI ephemeral test containers.
  • Preview: per PR; ephemeral; seeded with synthetic tenants.
  • Stage: mirrors Prod, stable integration tests; nightly E2E.
  • Prod: multi-region active/active for critical services, edition-aware quotas.

Observability in CI/CD

  • Pipelines emit telemetry (duration, pass/fail, flaky tests).
  • Deployment traces tagged with commit SHA, PR ID, and tenant context for preview.
  • Release dashboards: SLO compliance, error budgets, rollout progress.

Risks & Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Long-lived preview envs | Cost, stale tenants | TTL auto-cleanup, manual extend flag |
| Flaky tests blocking merges | Slows releases | Retry-once policy, flaky test quarantine |
| Progressive release misconfig | Partial outage | Automated rollback, rollback runbooks |
| Supply chain attack | Compromise of dependencies | Artifact signing + SBOM + dependency scanning |

Solution Architect Notes

  • Treat pipelines as product templates: every generated SaaS inherits the same CI/CD skeleton.
  • Keep artifact promotion immutable: rebuilds are not allowed between Stage and Prod.
  • Bake error budgets into progressive rollout logic (auto-stop rollout if violated).
  • Preview environments are not optional—every PR must deploy to ensure shift-left validation.

Infrastructure & Platform Runtime

Purpose

Define a portable, Azure-first runtime topology for the SaaS Factory, covering compute (AKS/ACA), messaging (Service Bus), data (Azure SQL + optional Mongo/Redis/Blob), identity (Managed Identities, Key Vault), networking (private endpoints, WAF), and observability (OTel → Prometheus/Grafana/Logs). This baseline is multi-tenant aware, enforces zero trust, and is designed to be materialized via IaC (Pulumi/Bicep).


Resource Topology (high-level)

```mermaid
flowchart TB
  subgraph Edge
    AFD[Azure Front Door + WAF]
  end

  subgraph App
    AKS[(AKS)]:::core
    ACA[(Azure Container Apps)]:::core
    GATE["API Gateway (YARP)"]
    SVC[Core Microservices]
    JOBS[Jobs/KEDA/Hangfire]
  end

  subgraph Data
    SQL[(Azure SQL PaaS)]
    MONGO[(MongoDB Atlas/AKS opt)]
    REDIS[(Azure Cache for Redis)]
    SB[(Azure Service Bus)]
    BLOB[(Azure Blob Storage)]
    KV[(Azure Key Vault)]
  end

  subgraph Observability
    OTL[OTel Collectors]
    LOGS[Log Analytics / Storage]
    GRAF[Prometheus/Grafana]
  end

  AFD -->|TLS 1.3| GATE
  GATE -->|mTLS| SVC
  SVC -->|AMQP| SB
  SVC -->|ADO.NET| SQL
  SVC -->|Redis| REDIS
  SVC -->|Blob SDK| BLOB
  SVC -.-> MONGO
  JOBS --> SB
  AKS --> OTL
  ACA --> OTL
  OTL --> LOGS
  OTL --> GRAF
  SVC --> KV
  GATE --> KV

  classDef core fill:#0b6,stroke:#094,color:#fff;
```

Compute Runtime

AKS (Kubernetes)

  • Best for high scale, service mesh (mTLS), advanced network policies, sidecars (e.g., OTel, authz), and custom autoscaling.
  • Use managed identities for pods, CNI with Calico (or Azure CNI), and PodSecurity standards (baseline/restricted).

ACA (Azure Container Apps)

  • Simpler ops; KEDA-native autoscaling on HTTP concurrency, CPU, and Service Bus queue lag.
  • Ideal for jobs, event processors, and smaller footprints.
  • Can run alongside AKS (hybrid) for jobs plane and burst capacity.

Baseline rule: same container images and contracts run on both; runtime choice is an operational concern per product/environment.


Networking & Identity

  • Perimeter: Azure Front Door + WAF terminates public TLS; only the Gateway is internet-exposed.
  • Private Networking: services/data in private subnets; deny by default.
  • Private Endpoints: Azure SQL, Service Bus, Storage, and Key Vault via Private Link; no public exposure.
  • mTLS Everywhere: Gateway↔Services and Service↔Service; cert rotation automated.
  • Workload Identity: AKS/ACA workloads use Managed Identities; no static secrets in pods.
  • Egress Control: user-defined routes + firewall for outbound; webhook egress allow-lists.

Data Layer

  • Azure SQL (Primary Store)

    • Pooled tenancy by default (RLS or repository guards).
    • Geo-replication, automated backups, PITR; TDE + optional CMEK.
    • Per-context schemas with strict ownership; migrations gated in CI.
  • MongoDB (Optional)

    • For document-heavy payloads (e.g., notification templates).
    • Private peering or self-hosted on AKS with operator; encryption at rest.
  • Redis

    • Low-latency flag evaluation, token/claims cache, idempotency windows.
    • TLS-only; tenant-prefixed keys; eviction policies controlled per module.
  • Blob Storage

    • Artifacts, exports, evidence bundles, AI outputs.
    • Container per context; immutability policies for audit exports.
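
The tenant-prefixed Redis key convention mentioned above can be enforced with a small helper. The `<tenant>:<module>:<name>` scheme here is an illustrative convention, not a documented factory standard.

```python
# Tenant-prefixed cache keys keep pooled Redis data partitioned per tenant.

def cache_key(tenant_id: str, module: str, name: str) -> str:
    """Build a namespaced key: <tenant>:<module>:<name>."""
    for part in (tenant_id, module, name):
        if ":" in part or not part:
            raise ValueError(f"invalid key part: {part!r}")
    return f"{tenant_id}:{module}:{name}"

assert cache_key("t-42", "flags", "beta-ui") == "t-42:flags:beta-ui"
```

Rejecting `:` inside parts guarantees keys parse back unambiguously, which matters for per-tenant eviction and auditing.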

Messaging Layer

  • Azure Service Bus
    • Topics/queues per bounded context; DLQs enabled; MaxDeliveryCount tuned per workload.
    • MassTransit implements outbox/inbox, retries with jitter, and saga orchestration.
    • Namespace partitioning (prod vs non-prod; per region).
    • Private Endpoints; role assignments to workloads via managed identity.

Observability Stack

  • OpenTelemetry Collectors as DaemonSet/sidecar (AKS) or as ACA app for ingestion.
  • Metrics to Prometheus (managed or self-hosted), dashboards in Grafana.
  • Logs to Log Analytics + long-term archive to Blob.
  • Trace context propagated via W3C (traceparent), with mandatory attributes (tenantId, edition).
  • Alerting: SLO-based alerts wired to incident channels; error budgets shown in dashboards and Admin Console.

Security & Compliance Baselines

  • Zero Trust: no implicit trust; all traffic authenticated and authorized.
  • Key Vault for secrets, keys, and certs; automated rotation; Just-In-Time access for operators.
  • Container security: image signing/SBOM; policy gates deny unsigned or CVE-violating images.
  • Data residency: per-tenant region enforced at edge; DR within paired region only.
  • Audit: append-only audit store; WORM policies for exports.

Environments & Promotion

  • Dev/Preview: ephemeral namespaces per PR; seeded synthetic tenants.
  • Stage: mirrors prod; shadow/canary experiments; contract & load testing.
  • Prod: multi-zone AKS/ACA; optional active-active across regions for critical services.
  • Promotion: immutable artifacts from ACR; Helm/OPA policies guard releases.

Autoscaling & Resilience

  • HTTP services: HPA (AKS) on CPU/RPS/p95 latency; ACA on RPS/queue lag.
  • Workers: KEDA triggers on Service Bus depth/lag; DLQ replayers isolated.
  • Resilience policies: timeouts, retries (idempotent), circuit breakers, bulkheads defined per dependency.
  • Chaos: scheduled chaos experiments in non-prod; steady-state SLOs verified.

Resource Graph (baseline components)

| Area | Azure Resource |
| --- | --- |
| Edge | Front Door Standard/Premium, WAF Policy, Public DNS |
| Compute | AKS (node pools per workload class), ACA Environment |
| Networking | VNets, Subnets, Private DNS Zones, Firewall/UDR |
| Identity | Managed Identities, Key Vault |
| Messaging | Service Bus Namespace (Topics/Queues, DLQ) |
| Data | Azure SQL (Geo-repl), Redis, Blob Storage, optional Mongo |
| Observability | Log Analytics, Managed Prometheus/Grafana, OTel Collectors |
| Security | Defender for Cloud policies, Microsoft Entra ID Conditional Access (for ops) |

IaC Strategy (Pulumi-first; Bicep optional)

  • Stacks per environment: dev, stage, prod, with region parameters and SKU right-sizing.
  • Composition: core platform stack (network, identity, observability) + workload stacks (bus, data, compute).
  • Policy-as-code: OPA/Conftest in CI to validate IaC against security/compliance baselines.
  • Outputs: connection endpoints via Private Link, DNS zones, MI principal IDs, and signed URLs for bootstrap.
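
Policy-as-code validation of IaC output can be expressed as an OPA/Conftest rule. The fragment below is an illustrative sketch; the input shape (`input.resources`, `publicNetworkAccess`) is an assumption about the rendered plan, not a fixed factory schema.

```rego
# Illustrative Conftest policy: deny any data-plane resource that enables
# public network access (Private Link required per the security baseline).
package main

deny[msg] {
  resource := input.resources[_]
  resource.properties.publicNetworkAccess == "Enabled"
  msg := sprintf("%s must use Private Link, not public access", [resource.name])
}
```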

Failure Modes & Recovery

  • Regional Service Bus degradation → autoscale consumers, buffer at producers with backpressure; consider cross-namespace failover for enterprise-grade products.
  • SQL hotspot → promote tenant to schema/DB; enable read replicas for heavy reads; cache hot lookups in Redis.
  • Key Vault throttling → application-side caching with TTL; staggered rotation windows.
  • Front Door anomaly → failover to secondary profile; serve maintenance page; keep Admin APIs internal-only until edge recovers.

Solution Architect Notes

  • Start small with ACA where possible; graduate to AKS for mesh, sidecars, or complex multi-tenant scaling.
  • Keep Private Link ubiquitous—do not allow public endpoints on data plane resources.
  • Make observability a dependency: deploy OTel & dashboards before tenant-facing services.
  • Enforce workload identity end-to-end; any secret-based exception must be time-bound with an explicit waiver and monitoring.

FinOps & Cost Governance

Purpose

Ensure the SaaS Factory produces solutions that are cost-efficient, scalable, and predictable across tenants and editions. This requires embedding quotas, autoscaling rules, budget alerts, and cost dashboards into the baseline platform. By enforcing edition-aware policies, the factory guarantees fairness, prevents noisy-neighbor risks, and aligns operating costs with business revenue models.


FinOps Principles

  • Cost visibility by tenant/edition: every workload and resource tagged with tenantId, edition, productId.
  • Guardrails over gates: developers can ship features quickly but cannot bypass budget alerts or quota checks.
  • Edition-aware scaling: Enterprise tenants receive higher quota ceilings; Free tenants constrained by default.
  • Performance and cost together: every feature must have a defined capacity model and cost impact.

Capacity & Scaling Model

  • Compute (AKS/ACA)
    • Autoscale on CPU, RPS, and queue depth (KEDA).
    • Quotas mapped to edition (e.g., Free = 1 vCPU cap, Enterprise = 8 vCPU burst).
  • Messaging (Service Bus)
    • Namespace throughput units allocated by tenant tiers.
    • DLQ growth alerts trigger cost/performance investigation.
  • Data (SQL, Blob, Redis)
    • Tenant-level caps (rows, storage GB) with upgrade paths.
    • Elastic pools for pooled tenants; isolated instances for Enterprise.
  • Observability
    • Log retention tiered: Free = 7d, Standard = 30d, Enterprise = 90d+.
    • High-cost metrics (e.g., detailed tracing) reserved for Enterprise or by opt-in flag.

Edition-Aware Quotas

| Resource | Free | Standard | Enterprise |
| --- | --- | --- | --- |
| API Rate Limit (req/min) | 60 | 600 | 3000 |
| Storage (GB) | 1 | 50 | 500+ |
| Concurrent Jobs | 5 | 50 | 200 |
| Webhooks | 1 sub, 7d logs | 5 subs, 30d logs | 20 subs, 90d logs |
| Observability Retention | 7 days | 30 days | 90 days |

Quotas are enforced at gateway, service API guards, and DB-level policies.
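
At the gateway tier, the rate-limit row of the quota table reduces to a simple guard. The sketch below hard-codes the req/min values from the table; a real gateway would read them from the edition registry.

```python
# Edition-aware rate-limit check mirroring the quota table above.

RATE_LIMITS = {"Free": 60, "Standard": 600, "Enterprise": 3000}

def allow_request(edition: str, requests_this_minute: int) -> bool:
    """Gateway-level guard: reject once the per-minute quota is consumed."""
    return requests_this_minute < RATE_LIMITS[edition]

assert allow_request("Free", 59) is True
assert allow_request("Free", 60) is False
assert allow_request("Enterprise", 2999) is True
```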


Cost Dashboards & Tagging

  • Tagging policy: all resources tagged with env, product, tenantId, edition.
  • Dashboards: Azure Cost Management + Grafana integration, filtered by product/tenant/edition.
  • KPIs tracked:
    • Cost per tenant per month
    • Cost per active user (CPU hours / requests)
    • Cost of observability (log/metric ingestion rates)
    • Idle resource cost vs. active usage

Budget Alerts & Guardrails

  • Budgets set per environment (Dev/Stage/Prod) and per product.
  • Alerts triggered at 50/75/90/100% thresholds.
  • Automated Slack/Teams notifications to product teams.
  • Stopgap policies: throttle non-critical workloads if spend breaches thresholds in Free/Trial environments.

Performance Budgets

  • Each service defines baseline throughput, latency, and cost envelope.
  • Perf tests run as part of release process:
    • Tenant onboarding latency ≤ 60s (p95)
    • API p95 latency ≤ 200 ms (reads) / 350 ms (writes)
    • Cost per 1000 requests within target band (< $0.05 for pooled tenants)
  • Perf regression gates in CI/CD: load tests in stage environment with synthetic tenants; fail build if >10% degradation.
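
The ">10% degradation" gate is a one-line comparison against a stored baseline; the sketch below shows it for p95 latency. Metric names and the baseline source are assumptions.

```python
# Perf regression gate: fail the build when p95 latency degrades more than
# 10% against the stored baseline (threshold from this section).

def regression_gate(baseline_p95_ms: float, current_p95_ms: float,
                    max_degradation: float = 0.10) -> bool:
    """Return True if the build passes (degradation within budget)."""
    return current_p95_ms <= baseline_p95_ms * (1 + max_degradation)

assert regression_gate(200.0, 215.0) is True    # +7.5% — within budget
assert regression_gate(200.0, 230.0) is False   # +15% — fail the build
```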

Observability for FinOps

  • Traces include tenantId + cost-relevant metrics (CPU time, DB queries, cache hits).
  • Metrics: cost_per_tenant, quota_consumed_percent, autoscale_events.
  • Alerts:
    • quota_consumed_percent > 90% for Standard/Enterprise → notify tenant admin via portal.
    • autoscale_events > threshold → investigate cost/perf drift.

Risks & Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Noisy neighbors in pooled DB | Performance degradation | Quotas, RLS, hot-tenant promotion to schema/db |
| Cost overrun in observability | High OPEX | Tiered retention, sampling, enterprise opt-in |
| Under-provisioned Enterprise tenant | SLA breach | Autoscale headroom, proactive capacity tests |
| Free tenants gaming system | Abuse of resources | Strict quotas, rate limits, throttling policies |

Solution Architect Notes

  • Bake FinOps into templates: every service scaffold includes quota configs, cost tags, and perf test harness.
  • Expose cost visibility to tenants: portal shows per-tenant usage and cost breakdowns (transparency + upsell driver).
  • Use synthetic perf tenants in stage to continuously model cost-per-tenant at scale.
  • Cost + performance are first-class SLOs: treat regressions in either as blockers to release.

Governance, ADRs & Change Management

Purpose

Establish a structured governance model for architectural decisions, API contracts, and platform evolution. The goal is to ensure that every SaaS product generated by the factory is traceable, reviewable, and predictable in its decision-making, while allowing for safe innovation through structured change management.


ADR Process & Repository

  • Format: log4brains ADRs in Markdown (docs/adr/).
  • Structure:
    • Title & ID (incremental, 0001-title.md)
    • Status: Proposed / Accepted / Superseded / Rejected
    • Context → Decision → Consequences → Alternatives → References
  • Lifecycle:
    1. Draft ADR opened with PR.
    2. Peer review via Architecture Guild or Review Board.
    3. Accepted ADR merged → published in ADR site.
    4. If superseded, new ADR links back with rationale.
  • Templates: ADR template provided in docs/templates/adr-template.md.

Decision Lifecycle

  1. Trigger: new requirement, tech evaluation, incident, compliance need.
  2. Proposal: ADR draft with context/problem/alternatives.
  3. Review: async discussion in PR; optional Architecture Guild sync.
  4. Decision: maintainers approve/merge ADR.
  5. Implementation: tracked via linked Azure DevOps epics/features.
  6. Sunset: decision reviewed after 12–18 months, or upon major platform change.

Principles:

  • One decision per ADR.
  • ADRs are immutable once accepted (except metadata).
  • Supersession, not deletion.

Review Boards

  • Architecture Guild: senior engineers, product architects, security/privacy officers.
  • Review cadence: weekly triage; quarterly backlog review.
  • Charter:
    • Maintain decision traceability.
    • Balance innovation with stability.
    • Escalate major trade-offs (cost, compliance, security).
  • Voting model: consensus preferred, fallback majority.

Contract Versioning & Deprecation

API Contracts (REST/gRPC/events)

  • Source of truth in contracts/ folder (OpenAPI/Protobuf/JSON Schemas).
  • Versioning:
    • REST: /v1/..., /v2/...
    • gRPC: package service.v1
    • Events: eventName.v1, eventName.v2
  • Compatibility: additive changes allowed in same version; breaking changes → new major.

Deprecation & Sunset Policy

  • Announcement: contract flagged as deprecated in spec + docs.
  • Headers: Deprecation, Sunset, and Link to changelog returned in responses.
  • Timeline:
    • Minimum 12-month overlap for deprecated APIs.
    • Enterprise tenants can negotiate extended support.
    • Telemetry monitors usage of deprecated versions.
  • Removal: only after telemetry shows <1% usage + published sunset date passed.

Governance & Change Management

  • Config changes: go through Config Change Review (CCR) process, with audit trail in Git + change tickets.
  • Schema changes: must be backward compatible; validated in CI via migration tests.
  • Secrets/keys: rotated via automation; ADR required if new secret store introduced.
  • Edition policies: changes to edition entitlements logged as ADRs + documented in release notes.
  • Emergency changes: handled via expedited ADR (lightweight doc, ratify post-incident).

Observability of Decisions

  • ADR site (log4brains + MkDocs Material) embedded in developer portal.
  • Decision dashboards: show active ADRs, deprecated ones, and upcoming sunsets.
  • Traceability: every epic/feature in Azure DevOps links to one or more ADR IDs.

Risks & Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Decision sprawl | Inconsistent approaches across services | Central ADR repo + guild oversight |
| Slow reviews | Blocks delivery | Async PR reviews + time-boxed discussions |
| Deprecated APIs in use | Security/compliance gaps | Telemetry-driven enforcement; forced sunset dates |
| Lack of traceability | Regulatory audit failures | ADR portal + linked DevOps work items |

Solution Architect Notes

  • Treat ADRs as first-class artifacts: they evolve the platform as much as code.
  • Encourage lightweight but frequent ADRs — better many small scoped decisions than one bloated doc.
  • Deprecation without telemetry is a blind spot: enforce contract usage dashboards.
  • Change governance should be templatized in the factory so every new product inherits the same discipline.

Risk Register, Security Exceptions & Waivers

Purpose

Establish a centralized, continuously updated risk register for the SaaS Factory and its generated products. Define a matrix-based scoring model (likelihood × impact), provide structured mitigation plans, and standardize the handling of security exceptions and waivers. This ensures risks are visible, time-bound, and owned, while maintaining regulatory and audit readiness.


Risk Matrix Model

Scoring Dimensions

  • Likelihood: Rare (1), Unlikely (2), Possible (3), Likely (4), Almost Certain (5)
  • Impact: Negligible (1), Minor (2), Moderate (3), Major (4), Critical (5)

Risk Level = Likelihood × Impact

| Score | Level | Response |
| --- | --- | --- |
| 1–4 | Low | Accept; monitor |
| 5–9 | Medium | Mitigate or accept with waiver |
| 10–16 | High | Mitigation required, track in DevOps |
| 17–25 | Critical | Immediate remediation, exec visibility |
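
The scoring model is a pure function of the two dimensions. The sketch below encodes the matrix bands above; it is illustrative, not a prescribed implementation.

```python
# Risk scoring per the matrix above: score = likelihood x impact (each 1-5),
# mapped onto the four response bands.

def risk_level(likelihood: int, impact: int) -> tuple[int, str]:
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be 1-5")
    score = likelihood * impact
    if score <= 4:
        return score, "Low"
    if score <= 9:
        return score, "Medium"
    if score <= 16:
        return score, "High"
    return score, "Critical"

assert risk_level(3, 5) == (15, "High")    # matches R-001 in the register
assert risk_level(3, 3) == (9, "Medium")   # matches R-003
assert risk_level(5, 5) == (25, "Critical")
```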

Example Risk Register

| ID | Risk Description | Likelihood | Impact | Score | Owner | Mitigation | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| R-001 | PII leakage in logs | 3 | 5 | 15 (High) | Security Eng | OTel PII scrubbing, CI scanners, redaction lib | Mitigation in progress |
| R-002 | Multi-tenant data bleed (RLS misconfig) | 2 | 5 | 10 (High) | DB Architect | Row-level security tests, tenantId invariant, contract tests | Open |
| R-003 | Stale TLS certs (expired) | 3 | 3 | 9 (Medium) | Ops Lead | Automated cert rotation, alerts 30/7/3 days | Closed |
| R-004 | Dependency CVEs in OSS libs | 4 | 4 | 16 (High) | Eng Enablement | Renovate/Dependabot, CVE scanning in CI | Ongoing |
| R-005 | Excess observability costs (log storm) | 4 | 2 | 8 (Medium) | FinOps | Sampling, quotas, alerting | Open |

Security Exceptions & Waivers

Workflow

  1. Request: Team submits exception form with context, rationale, alternatives considered.
  2. Review: Security Board evaluates risk level and business justification.
  3. Approval: If granted, exception logged with expiry date and remediation plan.
  4. Tracking: Exception ID linked to DevOps work item; progress reviewed monthly.
  5. Expiry: Automatic reminders 30/7/1 days before expiration; must renew or close.
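
The 30/7/1-day reminder schedule in step 5 is easy to compute from the waiver's expiry date. The helper below is an illustrative sketch of that scheduling check.

```python
# Waiver expiry reminders at 30/7/1 days before expiration (per the workflow).

from datetime import date

REMINDER_DAYS = (30, 7, 1)

def due_reminders(expiry: date, today: date) -> list[int]:
    """Return which reminder thresholds fire today (days-before values)."""
    days_left = (expiry - today).days
    return [d for d in REMINDER_DAYS if d == days_left]

expiry = date(2025, 6, 30)
assert due_reminders(expiry, date(2025, 5, 31)) == [30]
assert due_reminders(expiry, date(2025, 6, 29)) == [1]
assert due_reminders(expiry, date(2025, 6, 15)) == []
```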

Waiver Metadata (template)

  • Waiver ID: SEC-WVR-2025-001
  • Requested by: Team/Owner
  • Scope: service/context affected
  • Risk reference(s): R-001, R-004
  • Justification: why no immediate mitigation possible
  • Expiry date: e.g., 90 days
  • Mitigation plan: concrete steps + timeline
  • Approver: Security Officer / Architecture Guild

Exception Categories

  • Technical Debt: Legacy library pending replacement.
  • Operational Constraint: Vendor service does not yet support mTLS.
  • Regulatory Delay: Awaiting legal guidance before applying stricter residency policy.
  • Business Urgency: Feature launch requires temporary compromise (time-boxed).

Governance & Review

  • Security & Architecture Board owns risk register reviews.
  • Monthly: review new risks, open waivers, and expired waivers.
  • Quarterly: risk posture reassessment; report to exec sponsors.
  • Audit: risk register and waiver log stored in Git (docs/governance/risks.md, docs/governance/waivers.md).

Observability of Risk Posture

  • Dashboards track:
    • Open risks by severity.
    • Exceptions expiring in next 30 days.
    • Distribution of risk categories (security, compliance, cost, operational).
  • Alerts: Slack/Teams notification for critical risk creation or waiver expiry.

Risks & Mitigations (meta-level)

| Meta Risk | Mitigation |
| --- | --- |
| “Paper-only” register (unused) | Integrate with DevOps epics; risk must have linked tasks |
| Perpetual waivers | Force expiry; auto-archive stale exceptions |
| Inconsistent scoring | Use standard matrix; train reviewers; cross-review |
| Lack of visibility | Dashboards & ADR references in portal |

Solution Architect Notes

  • Waivers are not forever: each must be tied to a remediation timeline and sunset.
  • The risk register should evolve with ADRs—a major architectural decision should consider risk impact explicitly.
  • Treat the register as live operational telemetry, not a static doc.
  • Encourage engineers to proactively raise risks; the cost of under-reporting is far higher than managing a larger register.

Rollout & Tenant Onboarding / Cutover

Purpose

Provide a standardized, low-risk rollout framework for new products, editions, and tenants. Define phased rollout strategies, onboarding workflows, migration/cutover playbooks, backout strategies, and communication templates. Ensure tenant experience is consistent, compliant, and reversible in case of failures.


Rollout Strategy

Phased Rollouts

  • Internal First: internal tenants (synthetic + staff tenants) validate production infrastructure.
  • Pilot Tenants: 3–5 selected customers (Enterprise or early adopters).
  • Regional Expansion: enable per geography or residency zone.
  • General Availability: open onboarding to all editions.

Progressive Controls

  • Feature flags gate new capabilities.
  • Canary deployments split traffic by tenant cohort or edition tier.
  • Automated rollback if SLOs violated during rollout.

Tenant Onboarding Workflows

New Tenant Onboarding

  1. Tenant Admin signs up (via portal or API).
  2. Tenant entry created in Tenant Service (pooled by default).
  3. Isolation model applied (pooled/schema/db).
  4. Default edition + entitlements assigned.
  5. Baseline config + feature flags seeded.
  6. Synthetic smoke tests run under new tenant ID.
  7. Tenant marked “active” in registry.

Migration / Cutover Flow

  • Trigger: product upgrade, edition change, residency migration.
  • Steps:
    1. Freeze writes (short downtime window if needed).
    2. Export + transform data (schema/edition-aware).
    3. Import into new store (schema/db).
    4. Validate with synthetic checks (data count, checksum, smoke tests).
    5. Switch traffic at gateway (new edition endpoints).
    6. Monitor SLOs; keep old infra on standby for rollback window.
```mermaid
sequenceDiagram
  participant Admin as Tenant Admin
  participant Portal as Portal/API
  participant TenantSvc as Tenant Service
  participant Config as Config/Flags
  participant Billing as Billing Service
  participant Audit as Audit Log
  participant Ops as Ops/SRE

  Admin->>Portal: Sign-up / Edition change
  Portal->>TenantSvc: CreateTenant/Upgrade
  TenantSvc->>Config: Seed baseline flags
  TenantSvc->>Billing: Assign subscription plan
  TenantSvc->>Audit: Record onboarding/migration
  Ops->>TenantSvc: Run synthetic tests
  TenantSvc-->>Portal: Tenant ready (status=Active)
```
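
Step 4 of the cutover flow (validate with data count and checksum) can be sketched as an order-independent comparison of the old and new stores. Row shapes and helper names below are illustrative assumptions.

```python
# Cutover validation sketch: compare row counts and checksums between old
# and new stores before switching traffic.

import hashlib
import json

def table_checksum(rows: list[dict]) -> str:
    """Order-independent checksum over canonical JSON rows."""
    digests = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in rows)
    return hashlib.sha256("".join(digests).encode()).hexdigest()

def validate_cutover(old_rows: list[dict], new_rows: list[dict]) -> bool:
    return (len(old_rows) == len(new_rows)
            and table_checksum(old_rows) == table_checksum(new_rows))

old = [{"id": 1, "plan": "Standard"}, {"id": 2, "plan": "Free"}]
new = [{"id": 2, "plan": "Free"}, {"id": 1, "plan": "Standard"}]  # order differs
assert validate_cutover(old, new) is True
assert validate_cutover(old, new[:1]) is False
```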

Migration & Cutover Runbooks

  • Pre-Migration Checklist

    • Validate backups + PITR windows.
    • Confirm tenant metadata (ID, edition, residency).
    • Schedule migration window with comms.
    • Run dry-run migration in staging.
  • Cutover Checklist

    • Freeze tenant write operations.
    • Take backup/snapshot.
    • Run export → import pipeline.
    • Validate data integrity + run smoke tests.
    • Switch DNS/gateway route.
    • Monitor telemetry dashboards for anomalies.
  • Backout Plan

    • Rollback DNS/gateway to old infra.
    • Restore data snapshot.
    • Notify tenant of rollback and ETA for retry.

Communication Templates

  • Pre-Onboarding Email: “Your tenant environment is being provisioned. Expect ~15 minutes before services are active. You will receive confirmation when onboarding completes.”

  • Migration Notice: “We are upgrading your tenant to a new edition. A brief read-only window will occur from 02:00–02:30 UTC. Data integrity and continuity are guaranteed. In case of issues, we will revert within 30 minutes.”

  • Completion Notice: “Migration successful. All services are active. Please validate your workflows. If you encounter any issues, contact support with your trace ID.”

  • Rollback Notice: “We reverted your tenant migration due to anomalies detected. Services remain available on the prior edition. We will reschedule and notify you once stability is confirmed.”


Observability & Readiness

  • Readiness checks: synthetic tenant login, flag evaluation, billing call, event publishing, audit log write.
  • Telemetry dashboards: show onboarding duration, failure rate, rollback events.
  • SLOs:
    • Onboarding p95 ≤ 10 minutes.
    • Migration success ≥ 99.5% without manual intervention.
    • Rollback readiness ≤ 30 minutes.
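The readiness gate and the onboarding SLO above can be expressed as simple checks. The sketch below is illustrative only: the check names match the bullet list, and the percentile uses the nearest-rank method, which is an assumption rather than the platform's defined aggregation.

```python
import math

def evaluate_readiness(check_results):
    """check_results: check name -> bool. Returns (ready, failed check names)."""
    failed = sorted(name for name, ok in check_results.items() if not ok)
    return (not failed, failed)

def p95(durations_min):
    """Nearest-rank 95th percentile of onboarding durations in minutes."""
    ordered = sorted(durations_min)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

# A tenant is marked Active only when every synthetic check passes.
checks = {
    "synthetic_login": True,
    "flag_evaluation": True,
    "billing_call": True,
    "event_publishing": True,
    "audit_log_write": True,
}
ready, failed = evaluate_readiness(checks)
```

A dashboard query would then compare `p95(recent_onboarding_durations)` against the 10-minute target and alert on breach.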

Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Data mismatch post-migration | Tenant disruption | Checksums + synthetic tests before traffic cutover |
| Rollback failure | Extended outage | Immutable backups, DNS-based rollback, ops runbooks |
| Communication gaps | Customer dissatisfaction | Standard comms templates; automated notifications |
| Edition misassignment | Wrong entitlements | Config seeding automation; validation against edition matrix |

Solution Architect Notes

  • Treat onboarding as code: workflows defined in automation (Pipelines/Functions), not manual ops.
  • Every migration must have a rollback plan validated in non-prod.
  • Prefer progressive cutover (tenant cohorts) over “big bang” migrations.
  • Expose self-service onboarding APIs but enforce strict tenancy validation and audit logging.
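"Onboarding as code" with strict tenancy validation and audit logging can be sketched as one function that mirrors the sequence diagram earlier in this section (create tenant, seed flags, assign plan, audit). The service interfaces and the tenant-id rule below are hypothetical stand-ins for illustration, not the factory's real APIs.

```python
import re

# Illustrative tenant-id rule; the real platform would define its own format.
TENANT_ID = re.compile(r"^[a-z0-9][a-z0-9-]{2,30}$")

class Recorder:
    """Stand-in for the Config/Billing/Audit services; records calls."""
    def __init__(self):
        self.calls = []
    def seed_baseline_flags(self, tenant, edition):
        self.calls.append(("flags", tenant, edition))
    def assign_plan(self, tenant, edition):
        self.calls.append(("plan", tenant, edition))
    def record(self, event, **details):
        self.calls.append((event, details))

def onboard_tenant(tenant_id, edition, config, billing, audit):
    """Validate tenancy -> seed baseline flags -> assign plan -> audit."""
    if not TENANT_ID.match(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    config.seed_baseline_flags(tenant_id, edition)
    billing.assign_plan(tenant_id, edition)
    audit.record("tenant_onboarded", tenant=tenant_id, edition=edition)
    return {"tenant": tenant_id, "edition": edition, "status": "Active"}

svc = Recorder()
result = onboard_tenant("acme-clinic", "standard", svc, svc, svc)
```

The same entry point can back a self-service API, with the validation and audit write enforced before any resources are provisioned.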

Operations, DR & BCP

Purpose

Define the operational framework for running the SaaS platform in production, including incident response models, disaster recovery (DR) objectives, business continuity planning (BCP), and continuous improvement loops. Ensure resilience by codifying RTO/RPO targets, runbooks, escalation policies, and post-incident learning practices.


Incident Response Model

Principles

  • Always-on detection: observability signals drive automated alerting.
  • Severity-driven triage: classify incidents by impact scope (tenant, edition, platform-wide).
  • Clear ownership: each bounded context has an on-call roster with escalation paths.
  • Blameless culture: focus on systemic fixes, not individual blame.

Severity Levels

| Sev | Example | Response Target |
|---|---|---|
| 1 (Critical) | Multi-tenant outage, data breach | Respond ≤ 5 min, resolve ≤ 2h |
| 2 (High) | Single-tenant critical service loss | Respond ≤ 15 min, resolve ≤ 4h |
| 3 (Medium) | Performance degradation, delayed jobs | Respond ≤ 1h, resolve ≤ 24h |
| 4 (Low) | Cosmetic UI issue, minor bug | Triage in backlog |
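Severity-driven triage can be automated by mapping impact scope directly to the table above. In this sketch the scope labels and the minute-based targets are illustrative assumptions, not a published triage schema.

```python
# Illustrative triage map: impact scope -> (severity, respond-by, resolve-by),
# with targets in minutes mirroring the severity table above.
SEVERITY = {
    "platform_wide":   (1, 5, 120),
    "tenant_critical": (2, 15, 240),
    "degradation":     (3, 60, 1440),
    "cosmetic":        (4, None, None),  # triaged in backlog, no hard target
}

def classify(scope):
    """Return the severity and response/resolution targets for a scope."""
    sev, respond, resolve = SEVERITY[scope]
    return {"sev": sev, "respond_min": respond, "resolve_min": resolve}
```

An alerting pipeline would call `classify` when an incident is opened and use the result to select the paging policy.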

Escalation

  • PagerDuty/Teams alerts → On-call engineer → Escalation to service lead → SRE/Architecture Guild if systemic.
  • Stakeholder comms via status page + tenant emails.

DR/BCP Targets

Disaster Recovery Objectives

  • RTO (Recovery Time Objective): ≤ 1 hour for core services, ≤ 4 hours for non-critical.
  • RPO (Recovery Point Objective): ≤ 15 minutes for transactional data (SQL, Service Bus), ≤ 1 hour for logs/telemetry.
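An RPO target is only meaningful if it is continuously verified: the newest restore point must never be older than the RPO window for its data class. The sketch below shows that check under stated assumptions; the data-class keys are illustrative labels for the two tiers above.

```python
from datetime import datetime, timedelta, timezone

# RPO targets from the objectives above, keyed by illustrative data class.
RPO = {
    "transactional": timedelta(minutes=15),  # SQL, Service Bus
    "telemetry": timedelta(hours=1),         # logs/telemetry
}

def rpo_breached(last_restore_point, data_class, now=None):
    """True if the newest restore point is older than the RPO for this class."""
    now = now or datetime.now(timezone.utc)
    return (now - last_restore_point) > RPO[data_class]
```

Running this check on a schedule (and alerting on breach) turns the RPO from a document statement into an enforced invariant.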

Continuity Scenarios

  • Regional outage: failover to paired Azure region (e.g., West Europe ↔ North Europe).
  • Service degradation: reroute workloads to ACA fallback if AKS cluster impaired.
  • Data corruption: restore from PITR backups with integrity checks.
  • Identity outage: cached tokens + reduced-mode features until IdP restored.

Runbooks & Playbooks

Examples

  • Outage Playbook:

    1. Identify impacted tenants/editions.
    2. Trigger failover runbook (DNS cutover, service redeploy).
    3. Notify stakeholders.
    4. Verify SLO restoration.
  • Data Corruption:

    1. Halt writes.
    2. Restore PITR snapshot.
    3. Re-run integrity checks.
    4. Replay DLQ events.
  • Service Bus Saturation:

    1. Scale consumers via KEDA.
    2. Purge DLQ to holding store for replay.
    3. Tune retry backoff.
  • Tenant Migration Failure:

    1. Rollback to previous DB/schema.
    2. Reassign tenant routing.
    3. Notify tenant with comms template.

Runbooks are stored in docs/runbooks/*.md; each includes steps, required roles, estimated timings, and rollback procedures.
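The requirement that every runbook declare steps, roles, timings, and a rollback procedure can be enforced in CI. This is a hypothetical front-matter validator, sketched under the assumption that runbook metadata is parsed into a dict; the field names are taken from the sentence above.

```python
# Hypothetical lint for runbook metadata: a runbook may not be merged unless
# it declares all four required elements.
REQUIRED_FIELDS = {"steps", "required_roles", "estimated_timings", "rollback"}

def missing_fields(runbook_meta):
    """Return a sorted list of required fields absent from runbook metadata."""
    return sorted(REQUIRED_FIELDS - set(runbook_meta))

complete = {
    "steps": ["halt writes", "restore PITR snapshot", "re-run integrity checks"],
    "required_roles": ["SRE on-call"],
    "estimated_timings": "45m",
    "rollback": ["re-enable writes on prior snapshot"],
}
```

A pipeline step would iterate over docs/runbooks/*.md, parse each file's front matter, and fail the build when `missing_fields` returns anything.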


Post-Incident Review (PIR)

  • Held within 72 hours of any Sev 1 or Sev 2 incident.
  • Format: What happened? What was the impact? What went well? What failed? What do we improve?
  • Outputs:
    • Incident report (public/tenant redacted version + internal detailed version).
    • Linked DevOps items for remediation.
    • SLA/SLO impact recorded for trend tracking.

Blameless retrospectives are mandatory to reinforce this culture.


Operational KPIs & Continuous Improvement

Key Metrics

  • MTTR (Mean Time to Recovery)
  • MTTA (Mean Time to Acknowledge)
  • SLA compliance (%)
  • Error budget burn rate
  • Change failure rate (CFR)
  • Incident recurrence rate
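MTTA and MTTR fall directly out of incident records. The sketch below computes them from illustrative records; the field names are assumptions, and times are expressed as minutes from incident open for simplicity.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

def incident_kpis(incidents):
    """incidents: dicts with 'opened'/'acknowledged'/'resolved' in minutes."""
    mtta = mean(i["acknowledged"] - i["opened"] for i in incidents)
    mttr = mean(i["resolved"] - i["opened"] for i in incidents)
    return {"mtta_min": mtta, "mttr_min": mttr}

sample = [
    {"opened": 0, "acknowledged": 5, "resolved": 60},
    {"opened": 0, "acknowledged": 15, "resolved": 120},
]
```

In practice these would be computed per service and per severity so trends can be tracked against the response targets in the severity table.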

Continuous Improvement Loop

  1. Collect KPIs + PIR findings.
  2. Update ADRs or runbooks as needed.
  3. Feed back into test automation (synthetic checks for past failures).
  4. Share learnings across product teams in guild sessions.

Observability & Automation

  • SLO dashboards per service, tenant-aware.
  • Error budget alerts tied to rollout progression (halt if exceeded).
  • Runbook automation: scripts for DNS cutover, database restore, tenant reroute.
  • Chaos engineering: scheduled DR drills to validate RTO/RPO in practice.
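The "error budget alerts tied to rollout progression" bullet can be made concrete with a burn-rate gate. This is a minimal sketch, assuming a simple ratio definition of burn rate; the SLO target and threshold values are illustrative, and a real gate would read error data from the SLO dashboards.

```python
def burn_rate(errors, requests, slo_target=0.999):
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = 1.0 - slo_target
    return (errors / requests) / allowed

def rollout_allowed(errors, requests, slo_target=0.999, max_burn=1.0):
    """Gate for rollout progression: halt when the budget burns too fast."""
    return burn_rate(errors, requests, slo_target) <= max_burn
```

A deployment pipeline would evaluate this gate between cohorts and halt (or roll back) the rollout when it returns False.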

Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Unclear on-call ownership | Delayed response | On-call schedules automated; service owner registry |
| DR drills skipped | False sense of security | Mandatory quarterly tests with reports |
| PIRs not actioned | Repeat incidents | Link PIR items to DevOps with SLA |
| Overloaded on-call team | Burnout | Rotations, SLO budgets, escalation to guilds |

Solution Architect Notes

  • Treat DR/BCP as code: IaC templates for failover infra, automated DNS cutovers, scripted PITR restores.
  • Ensure multi-tenant context in all incident workflows (know who’s impacted immediately).
  • Regularly test rollback and failover—don’t assume Azure SLAs replace DR.
  • Continuous improvement is the differentiator: every incident should make the platform stronger.

Real-Life Example SaaS Products

Purpose

Ground the high-level design in practical, scenario-based examples that demonstrate how the factory can generate distinct SaaS offerings using its reusable patterns. These examples highlight flexibility across industries, compliance requirements, and AI-first capabilities.


Example 1: SaaS CRM Lite

  • Target Tenants: SMEs and startups.
  • Core Contexts Used: Identity, Tenant Management, Config, Notifications.
  • Data Model: Pooled DB (multi-tenant shared schema with RLS).
  • Editions:
    • Free → 5 users, basic CRM (contacts, tasks).
    • Standard → unlimited users, reporting dashboards.
    • Enterprise → advanced workflows, SSO, API access.
  • Extensibility:
    • Webhooks for customer lifecycle events.
    • Slack and Microsoft Teams integration via outbound events.
    • REST APIs for partner add-ons.

Why it works: A simple product leverages the factory’s core scaffolding (identity, multi-tenancy, config) with minimal customization.
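The edition tiers above reduce to an entitlement matrix plus limit checks. The sketch below is an illustrative encoding of the CRM Lite tiers, not a published entitlement schema; the feature keys and `None`-as-unlimited convention are assumptions.

```python
# Illustrative edition matrix for the CRM Lite example above.
EDITIONS = {
    "free":       {"max_users": 5,    "reporting": False, "sso": False, "api": False},
    "standard":   {"max_users": None, "reporting": True,  "sso": False, "api": False},
    "enterprise": {"max_users": None, "reporting": True,  "sso": True,  "api": True},
}

def entitled(edition, feature):
    """True if the edition includes the feature flag."""
    return bool(EDITIONS[edition].get(feature))

def within_user_limit(edition, user_count):
    """True if the tenant's user count fits the edition (None = unlimited)."""
    limit = EDITIONS[edition]["max_users"]
    return limit is None or user_count <= limit
```

The same matrix drives both runtime checks (feature gating per request) and the config seeding done at onboarding.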


Example 2: Healthcare SaaS (HIPAA Overlay)

  • Target Tenants: Clinics and medical practices.
  • Overlays Applied: HIPAA compliance, immutable audit trails, PHI encryption, data residency enforcement.
  • Editions:
    • Standard → support for multiple clinics under one tenant.
    • Enterprise → per-facility isolation (schema-per-tenant or DB-per-tenant).
  • Integrations:
    • FHIR APIs to connect with EHR systems.
    • Secure patient notifications (SMS/email with consent).
    • Billing and insurance workflows.
  • Data Controls:
    • Tokenization of PHI in audit logs.
    • WORM storage policies for compliance exports.

Why it works: Demonstrates the policy overlays and regulatory guardrails built into the factory (GDPR/HIPAA packs).
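Tokenization of PHI in audit logs can be done with a keyed hash: the same identifier always maps to the same token, so log entries remain correlatable without exposing the raw value. The sketch below uses HMAC-SHA-256 as one plausible construction; key management and token length here are illustrative only, not a compliance recommendation.

```python
import hashlib
import hmac

def tokenize_phi(value, key):
    """Stable, non-reversible token for a PHI value (same input, same token)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def audit_event(event, patient_id, key):
    """Audit record carrying a token instead of the raw patient identifier."""
    return {"event": event, "patient_token": tokenize_phi(patient_id, key)}

key = b"demo-key-not-for-production"  # real keys live in a managed key vault
entry = audit_event("record_viewed", "patient-12345", key)
```

Because the token is keyed, an attacker with log access cannot brute-force identifiers without the key, yet auditors holding the key can re-derive tokens to trace a patient's history.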


Example 3: AI Knowledge SaaS

  • Target Tenants: Universities, training providers, research institutions.
  • Core Feature: AI-first orchestration via Semantic Kernel.
  • Capabilities:
    • AI tutors (multi-agent orchestration).
    • Knowledge search across uploaded materials.
    • Context-aware Q&A with tenant-isolated embeddings.
  • Edition Features:
    • Free → limited AI queries (e.g., 50/month).
    • Standard → shared embeddings, course recommendations.
    • Enterprise → private embeddings, multi-agent orchestration, ingestion of tenant-specific datasets.
  • Observability:
    • Per-tenant AI usage dashboards.
    • Quotas to prevent overuse and ensure fair cost allocation.

Why it works: Showcases factory support for AI workloads (semantic orchestration, tenant-aware embeddings, quotas) and highlights the ability to build future-ready products.
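The per-tenant quotas above (e.g. Free at 50 queries/month) can be sketched as a counter with an edition-aware limit. The quota values and in-memory tracker below are illustrative; a real implementation would persist counters and reset them per billing period.

```python
# Hypothetical monthly AI-query quotas per edition (None = unlimited).
QUOTAS = {"free": 50, "standard": 1000, "enterprise": None}

class QuotaTracker:
    """In-memory usage counter; real systems would persist per billing period."""
    def __init__(self):
        self.used = {}

    def try_consume(self, tenant_id, edition):
        """Increment usage and return True if the tenant is under quota."""
        limit = QUOTAS[edition]
        count = self.used.get(tenant_id, 0)
        if limit is not None and count >= limit:
            return False
        self.used[tenant_id] = count + 1
        return True
```

Rejections from `try_consume` would feed the per-tenant usage dashboards, making fair cost allocation visible to both operators and tenants.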


Key Takeaways

  • The factory is industry-agnostic: from lightweight CRM to regulated healthcare to AI-first products.
  • Edition and overlay packs adapt the same baseline platform for different compliance and market needs.
  • Multi-tenancy and extensibility patterns (webhooks, APIs, events) repeat across all products.
  • The platform’s design ensures reusability, speed-to-market, and compliance by default.

Conclusion & Summary

Why the Factory Exists

The ConnectSoft SaaS Factory was conceived to dramatically reduce time-to-market for SaaS solutions while embedding non-negotiable guardrails for security, compliance, and reliability. By abstracting away repeated engineering concerns — identity, tenancy, observability, compliance, CI/CD, FinOps — the factory allows product teams to focus on differentiated business value instead of reinventing the same platform foundations.


Core Pillars

The design presented in this HLD is anchored on four cross-cutting pillars:

  • Security by Design — OAuth2/OIDC, workload identities, encryption everywhere, data classification, policy enforcement.
  • Multi-Tenancy & Editions — pooled vs. isolated tenancy, per-tenant config/flags, edition overlays, migration paths.
  • Observability & Resilience — OTel-first telemetry, traces/logs/metrics, error budgets, SLO dashboards, chaos testing.
  • AI-First Orchestration — Semantic Kernel and Microsoft.Extensions.AI enable agentic workflows, intelligent assistants, and future-ready product capabilities.

Together, these pillars ensure every generated SaaS product is compliant, scalable, and adaptable.


Blueprint Completeness

Across the sections, the document has defined:

  • Vision & Personas → what problems we solve, for whom.
  • Bounded Contexts & Architecture → how services decompose, integrate, and evolve.
  • Non-Functional Guarantees → SLOs, privacy, compliance, resilience.
  • Operational Excellence → CI/CD pipelines, FinOps guardrails, risk registers, DR/BCP.
  • Practical Examples → how real SaaS products (CRM, Healthcare, AI Knowledge) emerge from the same templates.

This represents a complete high-level blueprint for both the factory platform and the SaaS solutions it produces.


Path Forward

This HLD is not static. It is the foundation for implementation, and will evolve through:

  • ADRs: Architecture Decision Records documenting trade-offs and changes.
  • Epics & Features: Work planned and executed in Azure DevOps.
  • Continuous Improvement: Post-incident reviews, PIR-driven ADRs, and refinements to templates and runbooks.

Every new SaaS product built with the factory contributes learnings and enhancements back into the core templates, making the platform stronger with each iteration.


Final Note

The ConnectSoft SaaS Factory provides a repeatable, governed, AI-driven model for delivering SaaS at scale. It balances innovation velocity with enterprise-grade guardrails, ensuring that each generated solution is secure, observable, multi-tenant aware, and ready for production from day one.

This document closes the High-Level Design phase and marks the transition to detailed design and implementation.