📋 ConnectSoft SaaS Factory — High-Level Design (HLD)¶
The ConnectSoft SaaS Factory is a generic, template-driven platform for creating SaaS solutions quickly, securely, and consistently.
Unlike traditional projects where each product is designed from scratch, the factory provides paved roads: pre-built architectural templates, infrastructure packs, and automation workflows that allow product teams to focus on business value instead of reinventing technical foundations.
Core pillars:
- Cloud-Native & Event-Driven: Designed for scalability and resilience.
- Security-First & Compliance-by-Default: Guardrails embedded in every template.
- Multi-Tenant & Edition-Aware: Monetization and isolation strategies built in.
- AI-First Orchestration: Semantic Kernel and agentic workflows accelerate delivery.
- Observability Everywhere: Traces, logs, metrics included out of the box.
Target Benefits:
- Cut time-to-market for new SaaS offerings from months to days.
- Guarantee consistency across solutions, reducing operational cost.
- Ensure regulatory compliance from the first deployment.
- Enable innovation by freeing teams from repetitive platform work.
Introduction¶
Context¶
Organizations repeatedly face the same engineering challenges when building SaaS solutions: identity, tenancy, observability, compliance, billing, and extensibility. These components are not differentiators, yet they consume the majority of engineering time.
The ConnectSoft SaaS Factory addresses this by codifying best practices into reusable assets — templates, IaC modules, service packs, and architectural blueprints — that teams can compose into product recipes.
Audience¶
This HLD is intended for:
- Enterprise Architects: to understand platform scope and evolution.
- Engineering Leads & Developers: to apply templates and patterns.
- Product & Business Stakeholders: to validate alignment with goals.
- Operations & Security Teams: to verify compliance and guardrails.
Scope of Document¶
- What this is: A high-level design of a SaaS Factory that produces SaaS solutions.
- What this is not: A low-level implementation guide or a one-off SaaS product design.
Vision, Outcomes & Scope¶
Purpose¶
The ConnectSoft SaaS Factory exists to provide a generic, reusable platform for building and operating SaaS solutions. Instead of treating each SaaS product as a bespoke project, the factory offers a set of templates, architecture patterns, and automation packs that allow teams to bootstrap new SaaS offerings with speed, quality, and compliance already embedded.
This factory approach ensures that:
- Time-to-market is drastically reduced through pre-built blueprints.
- Consistency across solutions is maintained, lowering operational overhead.
- Security and compliance are guaranteed by default, not bolted on later.
- Observability is always enabled, allowing proactive operations.
- Multi-tenancy and edition models are first-class, making monetization and scaling straightforward.
Problem Space¶
Modern SaaS builders face common challenges:
- Repetition of patterns (auth, tenancy, billing, observability) across every product.
- Time lost in rebuilding baseline infrastructure and compliance controls.
- Inconsistent quality when different teams make different architectural choices.
- High barrier to scaling due to lack of standardized multi-tenancy, resilience, and observability.
- Slow compliance adoption (GDPR, HIPAA, SOC2) when not embedded from the start.
The SaaS Factory addresses these by offering paved roads: highly opinionated but extensible blueprints.
Target Value Map¶
| Dimension | Value Proposition | Example Metric |
|---|---|---|
| Speed | Reduce new SaaS product setup from months to days. | Time from product idea → first running environment. |
| Consistency | Unified architecture, code templates, IaC. | % of services following factory patterns. |
| Compliance | Security & privacy controls built-in. | % of audits passed without major findings. |
| Scalability | Ready-to-scale multi-tenant core. | Number of tenants supported per cluster/node. |
| Innovation | AI-first orchestration and automation. | % of backlog items delivered by AI-generated scaffolds. |
High-Level OKRs¶
Objective 1: Reduce time-to-market for SaaS products.
- KR1: 80% of new products bootstrapped with factory templates within 1 day.
- KR2: First production tenant live within 2 weeks of project kickoff.
Objective 2: Ensure security, compliance, and observability by default.
- KR1: 100% of generated services include OTel, logging, metrics.
- KR2: 100% of generated services use workload identity + secret-less design.
- KR3: ≥ 95% compliance checklist coverage at go-live.
Objective 3: Provide consistent developer experience.
- KR1: 90% of developers report positive DX (survey).
- KR2: Mean time to onboard new engineer ≤ 2 days.
Explicit Non-Goals¶
- Not a one-off platform: This is a factory, not a custom-built product for a single SaaS.
- Not vendor-locked: Although Azure is primary, abstractions exist for other clouds.
- Not monolithic: No “mega-service”; all blueprints enforce microservices and modular design.
- Not compliance-by-documentation: Compliance is enforced technically (policy gates, guardrails), not left to paperwork.
- Not a silver bullet: Factory does not replace product-specific business logic; teams still own domain innovation.
Guardrails¶
- Security & observability cannot be disabled.
- Multi-tenancy and editioning are always enforced.
- ADRs required for deviations from factory templates.
- Paved road approach: 80% covered by factory defaults, 20% extensible through overrides.
Personas, Tenants & Editions¶
Personas¶
The platform supports multiple personas across different layers of the SaaS lifecycle. These are not tied to one product but apply generically across all SaaS solutions generated by the factory:
- Tenant Administrator: Responsible for onboarding, user management, and tenant-level configuration. Uses self-service tooling to manage editions, roles, and features.
- End User: Consumes product functionality within the context of their tenant. Interacts only with the feature entitlements assigned by edition and policy.
- Billing & Finance Operator: Oversees subscriptions, invoices, and payments across tenants. Works with dynamic billing models and edition-based monetization.
- Support Agent: Provides operational and customer support. Needs tenant-specific observability and diagnostic views across editions.
- Platform Operator (SRE/DevOps): Maintains infrastructure, enforces global guardrails, and ensures SLA adherence across multiple SaaS solutions. Focuses on tenant isolation, scaling, and compliance.
- Product Owner / Engineering Teams: Define new products, editions, and feature packs inside the system dynamically. Responsible for extending templates to create differentiated SaaS offerings.
Tenant Archetypes¶
Different tenant profiles are supported dynamically, with no need to hardcode rules in the factory:
- Trial / Evaluation Tenant: Activated on sign-up, bound to a Free edition, strict quotas. Ideal for fast adoption and funnel growth.
- Growth Tenant (SMB / Mid-Market): May purchase Standard or custom editions. Needs integration with external identity, moderate quotas, and cost-effective plans.
- Enterprise Tenant: Assigned to Enterprise edition or a custom product pack. Requires SSO, SCIM provisioning, audit exports, custom compliance, and region-specific data residency.
Editions¶
The factory does not limit editions to a fixed set; editions are defined dynamically in the target product configuration:
- Free Edition (default template): Minimal feature set, reduced quotas, short SLA. Used as a baseline template for trials.
- Standard Edition (default template): Full baseline capability pack, medium quotas, typical SLA.
- Enterprise Edition (default template): Advanced feature pack, premium quotas, enterprise SLA.
- Custom Editions: Created dynamically by product owners in the system, composed of features, quotas, and entitlements. Example: Healthcare Pack, FinTech Pack.
All editions are described as metadata objects stored in the platform and enforced by policy engines at runtime.
Tenant × Edition × Capability Matrix¶
Because editions are dynamic, the following matrix represents the default templates provided by the factory. Product owners may extend or override it.
| Capability | Free | Standard | Enterprise | Custom (Dynamic) |
|---|---|---|---|---|
| Multi-Tenancy Isolation | ✓ | ✓ | ✓ | ✓ |
| Identity (OIDC/SSO) | – | ✓ | ✓ (SCIM) | Configurable |
| API Rate Limit (req/min) | 60 | 600 | 3000 | Configurable |
| Feature Flags | Basic | Standard | Advanced | Configurable |
| Observability Dashboards | Shared | Tenant | Advanced | Configurable |
| Data Residency Choice | – | – | ✓ | Configurable |
| SLA / Support | – | 8×5 | 24×7 | Configurable |
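As a rough illustration of editions as metadata objects, the sketch below models the default templates from the matrix and composes a custom edition from overrides. It is a language-neutral Python sketch, not the platform's .NET implementation; the `Edition` fields, `define_custom_edition`, `is_entitled`, and the `healthcare-pack` name are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Edition:
    """Edition metadata object; field names are illustrative assumptions."""
    name: str
    rate_limit_rpm: int
    sso: bool
    scim: bool
    data_residency_choice: bool
    support: Optional[str]  # None means no SLA / support commitment


# Default templates, with values copied from the matrix above.
DEFAULT_EDITIONS = {
    "free": Edition("free", 60, False, False, False, None),
    "standard": Edition("standard", 600, True, False, False, "8x5"),
    "enterprise": Edition("enterprise", 3000, True, True, True, "24x7"),
}


def define_custom_edition(name: str, base: str, **overrides) -> Edition:
    """Compose a custom edition from a default template plus overrides."""
    return replace(DEFAULT_EDITIONS[base], name=name, **overrides)


def is_entitled(edition: Edition, capability: str) -> bool:
    """Deny by default: unknown or falsy capabilities are not granted."""
    return bool(getattr(edition, capability, False))


# Hypothetical custom edition built on the Standard template.
healthcare = define_custom_edition(
    "healthcare-pack", "standard", data_residency_choice=True
)
```

Because the edition is plain data, an upgrade or a custom rollout changes which object a tenant points at rather than requiring a redeployment.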
Top Use Cases¶
-
Dynamic Product Creation Product owner defines a new product, editions, and entitlements via admin console or API. The factory applies baseline templates and stores metadata.
-
Tenant Onboarding Tenant signs up and is provisioned with the default Free edition. Admin can later upgrade or assign a custom edition without re-deployment.
-
Edition Upgrade / Downgrade Tenant transitions dynamically between editions (e.g., Free → Standard, Standard → Enterprise) without migration downtime.
-
Custom Edition Rollout A healthcare SaaS product defines a Healthcare Pack with HIPAA compliance, custom storage policies, and audit logging. Tenants can subscribe to this edition dynamically.
-
Tenant-Specific Support Support agent views tenant entitlements, edition policies, and quotas to resolve feature-availability questions quickly.
Strategic Domains & Context Map¶
Purpose¶
Define the bounded contexts that compose the generic SaaS platform, clarify their responsibilities and collaboration style, and set the principles for clean integration across domains. The goal is to maximize autonomy, enable evolution without ripple effects, and keep multi-tenancy, security, and observability as first-class concerns within each boundary.
Bounded Contexts (overview)¶
| Context | Core Responsibility | Primary Interfaces | Persistence (default) | Tenancy Notes |
|---|---|---|---|---|
| SaaS Core Metadata | System-of-record for Products, Editions, Features, Entitlements, Quotas, Pack composition | REST (admin), Events | Azure SQL | Enforces product/edition metadata; emits entitlement updates |
| Identity | OIDC provider (OpenIddict/AAD), Users, Roles, Scopes, Service Principals | OIDC/OAuth2, REST, Events | Azure SQL | Token carries tenantId, edition, entitlements claims |
| Tenant Management | Tenant lifecycle (provision, activate, suspend, delete), Region/Residency, Tenant config bootstrap | REST, Events | Azure SQL | Owns tenantId; author of cross-context tenant events |
| Billing | Plans, Subscriptions, Invoicing, Usage Rating; Payment provider integrations (via ACL) | REST/Webhooks, Events | Azure SQL | Per-tenant subscription state and entitlements sync |
| Configuration (Config & Feature Flags) | Dynamic flags, settings, edition overrides, kill-switches | REST, Events | Azure SQL + Redis | Per-tenant/edition flag resolution; policy guardrails |
| Usage & Metering | API call meters, storage/compute consumption, quota enforcement | REST (read), Events (write) | Azure SQL (agg) + cold storage | Emits usage events to Billing and Observability |
| Notifications | Email/SMS/Webhook dispatch, templates, tenant branding | REST, Events | MongoDB (templates) + Queue | Tenant-scoped channels; signed webhooks |
| Audit & Compliance | Immutable audit trail, access logs, policy decisions, exports | REST (query), Events (append) | Azure SQL (append-only) | Cross-tenant isolation; long retention & eDiscovery |
| AI Orchestration | Agentic flows (Semantic Kernel), scaffolding, docs/tests generation, safe tool use | REST, Jobs, Events | Blob/Queue for artifacts | Runs under least-privileged scopes; auditable tools |
Notes:
• Storage choices are defaults; contexts can swap with portable equivalents.
• Each context publishes canonical events with required headers (`traceId`, `tenantId`, `edition`, `schemaVersion`).
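One way a consumer or broker-side filter might enforce the required headers is sketched below. The envelope shape (a `headers` dictionary next to the payload) and the function names are assumptions made for illustration.

```python
REQUIRED_HEADERS = ("traceId", "tenantId", "edition", "schemaVersion")


def missing_headers(headers: dict) -> list:
    """Return the required headers that are absent or empty."""
    return [name for name in REQUIRED_HEADERS if not headers.get(name)]


def accept_event(event: dict) -> dict:
    """Reject events that lack a canonical header; otherwise pass them through."""
    missing = missing_headers(event.get("headers", {}))
    if missing:
        raise ValueError(f"event rejected, missing headers: {missing}")
    return event


# Hypothetical event used only to exercise the check.
sample_event = {
    "type": "tenant.created",
    "headers": {
        "traceId": "abc123",
        "tenantId": "t-1",
        "edition": "free",
        "schemaVersion": "1",
    },
    "payload": {"region": "westeurope"},
}
accepted = accept_event(sample_event)
```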
Integration Styles & Collaboration¶
- Event-first collaboration: Domain facts published as immutable events; subscribers build local read models or kick off workflows.
- REST for command/query where necessary: CRUD/admin operations and strongly consistent reads remain RESTful within the boundary.
- Webhooks for external extensibility: Billing, Notifications, and Audit expose signed webhooks for third-party systems.
- Anti-Corruption Layers (ACLs): External payment, messaging, or identity providers are wrapped behind ACLs to protect internal models.
- Outbox/Inbox patterns: All cross-context messages flow through outbox (producer) and inbox (consumer) for idempotency and reliability.
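The outbox/inbox pairing can be sketched in memory as follows. This is a simplified model under stated assumptions: a real outbox would be written in the same database transaction as the state change, and the `messageId` field, class names, and in-memory storage are illustrative only.

```python
import uuid


class Outbox:
    """Producer side: events are recorded first, then relayed to the bus."""

    def __init__(self):
        self.pending = []

    def record(self, event_type: str, payload: dict) -> dict:
        message = {"messageId": str(uuid.uuid4()), "type": event_type, "payload": payload}
        self.pending.append(message)
        return message

    def relay(self, publish) -> int:
        """Publish pending messages; each is removed only after publish returns."""
        sent = 0
        while self.pending:
            publish(self.pending[0])
            self.pending.pop(0)
            sent += 1
        return sent


class Inbox:
    """Consumer side: de-duplicates by message id so handling is idempotent."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()

    def receive(self, message: dict) -> bool:
        if message["messageId"] in self.seen:
            return False  # duplicate delivery, already handled
        self.handler(message)
        self.seen.add(message["messageId"])
        return True


handled = []
inbox = Inbox(handled.append)
outbox = Outbox()
first = outbox.record("tenant.created", {"tenantId": "t-1"})
sent_count = outbox.relay(inbox.receive)
duplicate_accepted = inbox.receive(first)  # simulate an at-least-once redelivery
```

The inbox is what makes at-least-once delivery safe: a redelivered message is acknowledged without running the handler twice.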
Context Relationships (narrative)¶
- SaaS Core Metadata is upstream for Config, Billing, Usage, and Identity by defining products, editions, and entitlements.
- Tenant Management is the authoritative source of `tenantId` lifecycle and residency; downstream systems subscribe to tenant.created/updated/deleted events.
- Billing depends on Usage for rated consumption and on SaaS Core Metadata for plan/entitlement definitions; pushes subscription state changes to Identity (claims) and Config (feature gates).
- Config consumes product/edition definitions and applies edition overrides and tenant-scoped flags, influencing feature availability platform-wide.
- Audit subscribes to events from every context and exposes immutable, query-optimized views for operators and compliance exports.
- AI Orchestration is lateral: it reads contracts and policies, generates scaffolds/PRs, and never mutates domain state without going through the owning context’s APIs.
- Notifications consumes events across contexts to deliver comms and expose signed webhooks to tenant systems.
Context Map (Mermaid)¶
flowchart LR
subgraph Upstream
SCM[SaaS Core Metadata]
TEN[Tenant Management]
end
IDP[Identity]:::core
BILL[Billing]:::core
CONF[Config & Flags]:::core
USE[Usage & Metering]:::core
AUD[Audit & Compliance]:::core
NOTIF[Notifications]:::edge
AIO[AI Orchestration]:::edge
TEN -->|tenant.created/updated| IDP
TEN --> BILL
TEN --> CONF
TEN --> USE
TEN --> AUD
TEN --> NOTIF
SCM -->|product/edition/entitlement| BILL
SCM --> CONF
SCM --> IDP
USE -->|usage.recorded| BILL
BILL -->|subscription.activated/changed| IDP
BILL -->|entitlements.updated| CONF
%% Lateral
AIO -. reads contracts & policies .-> SCM
AIO -. opens PRs/tests .-> CONF
%% Observability/Audit taps
IDP --> AUD
BILL --> AUD
CONF --> AUD
USE --> AUD
NOTIF --> AUD
classDef core fill:#0b6,stroke:#094,stroke-width:1,color:#fff;
classDef edge fill:#357,stroke:#234,stroke-width:1,color:#fff;
Anti-Corruption Layer Patterns¶
- Payment Providers (e.g., Stripe, Adyen): Billing implements a Payment ACL translating provider-specific objects (invoices, payment intents) into internal subscription and charge events. Retries and signature validation live in the ACL.
- External Identity Providers (e.g., Azure AD, Okta): Identity provides a Federation ACL for SSO and SCIM. External attributes are mapped to internal `roles/scopes/claims` with normalization and validation.
- Email/SMS Gateways: Notifications uses a Messaging ACL to normalize templates, rate limits, and delivery receipts across providers.
Tenancy & Security Considerations¶
- Tenant authority: Tenant Management is the sole issuer of `tenantId` and region/residency attributes; all contexts must validate inbound `tenantId` against their read model.
- Edition enforcement: Identity tokens include `edition` and derived `entitlements`; services must authorize using policy checks, not UI hints.
- Isolation: Data access is constrained by repository guards and (where supported) RLS/filters keyed by `tenantId`.
- mTLS & Workload Identity: Cross-context calls occur over mTLS; services authenticate using workload identities, not secrets.
- Least privilege: Each context exposes minimal operations; event consumers run with scoped permissions.
Observability & SLO Notes¶
- Traceability: Every cross-context call/event carries `traceId`, `tenantId`, and `edition`. Spans are named `ctx.operation` (e.g., `billing.rate-usage`).
- Golden signals per context:
    - Identity: token issuance latency p95, failure rate
    - Billing: invoice generation p95, reconciliation lag
    - Config: flag evaluation latency p95, cache hit%
    - Usage: event ingest lag p95, quota enforcement accuracy
    - Audit: write throughput, query latency p95
- Error budgets: Defined per context with shared budgets for critical flows (e.g., onboarding path TEN→CONF→IDP).
Evolution Principles¶
- Independent deployability: Contexts release on their own cadence; contracts evolve via additive changes and versioned topics/paths.
- Schema evolution: Event versioning via `type.suffix.vN`; consumers must tolerate unknown fields.
- Backwards compatibility: Deprecations follow a documented sunset window; ACLs absorb third-party churn.
- Testing in isolation: Contract and consumer-driven tests validate behavior without requiring full-platform spin-up.
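The schema-evolution rule above (version suffix plus tolerance for unknown fields) might look like this on the consumer side. The `tenant.created.v2` name, the `TenantCreated` shape, and both helper functions are hypothetical examples, not contracts defined by the platform.

```python
import re
from dataclasses import dataclass, fields

# Matches "<name>.vN", for example "tenant.created.v2".
_TYPE_PATTERN = re.compile(r"^(?P<name>.+)\.v(?P<version>\d+)$")


def parse_event_type(event_type: str):
    """Split a versioned event type into (name, version)."""
    match = _TYPE_PATTERN.match(event_type)
    if match is None:
        raise ValueError(f"unversioned event type: {event_type!r}")
    return match.group("name"), int(match.group("version"))


@dataclass
class TenantCreated:
    """Fields this consumer understands; newer producers may send more."""
    tenantId: str
    region: str


def read_tolerantly(cls, payload: dict):
    """Build cls from payload, silently ignoring fields the consumer does not know."""
    known = {f.name for f in fields(cls)}
    return cls(**{key: value for key, value in payload.items() if key in known})


name, version = parse_event_type("tenant.created.v2")
event = read_tolerantly(
    TenantCreated,
    {"tenantId": "t-1", "region": "westeurope", "addedInV2": True},
)
```

Ignoring unknown fields is what lets producers ship additive changes without coordinating a release with every subscriber.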
Solution Architect Notes¶
- Start with SaaS Core Metadata and Tenant Management as foundation; everything else composes around their events.
- Treat Billing and Config as policy enforcers driven by metadata and subscriptions; resist hardcoding feature switches inside services.
- Keep AI Orchestration stateless with auditable tool calls; it should propose and scaffold, not silently mutate domain state.
- Maintain strict Audit boundaries: append-only, immutable identifiers, and explicit export paths with data minimization.
Non-Functional Requirements & SLOs¶
Purpose¶
The SaaS Factory must deliver predictable, reliable, and secure foundations for any generated SaaS product. Non-functional requirements (NFRs) set the quality bar across dimensions like performance, availability, scalability, security, privacy, compliance, and portability. Service Level Indicators (SLIs) and Service Level Objectives (SLOs) ensure that these qualities are measured and enforced consistently.
Performance¶
- API Latency:
    - p95 read latency ≤ 200 ms
    - p95 write latency ≤ 350 ms
- Throughput:
    - Minimum 1k requests/minute per tenant on pooled editions
    - Scalable to 50k requests/minute on Enterprise
Performance requirements apply uniformly across generated services, with quotas and scaling managed by edition policies.
Availability¶
- Platform Uptime: 99.9% monthly baseline (Enterprise tiers may raise this to 99.95%)
- Critical APIs (Auth, Tenant Onboarding, Billing): ≥ 99.95%
- Scheduled Maintenance: Maximum of 4 hours per month, communicated via status channels
Error budgets are defined per tier; for example, Free tenants may tolerate higher downtime than Enterprise tenants.
Scalability¶
- Multi-Tenant Growth: Must scale to 10k+ tenants per cluster with pooled database model.
- Elastic Scaling: Auto-scale services based on CPU/memory utilization and queue lag.
- Edition-Aware Quotas: Rate limits, storage, and compute quotas enforced dynamically per edition.
Scalability is achieved through horizontal scaling (stateless microservices) and vertical scaling for data workloads.
Security & Privacy¶
- Zero Trust Enforcement: All inter-service traffic authenticated with workload identities + mTLS.
- Data Protection:
    - Encryption at rest (AES-256)
    - Encryption in transit (TLS 1.3)
- Secrets Management: All secrets stored in Key Vault; secret-less by default with managed identities.
- Privacy by Design: PII minimized, data masking in logs, erasure supported via APIs.
- Compliance Baselines: GDPR, SOC2, HIPAA (configurable by product/edition).
Observability¶
- Tracing: 100% of requests/events tagged with `traceId`, `tenantId`, `edition`.
- Metrics: Golden signals for latency, errors, saturation, traffic.
- Logging: Structured, JSON, PII redacted by default.
- Dashboards: Predefined for each context; customizable by tenant/edition.
Observability cannot be disabled. It is a core guardrail across all generated services.
Compliance & Portability¶
- Compliance-by-Default: All services scaffolded with standard controls.
- Auditability: Immutable audit logs retained for 7 years (configurable).
- Portability: Azure-first, but abstractions for SQL → PostgreSQL, Service Bus → Kafka, Key Vault → Vault.
Generated SaaS solutions can be redeployed across clouds without rewriting core services.
SLI/SLO Baseline Table¶
| Dimension | SLI | SLO (Baseline) | Notes / Error Budget |
|---|---|---|---|
| Performance | p95 read/write latency | ≤ 200 ms / 350 ms | Budget: 5% requests may exceed |
| Availability | Platform uptime | ≥ 99.9% | Free edition: 99.5% |
| Scalability | Max tenants per cluster | ≥ 10,000 | Quotas per edition |
| Security | % secrets managed via KV/MSI | 100% | No exceptions allowed |
| Privacy | % requests with PII redacted in logs | 100% | Guardrail; cannot be disabled |
| Observability | % requests traced w/ traceId,tenantId | 100% | Mandatory invariant |
| Compliance | Audit record retention | 7 years | Configurable override possible |
Error Budgets¶
- Availability: If uptime falls below 99.9%, feature velocity is slowed until SLO compliance is restored.
- Performance: 5% budget for latency breaches; if exceeded, scaling or optimization is prioritized.
- Security/Privacy: No error budget — violations trigger incident severity 1.
- Observability: No error budget — missing telemetry is considered a defect.
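To make the budgets concrete, the arithmetic can be worked through as below. The 30-day month and the sample request counts are assumptions chosen for the example; the SLO percentages come from the tables above.

```python
def downtime_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of allowed downtime in a window for a given uptime SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def latency_budget_remaining(
    total_requests: int, slow_requests: int, budget_percent: float = 5.0
) -> float:
    """Requests that may still exceed the latency SLO before the budget is spent."""
    allowed = total_requests * budget_percent / 100.0
    return allowed - slow_requests


# Assumes a 30-day month (43,200 minutes).
baseline = round(downtime_budget_minutes(99.9), 1)        # platform baseline
critical = round(downtime_budget_minutes(99.95), 1)       # critical APIs
free_tier = round(downtime_budget_minutes(99.5), 1)       # Free edition
remaining = latency_budget_remaining(1_000_000, 42_000)   # hypothetical traffic
```

Under these assumptions the baseline allows about 43.2 minutes of downtime per month, critical APIs about 21.6 minutes, and the Free edition about 216 minutes.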
Solution Architect Notes¶
- Apply edition-specific SLO overlays (Enterprise gets stricter SLOs than Free/Standard).
- Use synthetic checks for critical flows (tenant onboarding, login, billing).
- Embed NFRs into CI/CD as automated quality gates.
- Reassess SLOs quarterly and align with customer contracts.
Platform Reference Architecture¶
Purpose¶
Define a reusable runtime blueprint for the generic SaaS platform across environments. The architecture emphasizes edge security, multi-tenant isolation, event-driven collaboration, and observability by default, while remaining portable between AKS and Azure Container Apps (ACA).
Overview¶
The platform is partitioned into clear planes and tiers:
- Edge Plane: Front Door/WAF → API Gateway (YARP) → UI/Portal
- Control Plane: Identity Provider, Policy/Config, CI/CD, IaC
- Data & Messaging Plane: Azure SQL (primary), optional MongoDB, Redis, Storage, Service Bus
- Workload Plane: Core microservices per bounded context (Identity, Tenant, Billing, Config, Usage, Notifications, Audit, AI Orchestration)
- Observability Plane: OpenTelemetry collectors, Logs, Metrics, Traces, Dashboards/Alerts
- Jobs Plane: Hangfire/KEDA workers, scheduled tasks, DLQ replayers
C4 Container (Mermaid)¶
C4Container
title Generic SaaS Platform — Container View
Person(tenantAdmin, "Tenant Administrator")
Person(endUser, "End User")
Person(operator, "Platform Operator / SRE")
Container(portal, "Web UI / Tenant Portal", "SPA", "OIDC client")
Container(adminConsole, "Admin Console", "SPA", "OIDC client")
Container(gateway, "API Gateway", "YARP", "AuthN/Z, routing, rate limits, transforms")
Container(idp, "Identity Provider", "OpenIddict/Azure AD", "Tokens, RBAC/ABAC")
Container(tenantSvc, "Tenant Service", ".NET", "Tenant lifecycle, residency")
Container(billingSvc, "Billing Service", ".NET", "Plans, subscriptions, invoices")
Container(configSvc, "Config & Feature Flags", ".NET", "Flags, edition overrides")
Container(usageSvc, "Usage & Metering", ".NET", "Meters, quotas, aggregates")
Container(notifySvc, "Notifications", ".NET", "Email/SMS/Webhooks via ACL")
Container(auditSvc, "Audit & Compliance", ".NET", "Append-only audit log")
Container(metadataSvc, "SaaS Core Metadata", ".NET", "Products, editions, entitlements")
Container(aiSvc, "AI Orchestration", ".NET + SK", "Agentic scaffolding, tests/docs")
Container(bus, "Service Bus", "Topics/Queues", "Async events, DLQ")
Container(sql, "Azure SQL", "Relational", "Primary state (RLS/tenant guards)")
Container(mongo, "MongoDB (opt)", "Document", "Templates & payloads (notifications)")
Container(redis, "Redis", "Cache", "Flag evaluation, token/claim caches")
Container(storage, "Storage", "Blob", "Artifacts, exports")
Container(obs, "Observability Stack", "OTel/Prometheus/Grafana/Logs", "Traces, metrics, logs")
Container(jobs, "Jobs Runtime", "Hangfire/KEDA", "Schedulers, DLQ replay")
Rel(tenantAdmin, portal, "Browser, OIDC")
Rel(endUser, portal, "Browser, OIDC")
Rel(operator, adminConsole, "Browser, OIDC")
Rel(portal, gateway, "HTTPS")
Rel(adminConsole, gateway, "HTTPS")
Rel(gateway, idp, "OIDC flows, token validation")
Rel(gateway, tenantSvc, "mTLS, JWT")
Rel(gateway, billingSvc, "mTLS, JWT")
Rel(gateway, configSvc, "mTLS, JWT")
Rel(gateway, usageSvc, "mTLS, JWT")
Rel(gateway, notifySvc, "mTLS, JWT")
Rel(gateway, auditSvc, "mTLS, JWT")
Rel(gateway, metadataSvc, "mTLS, JWT")
Rel(gateway, aiSvc, "mTLS, JWT")
Rel(tenantSvc, bus, "publish/subscribe")
Rel(billingSvc, bus, "publish/subscribe")
Rel(configSvc, bus, "publish/subscribe")
Rel(usageSvc, bus, "publish/subscribe")
Rel(notifySvc, bus, "consume events")
Rel(auditSvc, bus, "consume events")
Rel(aiSvc, bus, "publish/subscribe")
Rel(tenantSvc, sql, "ADO.NET")
Rel(billingSvc, sql, "ADO.NET")
Rel(configSvc, sql, "ADO.NET / Redis")
Rel(usageSvc, sql, "ADO.NET")
Rel(notifySvc, mongo, "Drivers")
Rel(auditSvc, sql, "Append-only")
Rel(aiSvc, storage, "Artifacts")
Rel(gateway, obs, "OTLP (traces/metrics/logs)")
Rel(idp, obs, "OTLP (traces/metrics/logs)")
Rel(tenantSvc, obs, "OTLP (traces/metrics/logs)")
Rel(billingSvc, obs, "OTLP (traces/metrics/logs)")
Rel(configSvc, obs, "OTLP (traces/metrics/logs)")
Rel(usageSvc, obs, "OTLP (traces/metrics/logs)")
Rel(notifySvc, obs, "OTLP (traces/metrics/logs)")
Rel(auditSvc, obs, "OTLP (traces/metrics/logs)")
Rel(metadataSvc, obs, "OTLP (traces/metrics/logs)")
Rel(aiSvc, obs, "OTLP (traces/metrics/logs)")
Rel(jobs, obs, "OTLP (traces/metrics/logs)")
Rel(jobs, bus, "Job queues, DLQ replay")
Runtime Topologies¶
AKS (Kubernetes)
- Best for large multi-tenant scale, advanced networking, service mesh, and fine-grained autoscaling.
- Supports mTLS via service mesh, network policies, and workload identities.
- Suited for Enterprise and products requiring custom sidecars or high-throughput messaging.
ACA (Azure Container Apps)
- Simpler operational footprint, KEDA-native autoscaling on HTTP/queue metrics.
- Excellent for SMB/Standard offerings, jobs, workers, and bursty workloads.
- Can coexist with AKS as a jobs plane (DLQ replayers, batch processors).
Both variants keep the contract and container boundaries identical; choice is an operational concern decided per environment or product.
Network & Trust Boundaries¶
- Public Edge: Front Door/WAF terminates TLS; only the Gateway is internet-exposed.
- Private App Network: All services run in private subnets with deny-by-default policies.
- Data Network: Databases and storage are accessible via Private Link; no public endpoints.
- Observability & CI/CD: Access via managed identities and least privilege; audit logs immutable.
Trust boundaries are drawn at the Gateway edge and between Workload Plane ↔ Data Plane. Every call across a boundary uses mTLS and JWT validation (for user/service identity).
Identity & Secrets¶
- User/Client Identity: OIDC tokens issued by OpenIddict/Azure AD, including `tenantId`, `edition`, `entitlements`.
- Workload Identity: Managed identity for services; avoid secret injection; Key Vault for any remaining secrets.
- Policy Enforcement: Scope/role checks at the gateway and policy filters in services (RBAC/ABAC).
- Key Rotation: Automated; rotation ≤ 90 days baseline.
Data & Storage Layer¶
- Primary: Azure SQL with tenant guards (RLS or repository filters), strict schema ownership per context.
- Optional: MongoDB for notification templates/payloads; Redis for low-latency flag evaluation and token claims.
- Artifacts: Blob storage for exports, audit bundles, and AI scaffolding outputs.
- Retention: Defaults per context (e.g., Audit 7y, Usage raw 90d + aggregates).
Messaging Layer¶
- Azure Service Bus: Topics/queues for domain events, sagas, and DLQs.
- MassTransit: Implements outbox/inbox, retry with jitter, and saga coordination.
- Contracts: Canonical events with versioning; idempotency keys; signed webhook exports at the edge.
Observability Plane¶
- OpenTelemetry Everywhere: Traces, logs, metrics with required attributes (`traceId`, `tenantId`, `edition`).
- Dashboards: Per context and per tenant/edition views; error budgets and SLO tracking.
- Log Hygiene: Structured JSON, PII redaction by default, correlation with audit records.
Jobs & Scheduling¶
- Recurring/Scheduled: Hangfire or ACA jobs with UTC cron; idempotent job keys.
- Event-Driven: KEDA scales workers off queue length and lag (DLQ replayers, compactions).
- Observability: Job success/failure metrics, run durations, and retries are first-class signals.
High Availability & Scaling¶
- Stateless Services: Horizontal Pod Autoscaler (AKS) or KEDA (ACA).
- Stateful Stores: Active geo-replication (SQL), zone-redundant storage, and backup/restore runbooks.
- Multi-Tenancy Scaling: Edition-aware quotas; per-tenant throttling at the gateway and policy-based limits in services.
- Blue/Green & Canary: Gateway routes plus deployment strategies to minimize risk.
Failure Modes & Recovery (selected)¶
- Gateway Degradation: Fail closed for auth; serve static maintenance page via Front Door.
- Bus Backlog: KEDA autoscale; overflow to DLQ; DLQ replay jobs with circuit breakers.
- DB Hot Partition: Trigger tenant sharding or schema-per-tenant promotion per policy.
- IdP Outage: Use cached tokens within acceptable TTL; degrade gracefully for non-critical flows.
Solution Architect Notes¶
- Start deployments with ACA for simplicity; promote to AKS where fine-grained control or mesh features are required.
- Keep the Gateway thin; push business decisions into services and policy layers.
- Enforce mTLS + workload identity ubiquitously; secrets are exceptions, not norms.
- Make observability non-negotiable: traces/logs/metrics must ship before exposing any public endpoint.
Edge & API Gateway¶
Purpose¶
Establish a secure, policy-driven ingress that terminates public traffic, authenticates at the edge, resolves tenants, enforces edition-aware quotas, and steers requests to the correct backend services. The gateway is a custom .NET Core solution built on YARP (reverse proxy) so we can embed ConnectSoft’s tenancy/security/observability invariants and progressive-delivery controls (canary, blue/green).
Ingress Topology & Trust Boundaries¶
- Public Edge: Azure Front Door + WAF (TLS 1.3, DDoS protection) → Gateway (internet-facing).
- Private App Network: Gateway to services over mTLS within a private VNet.
- Identity Boundary: Gateway is the primary policy enforcement point; it validates OIDC tokens and workload identities, and stamps downstream calls with normalized headers and claims.
- Observability Boundary: Gateway attaches mandatory correlation and tenancy headers (e.g., `x-trace-id`, `x-tenant-id`, `x-edition`) and emits OTel spans.
sequenceDiagram
participant C as Client
participant F as Front Door/WAF
participant G as API Gateway (.NET+YARP)
participant I as Identity (OpenIddict/AAD)
participant S as Backend Service
C->>F: HTTPS request
F->>G: Forward with TLS
G->>I: Validate token / challenge if needed
I-->>G: JWT (tenantId, edition, scopes)
G->>G: Tenant resolution (host/header/token), rate-limit, authZ
G->>S: mTLS + normalized headers (traceId, tenantId, edition)
S-->>G: Response
G-->>C: Response (transforms, caching hints)
Request Lifecycle (Edge Policies)¶
- Transport & TLS: Front Door terminates public TLS; Gateway re-terminates internally and enforces HSTS and strict cipher suites.
- Authentication: OIDC bearer validation at the gateway; anonymous routes explicitly whitelisted (e.g.,
/public/*, webhook callbacks). - Tenant Resolution: Precedence Header (
x-tenant-id) → Hostname subdomain → Token claim; rejects ambiguous/missing tenancy. - Authorization: RBAC/ABAC decision at edge when possible (scope/role/edition checks); fine-grained decisions may be delegated with policy headers.
- Quota & Rate Limiting: Edition-aware token-bucket limits with leaky-bucket smoothing; per-tenant counters.
- Routing & Versioning: Path, header, or media-type versioning (e.g., /v1/..., Accept: application/vnd.connectsoft.v2+json).
- Transforms: Add/remove/normalize headers, response shape harmonization for legacy clients.
- Progressive Delivery: Weighted routing for canary and blue/green; circuit-breakers and outlier detection.
- Observability: OTel spans with traceId, spanId, tenantId, edition, routeId, backendId; structured logs with PII redaction.
Authentication & Authorization at the Edge¶
- Tokens: OIDC JWT with mandatory claims: sub, tenantId, edition, scp (scopes), roles[], entitlements{}.
- Service-to-Service: Gateway accepts mTLS and/or signed JWT from trusted callers (jobs, webhooks) and issues downstream identities via signed headers plus mTLS to services.
- Anonymous Access: Explicit allow-list (e.g., health, well-known, webhook receiver). All else requires valid token.
- Policy Evaluation: Coarse-grained at edge (block early), fine-grained within services (resource-level ABAC). Deny-by-default.
Edition-Aware Rate Limits & Quotas (defaults)¶
| Edition | Global RPM (per tenant) | Burst | Concurrency | Notes |
|---|---|---|---|---|
| Free | 60 | 120 | 10 | Trial-friendly; strict retries |
| Standard | 600 | 1200 | 50 | Typical workloads |
| Enterprise | 3000 | 6000 | 200 | Prioritized queues & support |
| Custom | Configurable | – | – | Set via edition metadata |
Enforced at edge; mirrored by backend safeguards to prevent “quota bypass.”
Routing, Versioning & Canary¶
- Routes: Path-based (/api/tenants/*), tag-based (x-product, x-context), and method-aware (PCI-safe rules for billing).
- Versioning: Path or content-negotiation; gateway validates supported versions and forwards x-api-version downstream.
- Canary / Blue-Green: Weighted clusters (e.g., 90/10) and header-based targeting for internal testers (x-canary: true). Automatic rollback on SLO breach (latency/error-rate thresholds).
- Shadow Traffic (optional): Duplicate a fraction of reads to a shadow backend for safe testing without client impact.
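The canary rules above can be sketched as a small selection function. This is a minimal illustration, not a YARP API: the names WEIGHTS, pick_destination, and rollback are hypothetical, and the choice to ignore the x-canary header once the canary weight has been rolled back to zero is an assumption.

```python
import random

# Hypothetical weights mirroring the 90/10 example in the text.
WEIGHTS = {"stable": 90, "canary": 10}


def pick_destination(headers, weights=WEIGHTS, rng=random):
    """Return the destination name for one request."""
    # Header-based targeting wins for internal testers, but only while
    # the canary still receives traffic (assumption: a rolled-back canary
    # is treated as unavailable for everyone).
    wants_canary = headers.get("x-canary", "").lower() == "true"
    if wants_canary and weights.get("canary", 0) > 0:
        return "canary"
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def rollback(weights):
    """Shift all traffic to stable, e.g. after an SLO breach."""
    return {name: (100 if name == "stable" else 0) for name in weights}
```

Passing a seeded random.Random instance as rng makes the split reproducible in tests; in a real gateway the weights would come from cluster configuration rather than a module constant.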
Request/Response Transformations (examples)¶
- Inbound: Strip hop-by-hop headers; enforce x-tenant-id; normalize Accept and Content-Type; inject x-correlation-id when missing.
- Outbound: Map backend error envelopes to a factory-standard error schema; attach cache hints for GETs; remove internal headers.
Resilience at the Edge¶
- Timeouts: Sensible route-level timeouts (e.g., 2s internal, 5s external).
- Retries: Idempotent methods only (GET/HEAD/OPTIONS) with exponential backoff + jitter.
- Circuit Breakers: Per-destination outlier detection; automatic ejection and gradual recovery.
- Backpressure: 429 with Retry-After and tenant-specific hints; shed load for Free first.
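The retry rule above (idempotent methods only, exponential backoff with jitter) can be sketched as follows. This is a minimal sketch under stated assumptions: send is a hypothetical callable that returns an HTTP status code, the retryable status set is assumed because the text does not list one, and the base and cap delays are placeholder values.

```python
import random
import time

IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS"}
RETRYABLE_STATUSES = {502, 503, 504}  # assumed set; not specified in the text


def backoff_delays(max_retries=3, base=0.1, cap=2.0, rng=random):
    """Full-jitter delays: uniform in [0, min(cap, base * 2**attempt)]."""
    return [rng.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]


def call_with_retries(method, send, max_retries=3, sleep=time.sleep, rng=random):
    """Call send() once, then retry only idempotent methods on retryable statuses."""
    retries = max_retries if method.upper() in IDEMPOTENT_METHODS else 0
    status = send()
    for delay in backoff_delays(retries, rng=rng):
        if status not in RETRYABLE_STATUSES:
            break
        sleep(delay)
        status = send()
    return status
```

Non-idempotent methods such as POST get exactly one attempt, so a failed write is never replayed by the edge.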
Observability (Gateway Signals)¶
- Traces: gateway.request, gateway.route.resolve, gateway.authn, gateway.authz, gateway.ratelimit, gateway.proxy.
- Metrics: Requests/sec by route/tenant/edition, p95/p99 latency, upstream error rate, ejected destinations, rate-limit hits, auth failures.
- Logs: Redacted request/response summaries with route, tenant, edition, traceId; WAF correlation.
Configuration Templates (illustrative)¶
YARP Routes & Clusters (weighted canary + transforms)
{
  "ReverseProxy": {
    "Routes": {
      "tenant-api": {
        "ClusterId": "tenant-svc",
        "Match": { "Path": "/api/tenants/{**catch-all}" },
        "AuthorizationPolicy": "RequireAuthenticatedUser",
        "RateLimiterPolicy": "EditionAwarePolicy",
        "CorsPolicy": "Default",
        "Transforms": [
          { "RequestHeader": "X-Correlation-Id", "Set": "{TraceId}" },
          { "RequestHeader": "X-Tenant-Id", "Set": "{TenantId}" },
          { "RequestHeaderOriginalHost": "true" }
        ]
      }
    },
    "Clusters": {
      "tenant-svc": {
        "Destinations": {
          "stable": { "Address": "http://tenant-svc-v1/" },
          "canary": { "Address": "http://tenant-svc-v2/" }
        },
        "LoadBalancingPolicy": "PowerOfTwoChoices",
        "SessionAffinity": { "Enabled": false },
        "HealthCheck": { "Passive": { "Enabled": true } },
        "Metadata": { "CanaryWeights": "stable=90;canary=10" }
      }
    }
  }
}
.NET Rate Limiting (edition-aware)
using System.Threading.RateLimiting;

builder.Services.AddRateLimiter(options =>
{
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;
    options.AddPolicy("EditionAwarePolicy", context =>
    {
        // Unknown or missing edition falls back to the Free limits.
        var edition = context.User?.FindFirst("edition")?.Value ?? "Free";
        var (permit, replen, burst) = edition switch
        {
            "Enterprise" => (3000, TimeSpan.FromMinutes(1), 6000),
            "Standard" => (600, TimeSpan.FromMinutes(1), 1200),
            _ => (60, TimeSpan.FromMinutes(1), 120),
        };
        // One bucket per edition + tenant. Assumes x-tenant-id has already
        // been validated and normalized by the tenant-resolution step.
        return RateLimitPartition.GetTokenBucketLimiter(
            partitionKey: $"{edition}:{context.Request.Headers["x-tenant-id"]}",
            factory: _ => new TokenBucketRateLimiterOptions
            {
                TokenLimit = burst,            // burst capacity
                TokensPerPeriod = permit,      // sustained requests per period
                ReplenishmentPeriod = replen,
                AutoReplenishment = true,
                QueueLimit = 0                 // reject immediately instead of queueing
            });
    });
});
Program.cs (YARP + OIDC + OTel + mTLS enforcement)
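var builder = WebApplication.CreateBuilder(args);

builder.Services.AddAuthentication("Bearer")
    .AddJwtBearer("Bearer", o =>
    {
        o.Authority = builder.Configuration["IdP:Authority"];
        // Illustrative only: a production gateway should set o.Audience
        // and leave audience validation enabled.
        o.TokenValidationParameters.ValidateAudience = false;
        o.MapInboundClaims = false;
    });
// Registers the policy referenced by the route's "AuthorizationPolicy" setting.
builder.Services.AddAuthorization(o =>
    o.AddPolicy("RequireAuthenticatedUser", p => p.RequireAuthenticatedUser()));
builder.Services.AddReverseProxy().LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"));
builder.Services.AddOpenTelemetry().WithTracing(t => t.AddAspNetCoreInstrumentation().AddHttpClientInstrumentation());

var app = builder.Build();

app.UseAuthentication();
app.Use(async (ctx, next) =>
{
    // Enforce mTLS from Front Door private link or internal LB
    if (!ctx.Connection.ClientCertificate?.Verify() ?? true)
    {
        ctx.Response.StatusCode = StatusCodes.Status403Forbidden;
        return;
    }
    await next();
});
// Required because the proxied route carries authorization metadata.
app.UseAuthorization();
app.UseRateLimiter();
app.MapReverseProxy();
app.Run();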
Failure Modes & Playbooks (selected)¶
- Token Validation Failures: Return 401/WWW-Authenticate; verify IdP health; enable cached signing keys with TTL; fail closed for sensitive routes.
- Backend Saturation: Trigger circuit breaker; reduce canary weight to 0; raise 429 with Retry-After.
- Route Drift / Misconfig: Config lint & contract tests in CI; runtime config reload guarded by feature flag; instant rollback to last-known-good.
- Tenant Ambiguity: Reject with 400 + problem details; provide diagnostic trace; require explicit x-tenant-id or correct host.
Solution Architect Notes¶
- Keep the gateway thin but policy-rich: authN/Z, tenancy, quotas, and traffic shaping; no business logic.
- Prefer header-based canary steering for internal testing and percentage-based for public rollouts.
- Make edition-aware rate limiting visible to tenants via headers and usage endpoints.
- Treat the gateway as a security product: frequent pen tests, strict dependency hygiene, and SBOM/signing in CI.
Identity, Authentication & Authorization¶
Purpose¶
Provide a unified, multi-tenant identity plane for users and services. Standardize OAuth2/OIDC flows, token and claim design, RBAC/ABAC policy enforcement, and workload identity for service-to-service calls. Support both a custom .NET (OpenIddict) Identity Provider and external IdPs (Azure AD/Okta) behind a federation boundary.
Trust Boundaries & High-Level Flows¶
flowchart LR
U[User/Client App] -->|OIDC| G[Edge Gateway]
G --> IdP["Identity Provider (OpenIddict/AAD)"]
G --> Svc[Backend Services]
IdP -->|"JWT (tenantId, edition, scopes, roles)"| G
G -->|mTLS + normalized identity headers| Svc
Svc -->|"policy check (RBAC/ABAC)"| Svc
subgraph Identity Plane
IdP
end
subgraph Workload Plane
Svc
end
classDef boundary stroke-width:2,stroke:#999
Boundaries
- Public boundary: Clients ↔ Gateway (OIDC/OAuth2); WAF + TLS 1.3.
- Control boundary: Gateway ↔ Services (mTLS + JWT, workload identity).
- Identity boundary: Gateway trusts IdP token-signing keys; services trust gateway-issued identity context and validate tokens again internally.
Identity Provider Options¶
Primary (factory-default): Custom .NET IdP using OpenIddict
- Supports Authorization Code + PKCE, Client Credentials, Device Code (optional), and Refresh Tokens.
- Multi-tenant claim issuance; edition/entitlement enrichment from SaaS Core Metadata.
- Local user store (ASP.NET Identity) plus federation with external IdPs.
- SCIM 2.0 (Enterprise) for automated user provisioning and deprovisioning.
Federated (enterprise option): Azure AD / Okta
- External IdP via OIDC federation; Federation ACL maps external attributes to internal roles/scopes.
- Supports SSO, conditional access, MFA, and B2B invites.
Token Model (JWT)¶
Standard claims
- sub, iss, aud, exp, iat, nbf
Multi-tenant & edition claims
- tenantId: authoritative tenant identifier (issued by Tenant Management)
- edition: plan identifier (e.g., Free, Standard, Enterprise, or custom)
- entitlements: bag of feature flags/limits at issuance time (digest, not the source of truth)
Authorization claims
- scp (scopes): API permissions (coarse-grained)
- roles: high-level roles (e.g., tenant_admin, support_agent, billing_admin)
- abac: optional attribute set for policy engines (e.g., {"region":"EU","dataClass":"PII"})
Service identity
- Client Credentials flow issues tokens for service principals with appId, aud, and minimal scopes.
- Downstream calls are authenticated with mTLS and validated JWT, with identity context propagated via headers (x-tenant-id, x-actor, x-roles, x-scope).
Lifetimes (defaults)
- Access token: 15 minutes
- Refresh token: 24 hours (rotating)
- Client credentials token: 10 minutes
- Key rotation: ≤ 90 days (automated), JWKS exposed
RBAC / ABAC Authorization¶
RBAC (role-based)
- Roles assigned per tenant; example roles: tenant_admin, member, billing_admin, support_agent, operator.
- Services enforce role gates for administrative operations.
ABAC (attribute-based)
- Policies evaluate attributes from token + request context (tenant, edition, resource owner, region, data class).
- Example: “Users with role support_agent may read logs only when tenant.support_access=true and dataClass != PII.”
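The example policy can be written as a single predicate. This is a minimal sketch: can_read_logs and its argument shapes are hypothetical, the function covers only the support_agent rule quoted above, and treating a missing dataClass attribute as a denial is an assumption that follows the deny-by-default stance.

```python
def can_read_logs(roles, tenant_attrs, resource_attrs):
    """Evaluate the support_agent log-read policy; deny by default."""
    if "support_agent" not in roles:
        return False  # this rule only grants access to support agents
    if tenant_attrs.get("support_access") is not True:
        return False  # tenant has not opted in to support access
    data_class = resource_attrs.get("dataClass")
    # Assumption: an unclassified resource is denied rather than permitted.
    return data_class is not None and data_class != "PII"
```

Keeping the check a pure function of token and resource attributes makes it straightforward to test and to audit at the API boundary.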
Hybrid
- Coarse authorization at the gateway (scopes/roles).
- Fine-grained authorization inside services (ABAC over resource attributes).
Scope Catalog (illustrative)¶
| Scope | Audience | Description | Typical Roles |
|---|---|---|---|
| tenant.read | Tenant API | Read tenant profile & settings | tenant_admin, support_agent |
| tenant.manage | Tenant API | Create/Update/Delete tenant resources | tenant_admin |
| billing.read | Billing API | Read subscriptions/invoices | billing_admin, tenant_admin |
| billing.manage | Billing API | Modify plans, payment methods | billing_admin |
| config.read | Config API | Read flags and settings | member, tenant_admin |
| config.manage | Config API | Create/Update flags, overrides | tenant_admin |
| usage.read | Usage API | Read metering/quota | tenant_admin, support_agent |
| audit.read | Audit API | Query audit logs | tenant_admin, operator |
| notify.send | Notify API | Send messages (templatized) | tenant_admin |
| ai.orchestrate | AI API | Invoke agentic flows/tools | tenant_admin, engineer |
Scopes are additive; deprecations follow a sunset policy. Services validate both scope and role where applicable.
Login & Token Issuance (sequence)¶
sequenceDiagram
participant B as Browser/App
participant G as Gateway
participant I as IdP (OpenIddict/AAD)
participant M as Metadata (Products/Entitlements)
B->>G: /authorize
G->>I: OIDC Auth Code + PKCE
I-->>B: auth code
B->>G: code exchange
G->>I: token request
I->>M: enrich claims (tenant, edition, entitlements)
M-->>I: entitlement snapshot
I-->>G: id_token + access_token + refresh_token
G-->>B: session established (SPA stores tokens securely)
Notes
- Enrichment pulls current edition/entitlements at issuance; services must still consult Config for real-time flag evaluation.
- PKCE & MFA recommended for all first-party SPAs and public clients.
Tenant Resolution & Federation¶
- Resolution precedence: x-tenant-id header → subdomain → token claim. Gateway rejects ambiguous requests.
- Federation: Enterprise tenants may authenticate via external IdPs; federation ACL translates external groups/claims into internal roles/scopes.
- SCIM (Enterprise): Automates user/role provisioning; deprovision triggers session revocation.
Workload Identity (service-to-service)¶
- Managed Identity (AKS/ACA) binds workloads to identities; outbound calls signed at transport (mTLS) and application (JWT).
- No static secrets in services; Key Vault for exceptional credentials (e.g., third-party webhooks).
- Downstream identity propagation: services forward correlation and minimal identity context; avoid token forwarding unless necessary.
Security & Privacy Controls¶
- Zero Trust: deny-by-default, least privilege, explicit allow-lists for anonymous routes.
- mTLS: gateway↔service and service↔service; certificate pinning where feasible.
- PEP/PDP separation: gateway acts as Policy Enforcement Point; services host Policy Decision logic for resource-level checks.
- PII safety: never write raw PII to logs; redaction at sinks; audit every elevation (admin actions).
- Consent & Terms: first-class records per tenant; tracked in Audit.
Observability Signals¶
- Auth signals: token issuance latency, failed validations, JWKS fetch errors.
- Access signals: authz denials by route/scope/role, edition-policy mismatches.
- Federation: IdP health, SCIM drift (orphaned accounts), SSO error rates.
- Secrets/Certs: rotation age, expiring keys/certs, failed rotations (SEV-1).
Failure Modes & Mitigations¶
- IdP outage: cached signing keys and grace tokens for short read-only windows; degrade non-critical flows.
- Clock skew: NTP enforcement; leeway on nbf/exp validation (≤ 2 minutes).
- Stale entitlements: tokens carry snapshots; Config is source of truth for runtime decisions; short access-token lifetimes reduce drift.
- Compromised refresh token: rotate on every use; maintain reuse detection; revoke sessions on suspicion.
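The clock-skew mitigation above can be illustrated with a small time-window check. This is a sketch only: check_token_times, REQUIRED_CLAIMS, and the reason strings are hypothetical, the 60-second default leeway is an assumed value inside the stated 2-minute bound, and signature and issuer validation are out of scope here.

```python
import time

REQUIRED_CLAIMS = ("sub", "tenantId", "edition", "exp")
MAX_LEEWAY_SECONDS = 120  # the 2-minute upper bound from the text


def check_token_times(claims, now=None, leeway=60):
    """Return None when the token is acceptable, else a short reason string."""
    if leeway > MAX_LEEWAY_SECONDS:
        raise ValueError("leeway exceeds the 2-minute bound")
    now = time.time() if now is None else now
    missing = [c for c in REQUIRED_CLAIMS if c not in claims]
    if missing:
        return "missing_claims:" + ",".join(missing)
    if now > claims["exp"] + leeway:
        return "expired"
    # nbf is optional; when present the same leeway applies.
    if "nbf" in claims and now < claims["nbf"] - leeway:
        return "not_yet_valid"
    return None
```

Capping the leeway in code keeps a misconfigured service from silently extending token lifetimes.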
Solution Architect Notes¶
- Prefer OpenIddict for first-party control and rapid feature iteration; use federation to honor enterprise SSO requirements without coupling domain models to external IdPs.
- Keep tokens small and short-lived; push dynamic decisions to Config and policy engines.
- Enforce workload identity + mTLS ubiquitously; treat any secret-based fallbacks as temporary waivers with expiry.
- Model authorization outside the UI; all decisions must be verifiable at API boundaries and auditable.
Multi-Tenancy Strategy¶
Purpose¶
Define how the platform identifies, isolates, and governs tenants across the stack. This section standardizes tenant resolution, isolation levels (pooled/schema/database), configuration/flags enforcement, and onboarding & migration flows so products can scale safely from trials to large enterprises without redesign.
Tenancy Model Overview¶
- Tenant as first-class identity: every request, job, event, and data row is associated with exactly one tenantId (or an allowed system actor).
- Edition-aware policies: quotas, features, and SLO overlays are resolved at runtime per tenant.
- Security & observability invariants: tenant context is mandatory at ingress, persisted with data, and present on all telemetry.
Tenant Resolution¶
Resolution precedence (strict):
- Header: x-tenant-id (authoritative in service-to-service calls)
- Host: subdomain.example.com → tenantId mapping
- Path: /t/{tenantId}/... (supported for specific APIs)
- Token: JWT claim tenantId (validated but not preferred for multi-tenant APIs)
If the gateway detects ambiguity or mismatch (e.g., header vs host disagree), the request is rejected with a 400 including problem details and a correlation ID.
Resolver flow (edge):
flowchart LR
A[Request Arrives] --> B{Has x-tenant-id?}
B -- Yes --> C[Validate & Normalize Id]
B -- No --> D{Subdomain present?}
D -- Yes --> E["Lookup mapping -> tenantId"]
E --> C
D -- No --> F{"Path /t/{id}?"}
F -- Yes --> C
F -- No --> G{Token has tenantId?}
G -- Yes --> C
G -- No --> X[Reject 400: tenant_ambiguous]
C --> H{Tenant active & region allowed?}
H -- Yes --> I[Attach tenant to context, continue]
H -- No --> X
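The precedence and mismatch rules can be sketched in a few lines. This is a minimal illustration under assumptions: resolve_tenant, TenantResolutionError, the reason strings, and the tenant id format are hypothetical, host_map stands in for the subdomain lookup, and the tenant-status and region checks from the flow are omitted.

```python
import re

TENANT_ID = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}$")  # assumed id format
PATH_TENANT = re.compile(r"^/t/([^/]+)(/|$)")


class TenantResolutionError(Exception):
    """Raised when the request should be rejected with a 400."""


def resolve_tenant(headers, host, path, token_claims, host_map):
    """Collect candidates in precedence order: header, host, path, token."""
    candidates = []
    if headers.get("x-tenant-id"):
        candidates.append(headers["x-tenant-id"])
    subdomain = host.split(".")[0] if host.count(".") >= 2 else None
    if subdomain in host_map:
        candidates.append(host_map[subdomain])
    match = PATH_TENANT.match(path)
    if match:
        candidates.append(match.group(1))
    if token_claims.get("tenantId"):
        candidates.append(token_claims["tenantId"])

    normalized = [c.strip().lower() for c in candidates]
    if not normalized:
        raise TenantResolutionError("missing")
    if len(set(normalized)) > 1:
        raise TenantResolutionError("mismatch")  # e.g. header vs host disagree
    tenant_id = normalized[0]
    if not TENANT_ID.match(tenant_id):
        raise TenantResolutionError("invalid")
    return tenant_id
```

Gathering every available source before deciding is what lets the resolver detect a disagreement, instead of silently trusting whichever source it happened to read first.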
Isolation Levels¶
| Isolation | Description | When to Use | Data Guarding | Strengths | Trade-offs |
|---|---|---|---|---|---|
| Pooled | Shared schema + tables, tenantId column on all rows | Trials, SMB, moderate scale | Repo guards + RLS (Row-Level Security) | Highest density, lowest cost | Hot-tenant contention; noisy neighbor risk |
| Schema-per-Tenant | Dedicated schema per tenant in same DB | Mid-market, heavier customizations | Schema scoping + connection factory | Easier per-tenant backup/restore; reduced contention | Higher catalog bloat; ops overhead |
| Database-per-Tenant | Dedicated database/server per tenant | Enterprise, regulatory isolation | Network isolation + DB-level IAM | Strongest isolation; independent lifecycle | Highest cost; cross-tenant reporting complexity |
Promotion path: pooled → schema → database, triggered by tenant size, SLO breach risk, or regulatory needs. Promotions are online using CDC-based sync and dual-writes during cutover (see Migration).
Tenancy Enforcement (defense in depth)¶
| Layer | Enforcement Mechanism | Mandatory Checks |
|---|---|---|
| Gateway | Resolver → inject x-tenant-id; deny ambiguous; edition-aware rate limit | Token validation, tenant status (active/suspended), region allow-list |
| Service API | Policy filters & guards | Require tenant in context; cross-tenant IDs rejected |
| Domain Logic | Tenant-scoped commands/queries | Invariants include tenantId; never accept client-provided cross-tenant references |
| Repository/DAL | RLS or tenant filters; parameterized queries | WHERE tenant_id = @tenantId always; no string-concatenated SQL |
| Messaging | Envelope headers (tenantId, traceId, edition); scoped consumers | Consumers reject missing/foreign tenant headers; per-tenant DLQ segregation |
| Cache | Tenant-scoped keys | cache:{tenantId}:{key}; no shared mutable data |
| Storage/Blob | Tenant prefix & ACLs | tenants/{tenantId}/...; private containers; tenant KMS policies (optional) |
| Observability | Required attributes on spans/logs/metrics | tenantId, edition, traceId present; queries default to tenant scope |
Configuration, Flags & Entitlements¶
- Resolution order: platform defaults → product defaults → edition pack → tenant overrides → (optional) user context.
- Flag evaluation: low-latency via cache (Redis) with consistent hashing; cache entries are tenant-scoped and short-lived.
- Entitlements in tokens: treated as snapshots; definitive decision uses Config at request time for drift-free enforcement.
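The resolution order above amounts to a layered override, where later layers win. A minimal sketch, assuming each layer is a flat dictionary of settings; LAYER_ORDER and resolve_config are hypothetical names, and real flag stores usually add targeting rules that are not modeled here.

```python
# Lowest to highest precedence, following the order given in the text.
LAYER_ORDER = ("platform", "product", "edition", "tenant", "user")


def resolve_config(layers):
    """Merge layers; return {key: (value, winning_layer)}.

    layers maps a layer name to a dict of settings; absent layers are skipped.
    """
    resolved = {}
    for name in LAYER_ORDER:
        for key, value in layers.get(name, {}).items():
            resolved[key] = (value, name)
    return resolved
```

Returning the winning layer alongside each value helps support staff explain why a tenant sees a particular setting.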
Data Residency & Regional Routing¶
- Residency attribute on tenant (e.g., EU-WEST, US-EAST) selected at onboarding or via enterprise contract.
- Routing at gateway directs requests to the region’s workload plane; cross-region access is denied unless an explicit policy allows it.
- Data stores are regionally isolated with Private Link; cross-region replication follows DR policy (RPO/RTO).
Onboarding & Lifecycle¶
States: requested → provisioning → active → suspended → deleted
sequenceDiagram
participant U as Tenant Admin
participant GW as Gateway
participant TEN as Tenant Mgmt
participant META as SaaS Core Metadata
participant CONF as Config/Flags
participant BILL as Billing
participant IDP as Identity
U->>GW: Sign up / create tenant
GW->>TEN: create_tenant(request)
TEN->>META: seed_product_edition(entitlements)
TEN->>CONF: seed_default_flags(tenantId, edition)
TEN->>BILL: create_subscription(plan)
BILL-->>TEN: subscription.pending
TEN->>IDP: provision_realm(tenant claims, roles)
TEN-->>GW: provisioning_complete
Note over TEN: State = active
GW-->>U: Activation success + admin invite
Suspend/Resume/Delete
- Suspend → revoke sessions; freeze subscription; block writes (read-only mode optional).
- Delete → two-phase: soft-delete with grace → hard-delete (after retention/erasure workflows).
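The lifecycle states can be guarded with an explicit transition table. This is a sketch under assumptions: the text gives the linear path and names suspend, resume, and delete, while the exact edge set below (including deleting directly from active) and the function names are hypothetical.

```python
# Allowed transitions. Edges beyond the linear path in the text
# (resume, delete directly from active) are assumptions.
TRANSITIONS = {
    "requested": {"provisioning"},
    "provisioning": {"active"},
    "active": {"suspended", "deleted"},
    "suspended": {"active", "deleted"},
    "deleted": set(),
}


def transition(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal tenant transition: {current} -> {target}")
    return target


def writes_allowed(state):
    """Only active tenants may write; suspended tenants are blocked."""
    return state == "active"
```

The two-phase delete (soft-delete with grace, then hard-delete) would sit behind the deleted state and is not modeled in this table.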
Migration & Promotion Flows¶
Use cases
- Hot tenant promotion from pooled → schema → database.
- Region move for residency or latency.
- Edition-driven data shape change (e.g., enabling advanced features).
Approach (pooled → schema/db):
- Prepare: Create target schema/DB; provision IAM and RLS.
- Sync: Enable CDC or change feed; backfill historical data; start dual-writes.
- Cutover: Drain inflight ops; flip tenant connection mapping at resolver; verify read/write health.
- Finalize: Disable dual-writes; decommission old partition after retention window.
Zero-downtime guardrails
- Idempotent writes; natural keys stable across partitions.
- All services obtain tenant-specific connection info via Tenant Directory cache (with TTL and fast invalidation).
- Feature flag tenant.migration.read_only toggles to protect critical sections.
Observability & SLOs (tenancy-centric)¶
| Signal | Target | Notes |
|---|---|---|
| Tenant onboarding p95 | ≤ 60s | create → active |
| Resolver failure rate | < 0.01% | ambiguous/missing tenant |
| Cross-tenant access violations | 0 | treated as SEV-1 |
| Promotion cutover duration | ≤ 60s | dual-write window bounded |
| Flag evaluation latency p95 | ≤ 5 ms | local/Redis-backed |
Dashboards include per-tenant views for latency, error rates, quota consumption, and migration progress.
Security & Privacy Notes¶
- Tenant authority lives in Tenant Management; all other contexts validate inbound tenantId against their read model.
- No cross-tenant joins in read models unless explicitly marked “multi-tenant analytics” and routed through safe aggregation pipelines.
- Erasure support: tenant-owned PII deletions orchestrated via workflow; audit remains immutable with tokenized references.
Failure Modes & Mitigations¶
- Ambiguous resolution: 400 with diagnostics; require explicit header; emit audit event.
- Noisy neighbor: edition-aware throttling at edge; per-tenant queue partitioning; promote isolation level.
- Stale connection mapping: short TTL + cache bust on migration; fallback to directory lookup.
- Cross-tenant leak bug: automated tenant-fence tests in CI; runtime guard that verifies returned rows belong to request tenant (sample-based).
- Region outage: failover only for tenants whose contracts permit cross-region DR; others remain isolated per residency policy.
Solution Architect Notes¶
- Start all tenants pooled; automate promotion paths and keep them routine—not exceptional.
- Prefer RLS where supported; otherwise enforce repository guards and property-based testing to prove scoping.
- Keep the Tenant Directory authoritative for connection/partition info; never embed static routing in code.
- Treat tenant context as non-optional telemetry—it’s the first dimension for debugging, scaling, and support.
Event-Driven Backbone & Contracts¶
Purpose¶
Establish a canonical, versioned event backbone that connects bounded contexts with loose coupling and reliable delivery. Standardize the event envelope, headers, topics/queues, outbox/inbox patterns, idempotency, and DLQ handling, so teams can ship independently while maintaining a stable integration surface.
Principles¶
- Event-first collaboration: Services publish domain facts; consumers react and build local read models.
- Canonical envelope: All events carry the same required headers; payloads are versioned and backwards-compatible.
- At-least-once + idempotency: Producers use outbox; consumers use inbox + idempotency keys.
- Tenant isolation: Events are tenant-scoped by default; cross-tenant payloads are prohibited unless flagged as aggregate analytics.
- Observable by design: Every event includes telemetry context and is trace-linked to causative actions.
Canonical Envelope (CloudEvents-aligned)¶
Headers (required)
- type: semantic event name with version suffix, e.g., tenant.created.v1
- id: globally unique event id (ULID/GUID)
- source: service/bounded-context name, e.g., tenant-svc
- specversion: 1.0
- time: RFC3339 timestamp
- traceId: W3C traceparent correlation id
- tenantId: authoritative tenant identity
- edition: edition at the time of emission (snapshot)
- schemaVersion: semantic version of the data payload (e.g., 1.0.0)
- partitionKey: default tenantId (for ordering at consumer/queue level)
- key: idempotency key for the business entity / sequence (e.g., subscription id)
Payload (data)
- Domain-specific fields (no PII unless absolutely required; prefer references and lookups).
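An envelope check along these lines could run in CI linting or at the consumer boundary. This is a minimal sketch: validate_envelope, the problem strings, and the exact regular expressions are assumptions, and the traceId format is deliberately not checked.

```python
import re
from datetime import datetime

REQUIRED_HEADERS = (
    "type", "id", "source", "specversion", "time", "traceId",
    "tenantId", "edition", "schemaVersion", "partitionKey", "key",
)
# At least domain.event plus a .vN suffix, e.g. tenant.created.v1 (assumed shape).
TYPE_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+\.v[1-9][0-9]*$")
SEMVER_PATTERN = re.compile(r"^\d+\.\d+\.\d+$")


def validate_envelope(event, expected_tenant_id=None):
    """Return a list of problems; an empty list means the envelope passes."""
    problems = [f"missing:{h}" for h in REQUIRED_HEADERS if not event.get(h)]
    if problems:
        return problems
    if not TYPE_PATTERN.match(event["type"]):
        problems.append("type:bad_format")
    if event["specversion"] != "1.0":
        problems.append("specversion:unsupported")
    if not SEMVER_PATTERN.match(event["schemaVersion"]):
        problems.append("schemaVersion:not_semver")
    try:
        # fromisoformat in older Python versions does not accept a trailing Z.
        parsed = datetime.fromisoformat(event["time"].replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            problems.append("time:no_offset")
    except ValueError:
        problems.append("time:not_rfc3339")
    if expected_tenant_id is not None and event["tenantId"] != expected_tenant_id:
        problems.append("tenantId:mismatch")
    return problems
```

A consumer would pass the tenant it is scoped to as expected_tenant_id, so that a foreign event is rejected before any handler runs.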
Event Bus Topology¶
flowchart LR
subgraph Producers
TEN[Tenant Svc] -->|Outbox| EB(Service Bus Topics)
BILL[Billing Svc] -->|Outbox| EB
CONF[Config Svc] -->|Outbox| EB
USE[Usage Svc] -->|Outbox| EB
IDP[Identity Svc] -->|Outbox| EB
NOTIF[Notifications] -->|Outbox| EB
AUD[Audit Svc] -->|Outbox| EB
end
EB -->|Subscriptions| TEN_SUB[Tenancy Subscriptions]
EB --> BILL_SUB[Billing Subscriptions]
EB --> CONF_SUB[Config Subscriptions]
EB --> USE_SUB[Usage Subscriptions]
EB --> AUD_SUB[Audit Archive]
EB --> NOTIF_SUB[Delivery Workers]
classDef svc fill:#0b6,stroke:#094,color:#fff;
classDef bus fill:#234,stroke:#123,color:#fff;
classDef sub fill:#357,stroke:#234,color:#fff;
Conventions
- Topic-per-domain (e.g., tenant-events, billing-events, config-events, usage-events, identity-events, notifications-events, audit-events).
- Subscription-per-consumer with optional filters (SQL filters on type, tenantId, edition).
- DLQ per subscription; DLQ contents are immutable and auditable.
Versioning Strategy¶
- Event type includes a major payload version (.v1, .v2).
- Additive changes (new fields) do not bump major; consumers must be tolerant readers.
- Breaking changes create a new type (tenant.created.v2). Old and new may coexist during migration.
- Deprecation window announced in contracts; observability verifies consumer adoption.
Outbox / Inbox / Idempotency¶
Outbox (producer)
- Transactionally stores pending events with business state changes.
- Background dispatcher publishes to the bus with retry/backoff. A crash between publishing and marking the row as dispatched can re-send an event, so delivery is at-least-once end-to-end and consumers must de-duplicate.
Inbox (consumer)
- Stores processed event ids/keys to de-duplicate.
- Idempotency key chosen per aggregate (e.g., subscriptionId, userId, flagName@version).
Reentrancy rules
- Handlers must be idempotent and side-effect-safe.
- Use sagas for long-running processes; each step commits with an idempotency boundary.
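The inbox pattern can be sketched with an in-memory SQLite database: the inbox row and the state change commit in one transaction, so a redelivered event is detected and skipped. The table names, open_store, and handle_usage_recorded are hypothetical, and de-duplicating on the event id (rather than the business key) is one of the two options the text allows.

```python
import sqlite3


def open_store():
    """Create an in-memory store with an inbox and one read-model table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE inbox (event_id TEXT PRIMARY KEY)")
    db.execute(
        "CREATE TABLE usage_totals (tenant_id TEXT PRIMARY KEY, total INTEGER NOT NULL)"
    )
    return db


def handle_usage_recorded(db, event):
    """Apply the event once; return False when it was already processed."""
    try:
        with db:  # one transaction: inbox row and state change commit together
            db.execute("INSERT INTO inbox (event_id) VALUES (?)", (event["id"],))
            db.execute(
                "INSERT OR IGNORE INTO usage_totals (tenant_id, total) VALUES (?, 0)",
                (event["tenantId"],),
            )
            db.execute(
                "UPDATE usage_totals SET total = total + ? WHERE tenant_id = ?",
                (event["data"]["amount"], event["tenantId"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the inbox primary key rejected it
```

Because the duplicate check is the primary-key constraint itself, two concurrent deliveries of the same event cannot both apply their side effects.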
DLQ & Replay¶
- DLQ contracts: poison messages are never modified; metadata records the failures and handler stack.
- Replay tools: Operator-driven jobs pull DLQ batches → run through isolation workers with circuit breakers and quarantine on repeated failure.
- Observability: DLQ depth, age, and replay success rate are first-class metrics.
- Retention: DLQs retained ≥ 30 days (configurable), Audit retains summary references.
Sample Events (JSON)¶
1) Tenant Created
{
"type": "tenant.created.v1",
"id": "01HZXZ0N4Q6T3V3Y1W1A2B3C4D",
"source": "tenant-svc",
"specversion": "1.0",
"time": "2025-09-29T10:15:30Z",
"traceId": "00-7e0d...-01",
"tenantId": "t-123",
"edition": "Standard",
"schemaVersion": "1.0.0",
"partitionKey": "t-123",
"key": "t-123",
"data": {
"name": "Acme Ltd",
"region": "EU-WEST",
"ownerUserId": "u-789"
}
}
2) Tenant Updated
{
"type": "tenant.updated.v1",
"id": "01HZXZ0N4Q6T3V3Y1W1A2B3C5E",
"source": "tenant-svc",
"specversion": "1.0",
"time": "2025-09-29T10:17:00Z",
"traceId": "00-7e0d...-02",
"tenantId": "t-123",
"edition": "Standard",
"schemaVersion": "1.0.0",
"partitionKey": "t-123",
"key": "t-123",
"data": {
"changes": {
"edition": { "old": "Free", "new": "Standard" }
}
}
}
3) Subscription Activated
{
"type": "billing.subscription.activated.v1",
"id": "01HZXZ0N4Q6T3V3Y1W1A2B3C6F",
"source": "billing-svc",
"specversion": "1.0",
"time": "2025-09-29T11:00:00Z",
"traceId": "00-9a1c...-01",
"tenantId": "t-123",
"edition": "Enterprise",
"schemaVersion": "1.1.0",
"partitionKey": "t-123",
"key": "sub-5566",
"data": {
"subscriptionId": "sub-5566",
"plan": "Enterprise",
"startDate": "2025-10-01"
}
}
4) Usage Meter Recorded
{
"type": "usage.meter.recorded.v1",
"id": "01HZXZ0N4Q6T3V3Y1W1A2B3C7G",
"source": "usage-svc",
"specversion": "1.0",
"time": "2025-09-29T11:05:12Z",
"traceId": "00-bb2d...-03",
"tenantId": "t-123",
"edition": "Enterprise",
"schemaVersion": "1.0.0",
"partitionKey": "t-123",
"key": "meter:api_calls:2025-09-29T11:05:00Z",
"data": {
"meter": "api_calls",
"amount": 37,
"windowStart": "2025-09-29T11:05:00Z",
"windowSizeSec": 60
}
}
5) Config Flag Updated
{
"type": "config.flag.updated.v1",
"id": "01HZXZ0N4Q6T3V3Y1W1A2B3C8H",
"source": "config-svc",
"specversion": "1.0",
"time": "2025-09-29T11:10:00Z",
"traceId": "00-cc3e...-01",
"tenantId": "t-123",
"edition": "Enterprise",
"schemaVersion": "1.0.0",
"partitionKey": "t-123",
"key": "flag:betaFeature",
"data": {
"flag": "betaFeature",
"value": true,
"actor": "u-42"
}
}
6) User Invited
{
"type": "identity.user.invited.v1",
"id": "01HZXZ0N4Q6T3V3Y1W1A2B3C9I",
"source": "identity-svc",
"specversion": "1.0",
"time": "2025-09-29T11:12:34Z",
"traceId": "00-11aa...-01",
"tenantId": "t-123",
"edition": "Standard",
"schemaVersion": "1.0.0",
"partitionKey": "t-123",
"key": "u-42",
"data": {
"userId": "u-42",
"email": "ada@example.com",
"roles": ["member"]
}
}
7) Notification Delivered
{
"type": "notify.message.delivered.v1",
"id": "01HZXZ0N4Q6T3V3Y1W1A2B3D0J",
"source": "notifications-svc",
"specversion": "1.0",
"time": "2025-09-29T11:20:00Z",
"traceId": "00-22bb...-05",
"tenantId": "t-123",
"edition": "Standard",
"schemaVersion": "1.0.0",
"partitionKey": "t-123",
"key": "msg-9012",
"data": {
"messageId": "msg-9012",
"channel": "email",
"template": "welcome",
"status": "delivered"
}
}
8) Audit Action Logged
{
"type": "audit.action.logged.v1",
"id": "01HZXZ0N4Q6T3V3Y1W1A2B3D1K",
"source": "audit-svc",
"specversion": "1.0",
"time": "2025-09-29T11:22:10Z",
"traceId": "00-33cc...-07",
"tenantId": "t-123",
"edition": "Standard",
"schemaVersion": "1.0.0",
"partitionKey": "t-123",
"key": "audit:2025-09-29:01",
"data": {
"actor": "u-42",
"action": "TENANT_UPDATE",
"resource": "Tenant/t-123",
"result": "success"
}
}
Contract Governance¶
- Source of Truth: contracts/events/*.json with schema definitions and examples.
- Review Process: Producer PR must include schema update + changelog; consumer teams subscribe to contract watch alerts.
- Linting: CI validates envelope headers, allowed field names, and version policies.
- Deprecation: Old types remain for a defined window; producers publish both old/new until consumers confirm adoption.
Observability & SLOs¶
| Signal | Target |
|---|---|
| Event publishing success | ≥ 99.99% |
| End-to-end lag p95 | ≤ 60s (producer commit → consumer handle) |
| Replay success rate | ≥ 99% |
| DLQ age p95 | ≤ 15m |
| Duplicate handling incidents | 0 (idempotent consumers) |
Spans include producer.service, consumer.service, type, tenantId, key. Metrics cover topic depth, subscription lag, DLQ size/age, and handler error rates.
Failure Modes & Mitigations¶
- Duplicate deliveries: Inbox + idempotent handlers; use entity key to ignore repeats.
- Schema drift: Tolerant readers; canary consumers validated in pre-prod; contract tests in CI.
- Bus outage/backlog: Producer backpressure, KEDA scale-out of consumers, DLQ thresholds with alerting.
- Poison messages: Quarantine to DLQ on max attempts; root-cause analysis required before replay.
- Cross-tenant leakage risks: Envelope validator rejects events with a missing or mismatched tenantId; audit every violation as SEV-1.
Solution Architect Notes¶
- Prefer topic-per-domain plus subscription-per-consumer to keep ownership clear.
- Treat events as write-optimized facts; resist synchronous request/response coupling.
- Keep payloads lean and stable; link to resources rather than embedding large objects.
- Make replay safe by ensuring handlers are pure functions over input + idempotent side effects.
Service Taxonomy & Interfaces¶
Purpose¶
Define the platform service catalog aligned with bounded contexts, including responsibilities, exposed interfaces (REST/events/webhooks), storage choices, SLO posture, and cross-cutting constraints (multi-tenancy, security, observability). Provide component-level sketches for representative services to guide implementation.
Catalog by Bounded Context¶
| Context | Service | Core Responsibility | Interfaces | Persistence (default) | Notes |
|---|---|---|---|---|---|
| SaaS Core Metadata | Metadata API | Products, editions, features, entitlements; pack composition | REST (admin), Events | Azure SQL | Upstream for Config, Billing, Identity |
| Identity | IdP (OpenIddict) | OIDC/OAuth2, roles/scopes, federation | OIDC, REST, Events | Azure SQL | SCIM (Enterprise) |
| Tenant Management | Tenant API, Tenant Worker | Tenant lifecycle, residency, directory, promotions | REST, Events, Jobs | Azure SQL | Authoritative tenantId |
| Billing | Billing API, Billing Saga | Plans, subscriptions, invoices, payment provider ACL | REST, Events, Webhooks | Azure SQL | Sagas orchestrate payments |
| Config & Feature Flags | Config API, Flag Evaluator | Flags, edition overrides, kill switches | REST, Events | Azure SQL + Redis | Low-latency evaluation |
| Usage & Metering | Usage Ingest, Usage Query | Meter capture, quota checks, aggregates | Events (in), REST (read) | Azure SQL (+ cold storage) | Emits usage.meter.recorded |
| Notifications | Notify API, Delivery Worker | Email/SMS/Webhooks, templates, branding | REST, Events, Webhooks | MongoDB + Queue | Provider ACLs |
| Audit & Compliance | Audit Append, Audit Query | Immutable append-only log, exports | Events (append), REST (read) | Azure SQL | Long retention |
| AI Orchestration | AI Orchestrator, Agent Workers | Agentic flows, scaffolding, test/doc generation | REST, Jobs, Events | Blob/Queue | Guardrails & audit |
Interface conventions
- REST: external/administrative commands and strongly consistent queries.
- Events: domain facts, outbox/inbox; topic-per-domain.
- Webhooks: signed outbound notifications for tenant systems (Billing/Notify).
- Jobs: KEDA/Hangfire for scheduled tasks and DLQ replayers.
Cross-Cutting Constraints¶
- Multi-tenancy: All operations scoped by tenantId. Repository layer enforces RLS/guards.
- Security: OIDC at edge, mTLS inside, least privilege, no static secrets (workload identity).
- Observability: OTel spans/logs/metrics with tenantId, edition, traceId; dashboards per service.
- SLO posture (baseline): 99.9% availability; p95 read ≤ 200 ms / write ≤ 350 ms; event lag p95 ≤ 60 s.
Exemplar 1 — Metadata API (SaaS Core Metadata)¶
Responsibilities
- Manage Products, Editions, Features, Entitlements, Quotas, and Pack composition.
- Act as source of truth for billing plans, entitlement catalogs, and edition overlays.
- Publish canonical events when definitions change (e.g., product.updated, edition.created).
Exposed Interfaces
- REST (admin/product owner):
  - POST /api/metadata/products (create product)
  - POST /api/metadata/editions (define edition)
  - POST /api/metadata/features (define feature/entitlement)
  - GET /api/metadata/products/{id}
- Events (producer):
  - metadata.product.created.v1
  - metadata.edition.created.v1
  - metadata.feature.updated.v1
Storage
- Azure SQL: relational model for products, editions, features, entitlements.
- Immutable history tables for auditability.
Component Sketch
flowchart LR
API[Metadata API] --> APP[App Layer / Validation]
APP --> REPO[Metadata Repository]
REPO --> SQL[(Azure SQL)]
APP --> OUTBOX[Outbox Dispatcher]
OUTBOX --> BUS[[Service Bus]]
style API fill:#7d5dfc,stroke:#4b33b3,color:#fff
Security & Tenancy
- Administrative operations require metadata.manage scope, reserved for platform/product owners.
- Queries may be global (cross-tenant), but tenant-facing tokens only receive read-scoped access to permitted metadata.
Observability
- Spans: metadata.createProduct, metadata.updateEdition.
- Metrics: product definition latency, cache hit ratio for entitlement lookups.
Exemplar 2 — Tenant Service (Tenant Management)¶
Responsibilities
- Create/activate/suspend/delete tenants.
- Manage residency, directory, isolation level (pooled/schema/db).
- Seed defaults (edition, flags) and emit tenant.* events.
Exposed Interfaces
- REST (admin):
  - POST /api/tenants (create)
  - PATCH /api/tenants/{id} (suspend/resume)
  - POST /api/tenants/{id}:promote (upgrade isolation level)
- Events: tenant.created.v1, tenant.updated.v1, tenant.promoted.v1
Storage
- Azure SQL (authoritative tenant table, residency, isolation level).
- Redis (Tenant Directory cache).
Component Sketch
flowchart LR
API[Tenant API] --> APP[Policies / Directory]
APP --> REPO[Tenant Repository]
REPO --> SQL[(Azure SQL)]
APP --> OUTBOX["Outbox -> Bus"]
APP --> JOBS[Promotion Worker]
JOBS --> SQL
style API fill:#1f6,stroke:#0b5,color:#fff
Exemplar 3 — Billing Service (with Saga Orchestrator)¶
Responsibilities
- Manage plans/editions, subscriptions, invoicing.
- Integrate with payment providers via Payment ACL.
- Coordinate long-running billing flows (activation, retries, dunning) via Saga.
Exposed Interfaces
- REST (admin/tenant):
  - POST /api/billing/subscriptions (create/upgrade/downgrade)
  - GET /api/billing/subscriptions/{id}
  - POST /api/billing/invoices/{id}:pay
- Events (producer): billing.subscription.activated.v1, .suspended.v1, .invoiced.v1, .payment.received.v1
- Events (consumer): usage.meter.recorded.v1, tenant.created.v1
- Webhooks (outbound): signed invoice.created, payment.failed
Storage
- Azure SQL (subscriptions, invoices, ledger)
- Optional blob for invoice PDFs
Component Sketch (Saga)
flowchart TB
CMD[Billing API] --> SM[Subscription Saga]
SM --> ACL[Payment Provider ACL]
SM --> OUTBOX[Outbox]
SM --> SUBREPO[Subscription Repo]
ACL --> PAYEXT[(Payment Gateway)]
SUBREPO --> SQL[(Azure SQL)]
OUTBOX --> BUS[[Service Bus]]
BUS -->|usage.meter.recorded| SM
style SM fill:#f8a,stroke:#c06,color:#222
style BUS fill:#357,stroke:#234,color:#fff
Saga Flow (high-level)
- Receive subscription.create.
- Reserve plan; request payment via ACL.
- On success: persist, publish subscription.activated.
- On failure: retry (exponential backoff) → dunning → subscription.suspended.
Security & Tenancy
- All commands must include tenantId; ACL enforces signature verification with provider.
- Monetary amounts validated server-side; no client-trusted totals.
Observability
- Spans: billing.saga.step.*, payment.acl.request.
- Metrics: authorization approval rate, dunning success rate, event lag.
Exemplar 4 — Config & Feature Flags Service¶
Responsibilities
- Store and evaluate feature flags, edition overrides, kill switches.
- Provide low-latency decision APIs for UI/services.
- Broadcast changes as events for cache invalidation.
Exposed Interfaces
- REST:
  - GET /api/config/flags/{key} (evaluate with context)
  - POST /api/config/flags (create/update)
  - POST /api/config/overrides (tenant/edition-specific)
- Events (producer): config.flag.updated.v1, config.override.updated.v1
- Events (consumer): billing.subscription.activated.v1, tenant.created.v1
Storage
- Azure SQL (authoritative flag definitions, overrides)
- Redis (evaluation caches with tenant scoping)
Component Sketch
flowchart LR
API[Config API] --> APP[Evaluator/Policy Engine]
APP --> CACHE[(Redis)]
APP --> REPO[Config Repo]
REPO --> SQL[(Azure SQL)]
APP --> OUTBOX["Outbox -> Service Bus"]
style API fill:#19a974,stroke:#0e7a55,color:#fff
Security & Tenancy
- Mutations require config.manage + tenant_admin role.
- Evaluation requires authenticated context; anonymous evaluation disabled except for public flags.
Observability
- Spans: config.evaluate, config.invalidate.
- Metrics: p95 evaluation latency ≤ 5 ms, cache hit %, invalidation fanout time.
Interface Outlines (selected contracts)¶
Tenant API
- POST /api/tenants → 201 Created + tenant.created.v1
- PATCH /api/tenants/{id} → 200 OK + tenant.updated.v1
- POST /api/tenants/{id}:promote → 202 Accepted (async job)
Billing API
- POST /api/billing/subscriptions → 202 Accepted (saga) + events
- GET /api/billing/subscriptions/{id} → 200 OK
Config API
- GET /api/config/flags/{key}?tenantId=... → 200 OK { "value": true, "reason": "tenant-override" }
- POST /api/config/flags → 201 Created + config.flag.updated.v1
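The Config API response carries both a value and a reason such as tenant-override. A minimal sketch of how such a decision could be resolved is shown here. It is not the Flag Evaluator's actual implementation: the resolution order (tenant, then edition, then global), the reason strings other than tenant-override, the in-memory override store, and the sample tenant and flag names are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical override store: (scope, scopeId, flagKey) -> value.
var overrides = new Dictionary<(string Scope, string ScopeId, string FlagKey), bool>
{
    [("global", "*", "new-dashboard")] = false,
    [("edition", "enterprise", "new-dashboard")] = true,
    [("tenant", "acme", "new-dashboard")] = false,
};

// Assumed precedence: the most specific scope wins (tenant > edition > global).
(bool Value, string Reason) Evaluate(string flagKey, string tenantId, string edition)
{
    if (overrides.TryGetValue(("tenant", tenantId, flagKey), out var tenantValue))
        return (tenantValue, "tenant-override");
    if (overrides.TryGetValue(("edition", edition, flagKey), out var editionValue))
        return (editionValue, "edition-override");
    if (overrides.TryGetValue(("global", "*", flagKey), out var globalValue))
        return (globalValue, "global-default");
    return (false, "flag-not-found"); // fail closed when the flag is unknown
}

var decision = Evaluate("new-dashboard", "globex", "enterprise");
Console.WriteLine($"value={decision.Value} reason={decision.Reason}");
```

Returning the reason alongside the value keeps support and audit queries simple: a caller can tell whether a flag is on because of the tenant, the edition, or the global default.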
Solution Architect Notes¶
- Metadata API acts as upstream catalog: Billing, Config, and Identity enrichments must not hardcode editions or features.
- Always seed tenant entitlements from Metadata events → Config → Identity claims.
- Keep Metadata auditable and append-only where possible; edition/feature definitions should be traceable across time.
- Treat Metadata changes as high-risk operations; enforce RBAC + approval workflows.
Data & Storage Architecture¶
Purpose¶
Define a portable, Azure-first data architecture that supports multi-tenancy, high throughput, and auditability without locking products into one storage technology. Standardize relational vs document choices, partitioning strategies, read models, caching/search integration, and retention/archival so generated solutions can scale from trials to large enterprises with predictable cost and reliability.
Store Selection Principles¶
| Concern | Default Choice | Alternatives | Rationale |
|---|---|---|---|
| System-of-record, strong consistency | Azure SQL (PostgreSQL compatible alternative) | Managed Postgres/MySQL | ACID, schema control, transactional outbox |
| Large/variable payloads (templates, messages) | MongoDB (optional) | Cosmos DB (Mongo API), Azure Blob | Flexible schema, document access |
| Caching / fast eval (flags, sessions) | Redis | In-memory per pod (with eviction) | Low-latency, TTL, pub/sub invalidation |
| Search / discovery | Azure AI Search (opt-in) | Elastic/OpenSearch | Full-text, facets, suggesters |
| Analytics / cold storage | Blob Storage (Parquet) | ADLS Gen2 | Cheap, durable, columnar query via Spark/SQL |
| Event backbone | Azure Service Bus | Kafka | Durable pub/sub, topics, DLQs |
Rule of thumb: Write models → relational, large/optional payloads → document/blob, query flexibility → search/read models, analytics/retention → blob.
Logical Data Model (core entities)¶
erDiagram
TENANT ||--o{ USER : has
TENANT ||--o{ SUBSCRIPTION : owns
TENANT ||--o{ CONFIG_FLAG : configures
TENANT ||--o{ USAGE_RECORD : generates
TENANT ||--o{ AUDIT_EVENT : emits
PRODUCT ||--o{ EDITION : offers
EDITION ||--o{ ENTITLEMENT : contains
SUBSCRIPTION }o--|| EDITION : references
SUBSCRIPTION }o--|| TENANT : belongs_to
ENTITLEMENT }o--|| FEATURE : grants
USER {
string userId PK
string tenantId FK
string email
string roleSet
}
TENANT {
string tenantId PK
string name
string region
string isolationLevel "pooled|schema|database"
string status "active|suspended|deleted"
}
PRODUCT {
string productId PK
string name
}
EDITION {
string editionId PK
string productId FK
string name "Free|Standard|Enterprise|Custom"
}
FEATURE {
string featureId PK
string name
}
ENTITLEMENT {
string entitlementId PK
string editionId FK
string featureId FK
json limits
}
SUBSCRIPTION {
string subscriptionId PK
string tenantId FK
string editionId FK
datetime startDate
datetime endDate
string status "active|suspended|canceled"
}
CONFIG_FLAG {
string flagKey PK
string tenantId FK
json value
string scope "global|edition|tenant|user"
datetime updatedAt
}
USAGE_RECORD {
string usageId PK
string tenantId FK
string meter
bigint amount
datetime windowStart
int windowSec
}
AUDIT_EVENT {
string auditId PK
string tenantId FK
string actor
string action
string resource
datetime occurredAt
json details
}
Metadata (Product/Edition/Feature/Entitlement) drives Subscription and Config decisions; Usage feeds Billing; Audit is append-only.
Physical Partitioning & Isolation¶
Per-Context baseline
| Context | Physical Store | Partition Key | Secondary Partition | Notes |
|---|---|---|---|---|
| Identity / Tenant / Billing / Config | Azure SQL | tenantId | id per table | RLS/tenant guards or DAL filters |
| Notifications (payloads/templates) | MongoDB | tenantId | templateId | Large blob-like docs, TTL indices |
| Usage (raw) | Azure SQL (hot) + Blob (cold) | tenantId + time | meter | Hot window (≤ 90 days), compaction |
| Audit | Azure SQL (append-only) | tenantId + time | actor | Immutable; export to blob for eDiscovery |
Isolation levels
- Pooled (default): all tenants share schema, enforced by RLS/guards; indexed on (tenantId, <business key>).
- Schema-per-tenant: separate schema for hot tenants; connection factory selects schema based on Tenant Directory.
- Database-per-tenant: separate DB + network isolation for enterprise/regulatory cases.
Promotion triggers
- Hot partition (p99 latency, lock contention), data volume thresholds, contractual isolation, or regulatory residency.
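The routing implied by the three isolation levels can be sketched as a lookup from the Tenant Directory to a physical target. This is only an illustration under stated assumptions: the directory is an in-memory dictionary, and the database and schema names (saas_shared, tenant_<id>, saas_<id>) and tenant IDs are hypothetical placeholders rather than platform conventions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical Tenant Directory snapshot: tenantId -> isolation level.
var directory = new Dictionary<string, string>
{
    ["acme"] = "pooled",
    ["globex"] = "schema",
    ["initech"] = "database",
};

// Returns the (database, schema) pair a repository should target for a tenant.
(string Database, string Schema) ResolveTarget(string tenantId)
{
    if (!directory.TryGetValue(tenantId, out var level))
        throw new KeyNotFoundException($"Unknown tenant '{tenantId}'");

    return level switch
    {
        "pooled" => ("saas_shared", "dbo"),                 // shared schema; RLS/guards filter rows
        "schema" => ("saas_shared", $"tenant_{tenantId}"),  // dedicated schema in the shared database
        "database" => ($"saas_{tenantId}", "dbo"),          // dedicated database
        _ => throw new InvalidOperationException($"Unsupported isolation level '{level}'"),
    };
}

var target = ResolveTarget("globex");
Console.WriteLine($"{target.Database}.{target.Schema}");
```

Because the decision depends only on the directory entry, promoting a tenant becomes a data change plus a cutover, with no code change in the services that call the resolver.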
Read Models & CQRS¶
- Write models: normalized, transactional tables per bounded context.
- Read models: denormalized projections built from events for UX queries and support dashboards (e.g., “tenant overview,” “subscription health,” “usage summary”).
- Projections: idempotent consumers with inbox and checkpointing; rebuild on demand.
- Search adapters: project to Azure AI Search indices (e.g., tenants, invoices, audit summaries) with indexer jobs and soft deletes.
Indexing & Query Patterns¶
- Composite indices: (tenantId, naturalKey) as leading index across write tables.
- Time-series: USAGE_RECORD (tenantId, windowStart DESC) for sliding windows; partitioned aggregation tables for hourly/daily rollups.
- Audit: (tenantId, occurredAt DESC) covering actor, action, resource.
- Avoid cross-tenant joins; analytics should use aggregations in a separate pipeline.
Caching Strategy¶
| Cache | Scope & Key | Invalidation | TTL |
|---|---|---|---|
| Tenant Directory | tenant:{tenantId}:dir | on tenant.updated/promoted | 5–30s |
| Flag Evaluation | flag:{tenantId}:{flagKey} | config.flag.updated | 30–120s |
| Entitlement Snapshot | ent:{tenantId}:{editionId} | billing.subscription.* / metadata.* | 5m |
| OIDC JWKs & Metadata | idp:jwks | rotation events | 15m |
- Prefer cache-aside; never cache PII without encryption.
- Use Redis hash keys for compact, multi-field storage and partial invalidation.
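A small sketch of the cache-aside pattern with tenant-scoped keys and event-driven invalidation may help make the table concrete. The dictionary here is an in-memory stand-in for Redis, and the 20% TTL jitter, the helper names, and the sample tenant and flag are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-in for Redis: key -> (value, absolute expiry).
var cache = new Dictionary<string, (string Value, DateTimeOffset ExpiresAt)>();
var rng = new Random();
var loads = 0; // counts trips to the system of record

// Key format taken from the Flag Evaluation row above.
string FlagCacheKey(string tenantId, string flagKey) => $"flag:{tenantId}:{flagKey}";

// Base TTL plus up to 20% random jitter so entries do not expire in lockstep.
TimeSpan JitteredTtl(TimeSpan baseTtl) =>
    baseTtl + TimeSpan.FromMilliseconds(baseTtl.TotalMilliseconds * 0.2 * rng.NextDouble());

// Cache-aside: read the cache first, load from the source on a miss, then populate.
string GetOrLoad(string key, TimeSpan baseTtl, Func<string> loadFromSource, DateTimeOffset now)
{
    if (cache.TryGetValue(key, out var entry) && entry.ExpiresAt > now) return entry.Value;
    var value = loadFromSource();
    cache[key] = (value, now + JitteredTtl(baseTtl));
    return value;
}

// Event-driven invalidation, for example when config.flag.updated arrives.
void Invalidate(string key) => cache.Remove(key);

var t0 = DateTimeOffset.UtcNow;
var cacheKey = FlagCacheKey("acme", "new-dashboard");
GetOrLoad(cacheKey, TimeSpan.FromSeconds(60), () => { loads++; return "true"; }, t0);
GetOrLoad(cacheKey, TimeSpan.FromSeconds(60), () => { loads++; return "true"; }, t0);
Console.WriteLine($"source loads: {loads}");
```

The second read is served from the cache, so the source is loaded once. Calling Invalidate(cacheKey) on a change event forces the next read back to the system of record, and the TTL acts only as a safety net for missed events.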
Data Retention & Archival¶
| Data Class | Hot Retention | Cold Retention | Storage | Notes |
|---|---|---|---|---|
| Usage (raw) | 90 days | 2 years (Parquet) | Blob | Aggregated hourly/daily kept hot |
| Audit | 1 year hot | 7 years cold | SQL + Blob | Immutable, exportable bundles |
| Notifications payloads | 30 days | N/A | Mongo | Store minimal PII; tokenize where possible |
| Invoices/PDFs | 1 year hot | 7 years cold | SQL + Blob | Legal/compliance governed |
| Config history | 180 days | N/A | SQL | Versioned changes for debugging |
| Tokens/session | 24h | N/A | Redis | No PII; rotate frequently |
Retention windows are edition- and region-configurable; legal holds suspend deletion jobs.
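The retention table and the legal-hold rule can be read as a small decision function. The sketch below is one possible interpretation, not the platform's deletion job: it assumes the cold window starts when the hot window ends, models "N/A" as a null cold retention, and uses hypothetical action names.

```csharp
using System;

// Decide where a record belongs given its age and the policy for its data class.
// coldRetention == null models "N/A" (no cold tier) in the table above.
string RetentionAction(TimeSpan age, TimeSpan hotRetention, TimeSpan? coldRetention, bool legalHold)
{
    if (age <= hotRetention) return "keep-hot";
    if (coldRetention is TimeSpan cold && age <= hotRetention + cold) return "archive-cold";
    return legalHold ? "retain-on-hold" : "delete"; // legal holds suspend deletion
}

// Usage (raw): 90 days hot, 2 years cold.
var usageHot = TimeSpan.FromDays(90);
var usageCold = TimeSpan.FromDays(730);
Console.WriteLine(RetentionAction(TimeSpan.FromDays(30), usageHot, usageCold, false));
Console.WriteLine(RetentionAction(TimeSpan.FromDays(400), usageHot, usageCold, false));
Console.WriteLine(RetentionAction(TimeSpan.FromDays(3000), usageHot, usageCold, true));
```

Passing the windows in as parameters, rather than hardcoding them, matches the requirement that retention be configurable per edition and region.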
Backup, DR & Consistency¶
- SQL: PITR enabled; geo-redundant backups; weekly full + daily diff + 5-min log backups (policy baseline).
- Mongo: point-in-time snapshots; validate TTL indexes.
- Blob: versioning and soft-delete enabled; lifecycle rules to move to cool/archive tiers.
- RPO/RTO: platform baseline RPO ≤ 15 min, RTO ≤ 4 h (overrides for Enterprise).
- Consistency: outbox ensures atomic write + event; consumers are at-least-once, idempotent.
Security & Privacy¶
- Encryption at rest: TDE for SQL, SSE for Blob, disk encryption for Mongo; CMEK where required.
- Encryption in transit: TLS 1.2/1.3; mTLS inside cluster; Private Link for data stores.
- PII minimization: store only necessary attributes; logs are redacted at sink.
- Row-Level Security (RLS): preferred; otherwise enforce DAL guards and property-based tests.
- Secrets: no inline secrets in tables; use Key Vault references; rotate keys ≤ 90 days.
Data Lifecycle & Governance¶
- Schemas as code: migrations via EF Core/NHibernate + migration approvals (gated in CI).
- CDC: used for online migrations, projections rebuilds, and promotion (pooled→schema/db).
- Data quality checks: constraints + lightweight DQ jobs (nullability, referential integrity, outliers).
- Change review: ADR required for breaking schema changes; contract tests validate read models.
- Right to erasure: orchestrated delete with tombstones; audit holds tokenized references.
Example Physical Topology (simplified)¶
flowchart LR
subgraph Hot Path
SQL[(Azure SQL\nWrite Models)]
REDIS[(Redis Cache)]
BUS[[Service Bus]]
end
subgraph Projections
CONSUMER["Projectors (Inbox/Idempotent)"]
READDB[(SQL Read Models)]
SEARCH[(Azure AI Search)]
end
subgraph Cold Path
BLOB[(Blob Storage\nParquet/Exports)]
ANALYTICS[(Spark/SQL)]
end
BUS --> CONSUMER
CONSUMER --> READDB
CONSUMER --> SEARCH
CONSUMER --> BLOB
SQL <-->|cache-aside| REDIS
Observability & SLOs¶
- Signals: DB p95 latency, lock wait time, failed migrations, cache hit %, projection lag, index health, DLQ depth for projectors.
- SLOs:
  - Write p95 ≤ 350 ms; Read p95 ≤ 200 ms
  - Projection lag p95 ≤ 60 s
  - Cache hit ≥ 85% for flag evaluation
  - Backup success 100%; restore drill quarterly
Failure Modes & Mitigations¶
| Failure | Impact | Mitigation |
|---|---|---|
| Hot partition / noisy neighbor | Latency spikes | Promote tenant to schema/DB; shard by tenant; add covering indexes |
| Long-running transactions | Lock contention | Break writes into smaller batches; use optimistic concurrency |
| Projection backlog | Stale read models | KEDA-scale projectors; partial rebuild by tenant; prioritize critical topics |
| Cache stampede | Thundering herd | Request coalescing; jittered TTL; background refresh |
| Schema drift | Consumer breaks | Contract tests; additive changes; deprecation windows |
| Data corruption | Incident/rollback | PITR restore to side DB; compare via checksums; rehydrate projections |
Solution Architect Notes¶
- Start with pooled SQL and event-driven projections; add document/search only for proven needs.
- Keep natural keys stable to enable safe migration/rebuilds.
- Make retention a product setting—not hardcoded—so legal/compliance overlays can adjust.
- Prefer append-only (Audit/Usage) plus compaction for analytics; avoid destructive changes in hot paths.
Messaging & Integration Patterns¶
Purpose¶
Standardize asynchronous communication across the platform using MassTransit with Azure Service Bus (ASB). Define topologies, routing conventions, retry/backoff/jitter, saga orchestration vs choreography, and compensation so services remain loosely coupled, reliable, and observable under load.
Topology & Conventions¶
Domain-first topology
- Topic-per-domain: tenant-events, billing-events, config-events, usage-events, identity-events, notifications-events, audit-events.
- Subscription-per-consumer: one subscription per logical consumer service (optionally per tenant segment or feature).
- DLQ-per-subscription: automatic dead-letter queues with invariant retention.
Naming
- Exchange/Topic: <domain>-events
- Subscription: <consumer-svc>.<purpose> (e.g., billing-svc.rating, config-svc.invalidate)
- Queue (commands): <svc>-cmd (optional; we favor events over commands across contexts)
- Saga state store tables: <svc>_saga_<name>
Message envelopes (recap)
- Required headers: type, id, time, traceId, tenantId, edition, schemaVersion, partitionKey, key (idempotency).
- Partitioning: tenantId as default partitionKey to increase locality and ordering per tenant.
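An envelope check over the required headers could look roughly like the sketch below. It is an illustration only: the header list comes from the text above, while the function name, the violation messages, and the notion of an expected tenant (for example, one resolved from the authenticated producer or from the payload) are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var requiredHeaders = new[]
{
    "type", "id", "time", "traceId", "tenantId", "edition", "schemaVersion", "partitionKey", "key",
};

// Returns the list of violations; an empty list means the envelope is acceptable.
List<string> ValidateEnvelope(IReadOnlyDictionary<string, string> headers, string expectedTenantId)
{
    var violations = requiredHeaders
        .Where(h => !headers.TryGetValue(h, out var v) || string.IsNullOrWhiteSpace(v))
        .Select(h => $"missing header '{h}'")
        .ToList();

    // A present but different tenantId is treated as a cross-tenant leakage risk.
    if (headers.TryGetValue("tenantId", out var tenantId)
        && !string.IsNullOrWhiteSpace(tenantId)
        && tenantId != expectedTenantId)
    {
        violations.Add("tenantId does not match the expected tenant context");
    }

    return violations;
}

var sample = new Dictionary<string, string>
{
    ["type"] = "tenant.created.v1", ["id"] = "evt-1", ["time"] = "2024-01-01T00:00:00Z",
    ["traceId"] = "trace-1", ["tenantId"] = "acme", ["edition"] = "standard",
    ["schemaVersion"] = "1", ["partitionKey"] = "acme", ["key"] = "tenant:acme",
};
Console.WriteLine($"violations: {ValidateEnvelope(sample, "acme").Count}");
```

Collecting all violations instead of failing on the first one gives the audit trail and the dead-letter payload a complete picture of what was wrong with the message.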
MassTransit Setup Patterns (C# excerpts)¶
Bus configuration (ASB + outbox)
services.AddMassTransit(x =>
{
    x.SetKebabCaseEndpointNameFormatter();

    // Transactional outbox: messages are stored with the business write and delivered afterwards.
    x.AddEntityFrameworkOutbox<AppDbContext>(o =>
    {
        o.QueryDelay = TimeSpan.FromSeconds(1);
        o.DuplicateDetectionWindow = TimeSpan.FromMinutes(10);
        o.UseBusOutbox();
    });

    // Consumers, Sagas, Activities
    x.AddConsumersFromNamespaceContaining<TenantCreatedConsumer>();
    x.AddSagaStateMachine<SubscriptionSaga, SubscriptionState>()
        .EntityFrameworkRepository(r =>
        {
            r.ConcurrencyMode = ConcurrencyMode.Optimistic;
            r.ExistingDbContext<AppDbContext>(); // reuse the application DbContext for saga state
        });

    x.UsingAzureServiceBus((context, cfg) =>
    {
        cfg.Host(builder.Configuration["ServiceBus:ConnectionString"]);
        cfg.MessageTopology.SetEntityNameFormatter(new DomainTopicFormatter());

        // Up to 5 retries; 200 ms minimum interval, 10 s maximum interval, 50 ms interval delta.
        cfg.UseMessageRetry(r =>
            r.Exponential(5, TimeSpan.FromMilliseconds(200), TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(50)));
        cfg.UseInMemoryOutbox(context); // hold outgoing messages until the consumer completes successfully
        cfg.UseConcurrencyLimit(64);
        cfg.ConfigureEndpoints(context);
    });
});
Consumer template (idempotent + inbox)
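public class TenantCreatedConsumer : IConsumer<TenantCreated>
{
    private readonly Inbox _inbox; // dedupe store keyed by message id

    public TenantCreatedConsumer(Inbox inbox) => _inbox = inbox;

    public async Task Consume(ConsumeContext<TenantCreated> ctx)
    {
        // Delivery is at-least-once: skip messages that were already handled.
        if (!await _inbox.TryBeginAsync(ctx.Message.Id)) return;
        try
        {
            // side-effect safe work
            await HandleAsync(ctx.Message);
            await _inbox.CompleteAsync(ctx.Message.Id);
        }
        catch (Exception ex)
        {
            await _inbox.FailAsync(ctx.Message.Id, ex);
            throw; // allow retry policy to engage
        }
    }

    // Placeholder for the domain logic; must be idempotent.
    private Task HandleAsync(TenantCreated message) => Task.CompletedTask;
}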
Retry, Backoff & Jitter¶
| Scenario | Policy | Max Attempts | Initial Delay | Max Delay | Notes |
|---|---|---|---|---|---|
| Transient network | Exponential + jitter | 5 | 200 ms | 10 s | Default consumer policy |
| Rate-limited upstream | Decorrelated jitter | 6 | 500 ms | 30 s | Honor Retry-After |
| Idempotent publish | Linear | 3 | 2 s | 6 s | Outbox ensures once-per-change |
| External webhooks | Exponential + cap | 8 | 1 s | 5 min | Move to DLQ after cap |
| Payment ACL | Saga step specific | 4 | 1 s | 60 s | Backoff grows per failure stage |
Rules
- Retry only idempotent operations; non-idempotent steps must be guarded by saga state.
- Add small random jitter to prevent thundering herds.
- After retries exhausted → DLQ with full context and last exception chain.
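To make the delay arithmetic behind the table concrete, the sketch below computes capped exponential backoff with additive jitter and a decorrelated-jitter variant. These are generic formulas written for illustration; they are not MassTransit's internal retry calculation, and the function names are hypothetical.

```csharp
using System;

// Capped exponential backoff with additive jitter.
// attempt is 1-based: delay = min(max, initial * 2^(attempt - 1)) + random(0..jitter).
TimeSpan ExponentialDelay(int attempt, TimeSpan initial, TimeSpan max, TimeSpan jitter, Random rng)
{
    var exponential = initial.TotalMilliseconds * Math.Pow(2, attempt - 1);
    var capped = Math.Min(exponential, max.TotalMilliseconds);
    return TimeSpan.FromMilliseconds(capped + rng.NextDouble() * jitter.TotalMilliseconds);
}

// Decorrelated jitter: next = min(max, random(initial .. previous * 3)).
TimeSpan DecorrelatedDelay(TimeSpan previous, TimeSpan initial, TimeSpan max, Random rng)
{
    var upper = Math.Max(initial.TotalMilliseconds, previous.TotalMilliseconds * 3);
    var next = initial.TotalMilliseconds + rng.NextDouble() * (upper - initial.TotalMilliseconds);
    return TimeSpan.FromMilliseconds(Math.Min(next, max.TotalMilliseconds));
}

// Default consumer policy from the table: 5 attempts, 200 ms initial, 10 s cap.
var random = new Random();
for (var i = 1; i <= 5; i++)
{
    var delay = ExponentialDelay(i, TimeSpan.FromMilliseconds(200), TimeSpan.FromSeconds(10),
        TimeSpan.FromMilliseconds(50), random);
    Console.WriteLine($"attempt {i}: about {delay.TotalMilliseconds:F0} ms");
}
```

With these parameters the base delays are 200, 400, 800, 1600 and 3200 ms, each with up to 50 ms of jitter added, so a burst of failing consumers does not retry in lockstep.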
Orchestration vs Choreography¶
| Pattern | When to use | Mechanism | Pros | Cons |
|---|---|---|---|---|
| Choreography | Independent reactions to a fact (e.g., tenant.created) | Events only | Simple, scalable, low coupling | Harder to visualize global flow |
| Orchestration | Multi-step, long-running business process (e.g., subscription activation) | Saga coordinates steps | Centralized state, compensations explicit | Orchestrator coupling; needs strong tests |
Guideline: Prefer choreography for enrichment and projections. Use sagas only for business-critical, multi-step flows with compensations (payments, migrations).
Saga Orchestration (Billing Example)¶
State machine outline
stateDiagram-v2
[*] --> Pending
Pending --> Authorizing : command.received
Authorizing --> Active : payment.captured
Authorizing --> Dunning : payment.failed(retry_exhausted)
Dunning --> Suspended : dunning.failed
Active --> Suspended : subscription.payment.overdue
Suspended --> Active : payment.captured
Active --> [*]
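The diagram can also be read as a transition table. The following sketch transcribes it into a lookup, as a plain illustration rather than the MassTransit state machine itself. Two simplifications are assumptions: payment.failed(retry_exhausted) is shortened to payment.failed, and unknown state/event pairs are ignored instead of raising an error.

```csharp
using System;
using System.Collections.Generic;

// Transition table transcribed from the state diagram: (state, event) -> next state.
var transitions = new Dictionary<(string State, string Event), string>
{
    [("Pending", "command.received")] = "Authorizing",
    [("Authorizing", "payment.captured")] = "Active",
    [("Authorizing", "payment.failed")] = "Dunning",
    [("Dunning", "dunning.failed")] = "Suspended",
    [("Active", "subscription.payment.overdue")] = "Suspended",
    [("Suspended", "payment.captured")] = "Active",
};

// Unknown pairs leave the state unchanged, so a redelivered or out-of-order
// event does not move the saga a second time.
string Next(string state, string evt) =>
    transitions.TryGetValue((state, evt), out var next) ? next : state;

var current = "Pending";
foreach (var e in new[] { "command.received", "payment.captured", "payment.captured" })
    current = Next(current, e);
Console.WriteLine(current);
```

The duplicate payment.captured at the end has no effect because there is no transition for it from Active, which is the table-level counterpart of correlating by subscriptionId and applying each event once.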
Key design
- Idempotency: Correlate by subscriptionId (saga key). Each event mutates state exactly once.
- Compensation: If invoice created but payment fails → issue credit note, revert entitlements, emit billing.subscription.suspended.
- Timeouts: Each step has a receive timeout; when exceeded, move to next compensating step (e.g., dunning).
Activities (MassTransit)
- ReservePlanActivity → RequestPaymentActivity → ActivateEntitlementsActivity
- On failure path: IssueCreditActivity → SuspendSubscriptionActivity
Compensation Patterns¶
| Failure | Compensation | Notes |
|---|---|---|
| Payment captured but entitlements not activated | Refund/credit note, revoke token grants | Ensure idempotent credit issuance |
| Tenant promoted but mapping not switched | Roll back mapping; keep dual-writes; retry cutover | Feature flag tenant.readonly protects |
| Email sent to wrong template | Send corrective message, mark original as superseded | Immutable log kept in Audit |
| Usage over-reported | Emit usage.adjustment event; recompute invoice | Maintain adjustment ledger |
Technique
- Compensations are first-class commands/events with their own audit entries.
- No “delete-and-forget”; always append corrective facts.
Error Handling & DLQ Strategy¶
Handler contract
- Validate envelope invariants (tenant, trace, type) first; reject missing or mismatched context (SEV-1 if produced internally).
- Side effects must be wrapped with transaction boundaries; record idempotency outcome.
Dead-lettering
- Criteria: max delivery count exceeded, non-transient exceptions (validation, authorization), poison messages (schema mismatch).
- DLQ payload: original message + headers + last exception + handler name + attempt count.
- Replay: operator-driven tool with safe-mode (dry-run), rate limiters, circuit breakers, and quarantine on re-poisoning.
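The dead-letter criteria can be expressed as a small classification step. This sketch is an assumption-heavy illustration: the mapping from .NET exception types to "validation", "authorization" and "schema mismatch" is hypothetical, and a real handler would more likely use its own domain exception types.

```csharp
using System;

// Decide what happens to a failed delivery, following the criteria above.
string Classify(Exception error, int deliveryAttempt, int maxDeliveryCount)
{
    // Assumed mapping of non-transient failures to exception types.
    var nonTransient = error is ArgumentException          // validation failure
                    || error is UnauthorizedAccessException // authorization failure
                    || error is FormatException;            // schema mismatch (poison message)

    if (nonTransient) return "dead-letter";
    return deliveryAttempt >= maxDeliveryCount ? "dead-letter" : "retry";
}

Console.WriteLine(Classify(new TimeoutException(), 1, 5));
Console.WriteLine(Classify(new TimeoutException(), 5, 5));
Console.WriteLine(Classify(new FormatException("unknown schemaVersion"), 1, 5));
```

Sending non-transient failures straight to the DLQ avoids spending the retry budget on messages that cannot succeed, while transient failures still get the full number of attempts.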
Monitoring
- Metrics: subscription lag, handler error rate, DLQ depth/age, saga timeout count.
- Alerts: threshold breaches trigger runbooks (scale consumers, pause producers, enable backpressure at gateway).
Integration Patterns (edge & third parties)¶
- Webhooks (outbound): Signed (HMAC), retry with backoff up to 24h, idempotency via Event-Id, age limit (drop after TTL).
- Inbound third-party callbacks: Terminate at Gateway; validate signature & age; enqueue to inbox queue for processing.
- Payment ACL: Isolate providers’ SDKs; map transient vs permanent failures; unify errors to domain codes.
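A signed webhook with an age limit and Event-Id idempotency might be verified along the lines of the sketch below. The signed string layout (timestamp.eventId.body), the hex encoding, the five-minute age limit, and the in-memory set of seen IDs are all assumptions for illustration; the document does not specify the wire format.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Signature covers "<timestamp>.<eventId>.<body>" so no field can be swapped independently.
string Sign(string secret, long unixSeconds, string eventId, string body)
{
    using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(secret));
    var mac = hmac.ComputeHash(Encoding.UTF8.GetBytes($"{unixSeconds}.{eventId}.{body}"));
    return Convert.ToHexString(mac);
}

// Receiver side: reject stale deliveries, then compare signatures in constant time.
bool Verify(string secret, long unixSeconds, string eventId, string body,
            string signature, long nowUnixSeconds, long maxAgeSeconds)
{
    if (Math.Abs(nowUnixSeconds - unixSeconds) > maxAgeSeconds) return false;
    var expected = Encoding.UTF8.GetBytes(Sign(secret, unixSeconds, eventId, body));
    var actual = Encoding.UTF8.GetBytes(signature);
    return CryptographicOperations.FixedTimeEquals(expected, actual);
}

// Idempotency via Event-Id: a verified event is processed only the first time it is seen.
var seenEventIds = new HashSet<string>();
bool Accept(string secret, long unixSeconds, string eventId, string body,
            string signature, long nowUnixSeconds) =>
    Verify(secret, unixSeconds, eventId, body, signature, nowUnixSeconds, 300)
    && seenEventIds.Add(eventId);

var sig = Sign("demo-secret", 1_700_000_000, "evt-1", "{\"invoice\":\"inv-9\"}");
Console.WriteLine(Accept("demo-secret", 1_700_000_000, "evt-1", "{\"invoice\":\"inv-9\"}", sig, 1_700_000_010));
Console.WriteLine(Accept("demo-secret", 1_700_000_000, "evt-1", "{\"invoice\":\"inv-9\"}", sig, 1_700_000_020));
```

The first delivery is accepted and the retry of the same Event-Id is rejected as a duplicate, which lets the sender retry aggressively without the receiver applying the event twice.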
Observability¶
- Spans: publish, consume, saga.step, saga.compensate, webhook.request, webhook.retry.
- Attributes: tenantId, type, key, sagaId, deliveryAttempt, queue, subscription.
- Logs: structured with exception chains; no PII. Include producer.service, consumer.service.
- Metrics: end-to-end event lag p95, publish success rate, handler retries, DLQ age p95, saga step durations.
Performance & Scalability¶
- KEDA triggers on ASB metrics (queue length, lag); scale consumers horizontally.
- Use prefetch and concurrency limits tuned per handler (e.g., heavy CPU vs I/O bound).
- For hot tenants, prefer per-tenant subscriptions/queues to isolate and prioritize critical customers.
Security & Tenancy¶
- Tenant scoping: reject events missing tenantId; never emit cross-tenant payloads unless flagged as analytics and routed separately.
- mTLS inside the cluster; ASB credentials via workload identity.
- Least privilege SAS/RBAC roles per consumer/producer; rotate keys ≤ 90 days.
- Data minimization: events should reference entities, not embed sensitive data.
Solution Architect Notes¶
- Use outbox everywhere—it’s the linchpin of dependable messaging.
- Keep sagas lean and deterministic; external calls go through activities/ACLs with clear retry/timeout semantics.
- Focus on idempotency: it’s cheaper to ensure than to diagnose duplicates in production.
- Make DLQ a first-class workflow with replay tooling and tight observability—assume it will be used.
AI-First & Agentic Orchestration¶
Purpose¶
Embed safe, deterministic AI assistance into the SaaS Factory to accelerate planning, scaffolding, documentation, tests, and operational hygiene—without bypassing security, change control, or human judgment. Agents propose and scaffold; humans own approval and deployment. All agent actions are audited, observable, and reversible.
Agent Roles (factory-internal)¶
| Agent | Primary Outcomes | Typical Triggers | Key Outputs |
|---|---|---|---|
| Product Blueprint Agent | Turn a product idea/recipe into an initial blueprint aligned to platform patterns | New product request; edition/pack change | HLD skeleton, context map updates, ADR drafts |
| Service Scaffolder Agent | Create service projects from templates (API/Worker/Saga), wiring tenancy, OTel, health, outbox | “Add service” request; new bounded context | Repo branches/PRs with solution scaffold, CI pipeline YAML |
| Contract & SDK Agent | Generate/validate OpenAPI & event schemas, produce language SDKs | New/changed endpoints or events | contracts/*.yaml/json, SDK packages, contract tests |
| Test Generator Agent | Propose unit/contract/E2E tests; synth checks for SLOs | New feature PRs; failing SLOs | Test projects, synthetic monitors |
| Docs & Runbook Agent | Produce developer and operator docs; incident runbooks | New service or ADR; post-incident tasks | docs/*, ops/runbooks/* |
| Operability Agent | Create dashboards/alerts, SLOs, chaos experiments | New service onboard; SLO drift | Grafana dashboards, alert rules, Chaos experiments |
| Security & Compliance Agent | Enforce guardrails (SBOM, SAST/SCA, secret scans), suggest remediations | PR validation; dependency changes | Policy reports, license notices, PR comments |
Optional, tenant-facing assistants (e.g., support Q&A) are separate products behind strong data-isolation and are not assumed by the factory baseline.
Skills & Tooling (curated, least-privilege)¶
| Skill/Tool | Scope (allow-listed) | Notes |
|---|---|---|
| Template Engine | Read /factory/templates/**; write to feature branch | No direct writes to main |
| Repo API | Create branches/PRs; comment on PR; no force-push | PR labels must indicate “ai-generated” |
| Contract Linter | Validate OpenAPI/event schemas, versioning | Fails on breaking changes without ADR |
| MassTransit/Outbox Scaffolder | Wire messaging boilerplate | Enforces outbox/inbox and OTel |
| IaC Generator (Bicep/Pulumi) | Produce env-scoped stacks | Read-only cloud; deploy only via pipeline |
| Policy Gate Runner | SAST/SCA, license, SBOM, secret scan | Blocks PR if violations found |
| Observability Pack | OTel wiring, dashboard JSON, alert rules | Requires SLO metadata |
| Doc/Runbook Composer | Create/update Markdown and Mermaid | Must include change rationale and rollbacks |
All tools are invoked through Semantic Kernel with capabilities/RBAC matching the agent role. No tool exposes secrets or production tokens to agents.
Determinism & Safety¶
- Model & prompting discipline: temperature ≈ 0, pinned models, structured prompts with explicit acceptance criteria.
- Deterministic artifacts: agents produce diff-minimal changes; every artifact carries an x-origin: ai/<agent>/<hash> footer.
- Reproducibility: inputs (recipe, prompts, params) logged; artifacts hashed; “re-run with same seed” supported.
- Human-in-the-loop: agents cannot merge; required human review + green policy gates.
- Data minimization: agents read only non-PII source; no tenant data; redaction in logs.
- No live mutations: agents never call production APIs; all changes flow via PR → CI → deploy.
- Denylist/Allowlist: explicit denied actions (e.g., dropping DB tables); tools must guard server-side.
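The x-origin footer and the reproducibility hash check can be illustrated with a short sketch. The choice of SHA-256, the line-ending normalization, and the 12-character hash prefix are assumptions; the document only specifies the footer shape x-origin: ai/<agent>/<hash>.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hash over normalized content so that line-ending differences do not change the result.
string ArtifactHash(string content)
{
    var normalized = content.Replace("\r\n", "\n");
    var digest = SHA256.HashData(Encoding.UTF8.GetBytes(normalized));
    return Convert.ToHexString(digest).ToLowerInvariant().Substring(0, 12); // short prefix (assumed length)
}

// Footer in the shape described above: x-origin: ai/<agent>/<hash>.
string OriginFooter(string agent, string content) =>
    $"x-origin: ai/{agent}/{ArtifactHash(content)}";

// Reproducibility check: a re-run with the same inputs must yield the same hash.
bool IsReproducible(string firstRun, string secondRun) =>
    ArtifactHash(firstRun) == ArtifactHash(secondRun);

var artifact = "# Tenant Service\nGenerated scaffold.\n";
Console.WriteLine(OriginFooter("service-scaffolder", artifact));
Console.WriteLine(IsReproducible(artifact, artifact.Replace("\n", "\r\n")));
```

Normalizing before hashing keeps the check focused on content, so the same artifact checked out on Windows and Linux still passes the hash-match SLO.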
Orchestration Flow (Semantic Kernel)¶
sequenceDiagram
participant PO as Product Owner
participant Orchestrator as SK Orchestrator
participant Blueprint as Blueprint Agent
participant Scaffolder as Service Scaffolder
participant Contracts as Contract & SDK Agent
participant Tests as Test Generator Agent
participant Ops as Operability Agent
participant Repo as Git/PR
participant CI as CI/CD Pipeline
PO->>Orchestrator: "New product/service recipe"
Orchestrator->>Blueprint: Plan HLD/ADRs from recipe
Blueprint-->>Orchestrator: HLD diff & ADR drafts
Orchestrator->>Scaffolder: Generate service skeleton(s)
Scaffolder->>Repo: Open PR with code + pipelines
Orchestrator->>Contracts: Generate/validate contracts
Contracts->>Repo: Commit schemas + contract tests
Orchestrator->>Tests: Add unit/contract/E2E tests
Tests->>Repo: Commit tests
Orchestrator->>Ops: OTel wiring, dashboards, alerts
Ops->>Repo: Commit observability pack
Repo->>CI: PR checks (SAST/SCA, SBOM, tests, policies)
CI-->>Repo: Status (pass/fail)
Note over Repo,PO: Human reviews; merge if green
Guardrails & Policies¶
- Identity & RBAC: agents authenticate as service principals with minimal scopes; actions are auditable and reversible.
- Policy gates: PRs must pass security scans, contract tests, SLO linters, and governance checks (ADR present for major changes).
- Content safety: prompt-injection filters, tool-call allowlists, and output sanitizers (no secrets, no PII).
- Rate & cost controls: per-agent quotas; budget alerts; offline mode fallback.
- Rollbacks: every change includes a generated rollback playbook and revert.sh script when applicable.
Observability & SLOs (AI operations)¶
| Signal | Target | Rationale |
|---|---|---|
| PR acceptance rate (ai-generated) | ≥ 80% | Indicates useful, review-ready output |
| Policy gate pass rate | ≥ 95% | Low violation rate |
| Mean time to scaffold service | ≤ 10 min | Responsiveness |
| Post-merge incident rate attributable to AI changes | 0 | Safety baseline |
| Reproducibility check (hash match) | 100% | Determinism |
Traces include agentId, tool, operation, repo, branch, artifactHash. Logs exclude secrets and PII.
Failure Modes & Mitigations¶
| Failure | Symptom | Mitigation |
|---|---|---|
| Hallucinated API/contract | PR fails contract lint | Contract Agent uses ground-truth schemas; block merge |
| Unsafe infra change | Policy gate fails | IaC policies (OPA/Conftest/Azure Policy) block; generate safer alt |
| Non-deterministic output | Hash mismatch | Pin model/version; lower temperature; freeze template version |
| Over-scaffolding (bloat) | Large diff, unclear value | Orchestrator trims plan; human prompt to narrow scope |
| Tool misuse | Unauthorized API calls | Tool RBAC + server-side enforcement; audit & revoke token |
Solution Architect Notes¶
- Treat the Orchestrator as a planner, not a do-everything agent; delegate to small, single-purpose agents with narrow tools.
- Never let agents bypass PR review or production deployment pipelines.
- Prefer small, composable PRs: one agent outcome per PR for easier review and rollback.
- Make agent outputs teachable: include rationale, caveats, and links to standards so humans learn and trust the system.
Observability by Design¶
Purpose¶
Establish a uniform, always-on telemetry pipeline built on OpenTelemetry for traces, metrics, and logs. Every service, job, and gateway must emit consistent signals enriched with multi-tenant context to enable SLO monitoring, fast incident response, and data-driven improvements. Observability is a non-removable guardrail.
Telemetry Architecture¶
Pipeline (high level)
- Instrumentation: .NET OTel SDK in every service (HTTP, gRPC, MassTransit, SQL, Redis, custom spans).
- Export: OTLP → Collector (agent/sidecar/daemonset) →
  - Traces: Tempo/Jaeger (or Azure Monitor/OpenTelemetry Distro)
  - Metrics: Prometheus (scraped from Collector or services) → Grafana
  - Logs: Structured JSON → Loki/Elastic/Azure Monitor Logs
- Dashboards & Alerts: Grafana + Alertmanager (or Azure Monitor Alerts).
- Correlation: W3C tracecontext (traceparent, tracestate) propagated end-to-end (gateway ↔ services ↔ jobs).
flowchart LR
APP["Apps & Jobs (.NET + OTel)"] --> COL[OTel Collector]
GATE[Edge Gateway] --> COL
BUS[Service Bus Consumers] --> COL
COL --> TRC[(Traces)]
COL --> MET[(Metrics)]
COL --> LOGS[(Logs)]
TRC --> GRAF[Grafana/Tempo]
MET --> GRAF
LOGS --> GRAF
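The correlation rule above relies on every hop reading the incoming traceparent header and emitting a new one with the same trace ID. The following is a minimal sketch of that step in Python, assuming only the version 00 field layout of W3C Trace Context; the function names are illustrative, and in a real service the OTel SDK propagators do this work.

```python
import re
import secrets

# Layout assumed here: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid in W3C Trace Context.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed is None:
        # Assumption for this sketch: a missing or invalid header starts a new sampled trace.
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _, sampled = parsed
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

A gateway would call `child_traceparent` with the inbound header and attach the result to each downstream request, so services and jobs share one trace ID.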
Required Attributes (span/log/metric labels)¶
| Key | Source | Purpose |
|---|---|---|
| traceId, spanId | OTel | Correlation |
| tenantId | Gateway/Service | Multi-tenant scoping |
| edition | Gateway/Config | Entitlement context |
| routeId / operation | Gateway/Service | API and domain op naming |
| service.name, service.version | OTel resource | Ownership & rollout correlation |
| messaging.system, message.type, message.key | MassTransit | EDA correlation & idempotency checks |
| db.system, db.statement (redacted) | ADO/EF | Hot query detection |
| http.method, http.route, http.status_code | ASP.NET Core | API SLOs |
| job.name, job.idempotencyKey | Jobs | Job tracing |
| agentId (AI) | AI Orchestration | Agent provenance |
PII is never recorded in attributes or logs. Use hashed or tokenized identifiers when needed for joins.
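As an illustration of the two rules above (required keys present, identifiers tokenized rather than raw), here is a small Python sketch. The key subset, the 16-character truncation, and the function names are assumptions made for the example, not part of the platform contract.

```python
import hashlib
import hmac

# Illustrative subset of the required attribute keys from the table above.
REQUIRED_KEYS = ("traceId", "tenantId", "edition", "service.name")


def tokenize(value, key):
    """Keyed hash: the same input yields the same token, so signals can be joined
    without storing the raw identifier. `key` is a secret held outside the telemetry path."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def missing_required(attributes):
    """Return the required keys that are absent or empty on a span/log record."""
    return [k for k in REQUIRED_KEYS if not attributes.get(k)]
```

A CI check or middleware guard could reject any span for which `missing_required` returns a non-empty list.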
Span & Metric Conventions¶
Span naming
- HTTP: http <VERB> <route> (e.g., http GET /api/tenants/{id})
- Domain: <context>.<usecase> (e.g., billing.rate-usage)
- Messaging: consume <type> / publish <type>
- Jobs: job <name> (e.g., job dlq-replay)
Key metrics (Prometheus/OpenTelemetry Metrics)
- API: http_server_duration_seconds (histogram), http_requests_total, http_errors_total
- Messaging: consumer_lag_seconds, messages_processed_total, consumer_retry_total, dlq_depth
- DB: db_client_duration_seconds, db_connections_in_use, deadlocks_total
- Cache: cache_hit_ratio, cache_latency_seconds
- Jobs: job_duration_seconds, job_failures_total, job_retries_total
- SLO helpers: slo_availability_ratio, slo_latency_budget_burn, slo_error_budget_remaining
Logging
- Structured JSON with timestamp, level, message, traceId, tenantId, edition, service, operation, exception.type, exception.stack (hash or summarized), fields{}.
- Log levels: Info for state transitions, Warn for transient/backoff, Error for failed business operations, Fatal for process crash.
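To make the log shape concrete, here is a minimal Python sketch of a formatter that emits the fields listed above. It is only an illustration: the platform's services are .NET, the class name is hypothetical, and Python's level names (WARNING, CRITICAL) differ from the Warn/Fatal labels used in this document.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the standard correlation fields."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # Correlation fields arrive via logging's `extra=` mechanism; None if not supplied.
            "traceId": getattr(record, "traceId", None),
            "tenantId": getattr(record, "tenantId", None),
            "edition": getattr(record, "edition", None),
            "service": getattr(record, "service", None),
            "operation": getattr(record, "operation", None),
            "fields": getattr(record, "fields", {}),
        }
        if record.exc_info:
            # Only the exception type is logged here; stack summarization is left out of the sketch.
            entry["exception.type"] = record.exc_info[0].__name__
        return json.dumps(entry)
```

A caller would attach the formatter to a handler and log with `logger.info("tenant activated", extra={"tenantId": "t-123", "traceId": "..."})`.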
SLO Monitoring & Error Budgets¶
Golden paths (examples)
- Auth token issuance: p95 ≤ 150 ms, availability ≥ 99.95%
- Tenant onboarding (create → active): p95 ≤ 60 s, success rate ≥ 99%
- Config evaluation: p95 ≤ 5 ms, availability ≥ 99.99%
- Event ingestion to handling lag: p95 ≤ 60 s
- Billing subscription activation (saga): p95 ≤ 2 min, failure < 0.5%
Error budget policy
- If monthly SLO breaches consume >50% of budget: freeze non-urgent releases, run reliability epics.
- Hard guardrails (no budget): Security, Telemetry integrity (missing tenantId/traceId), PII leakage.
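The error budget policy can be sketched as simple arithmetic. The Python below assumes a request-based availability SLO and reads the 50% threshold from the policy above; the function names and return labels are hypothetical.

```python
def error_budget_consumed(slo_target, total_requests, failed_requests):
    """Fraction of the period's error budget already used (1.0 means fully spent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return float("inf") if failed_requests else 0.0
    return failed_requests / allowed_failures


def release_policy(consumed):
    """Apply the policy above: more than 50% of budget consumed freezes non-urgent releases."""
    return "freeze-non-urgent-releases" if consumed > 0.5 else "normal"
```

For example, with a 99.95% target and 1,000,000 requests in the month, the budget is about 500 failed requests, so 300 failures consume roughly 60% of it and trigger the freeze.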
Example Dashboards (sections)¶
- Edge/Gateway
  - Requests/sec by routeId, p95/99 latency, 4xx/5xx rates, rate-limit hits, canary weight, ejected backends.
- Service Health (per context)
  - API duration histogram, dependency latency (DB/Redis/HTTP), error rate, CPU/mem, pod restarts, rolling version mix.
- Messaging
  - Topic depth, subscription lag, consumer throughput, retry counts, DLQ depth/age, replay outcomes.
- Jobs
  - Success/failure, duration percentiles, retries, next runs, DLQ replayer status.
- Tenancy Overview
  - Top tenants by RPS, hottest partitions, promotion candidates, edition distribution, per-tenant error rates.
- AI Orchestration
  - PR acceptance %, gate pass rate, artifact hash reproducibility, cost usage.
Alerting (examples)¶
| Alert | Condition | Severity | Playbook |
|---|---|---|---|
| API latency SLO breach | p95 http_server_duration_seconds > SLO for 5m | P1 | Scale out, check DB latency, roll back canary |
| Availability dip (service) | 5xx rate > 2% for 10m | P1 | Trigger incident, flip traffic to stable, examine error budget |
| Consumer backlog | consumer_lag_seconds > 120s or dlq_depth increasing | P1 | Scale consumers, inspect DLQ samples, enable backpressure |
| Missing tenant context | % spans without tenantId > 0.01% | P0 | Block deployment; fix middleware; postmortem required |
| Cache miss storm | cache_hit_ratio < 70% for 10m | P2 | Warm cache, check invalidation loop |
| Key rotation nearing expiry | cert/key days_to_expiry < 14 | P2 | Rotate keys, verify JWKS/certs in all environments |
Sampling & Cost Controls¶
- Traces: start with 10–20% head sampling; add tail-based sampling at the Collector so that 100% of error and slow traces are kept.
- Logs: Info logs are rate-limited with burst controls; DEBUG logging is disabled in production (enable via a scoped feature flag for time-boxed windows).
- Metrics: prefer histograms over raw timings; align bucket bounds with SLOs.
- Cardinality hygiene: bound label sets (e.g., truncate user/IDs, never include raw emails).
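The tail-based sampling rule above (keep every error or slow trace, sample the rest) can be sketched as a single decision function. This Python version is an assumption-laden illustration: the 1000 ms slow threshold, the 10% baseline, and the CRC-based bucketing are choices made for the example, and a real deployment would express the policy in the Collector's tail-sampling configuration instead.

```python
import zlib


def keep_trace(trace_id, has_error, duration_ms, slow_threshold_ms=1000, baseline_rate=0.10):
    """Decide, after a trace has completed, whether to keep it."""
    # Error and slow traces are always kept.
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    # Deterministic bucket derived from the trace ID, so every Collector
    # replica makes the same decision for the same trace.
    bucket = zlib.crc32(trace_id.encode("ascii")) % 10_000
    return bucket < int(baseline_rate * 10_000)
```

Deriving the bucket from the trace ID rather than a random draw keeps the decision reproducible when the same trace is evaluated more than once.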
Instrumentation Checklist (service template defaults)¶
- .NET OTel AspNetCore, HttpClient, SqlClient, MassTransit instrumentations enabled.
- Correlation middleware ensures tenantId, edition, traceId on all spans/logs.
- Health endpoints: /healthz (liveness), /readyz (readiness); export otel.instrumentation.version.
- Startup failure telemetry: if app crashes before OTel init, fall back to minimal bootstrap logger writing to stderr with correlation IDs.
- Synthetic checks tagged as client=synthetic to avoid skewing user metrics.
Data Safety & Privacy¶
- Redaction at source: PII scrubbers for logs; parameter values in SQL statement text are redacted.
- Audit alignment: link audit IDs in spans for admin actions; audit remains immutable.
- Access control: Observability backends require SSO + RBAC; tenant-scoped views for support, platform-wide for operators.
Failure Modes & Mitigations¶
| Failure | Symptom | Mitigation |
|---|---|---|
| Missing tenantId in spans/logs | Hard to triage multi-tenant issues | Block deployment (CI check); runtime guard drops context-less requests in non-public routes |
| Trace flood / high cost | Collector pressure, storage bills | Tail-based sampling; dynamic sampling policies; drop noisy internal spans |
| High cardinality labels | Prometheus OOM, slow queries | Static label allowlist; bounds on IDs; drop/rename labels |
| Collector outage | Telemetry gaps | Local buffering with retry; secondary collector; alert on exporter failures |
| Log PII leakage | Compliance risk | PII scanners in CI + runtime; auto-redaction; SEV-0 response |
Solution Architect Notes¶
- Define SLOs before code; dashboards and alerts are part of the service template and PR checked.
- Favor tail-based sampling to keep costs predictable while capturing the right traces.
- Tie deployment safety to observability: no public endpoint until OTel signals and dashboards are verified.
- Make tenant and edition the first-class dimensions in every query and board—support depends on it.
Security & Threat Modeling¶
Purpose¶
Embed security-first principles into the SaaS platform design. Apply STRIDE threat modeling across all trust boundaries, enforce Zero Trust at edge and service-to-service, manage secrets with rotation, and ensure supply-chain integrity through SBOM and artifact signing. Security is non-negotiable and integrated into CI/CD and runtime.
Threat Model (STRIDE per Boundary)¶
| Boundary | Spoofing | Tampering | Repudiation | Information Disclosure | Denial of Service | Elevation of Privilege |
|---|---|---|---|---|---|---|
| Edge / API Gateway | Token forgery, session hijack | Request manipulation | Missing request logs | Sensitive data leakage via headers | Flooding / DDoS | Path traversal, privilege escalation |
| Service-to-Service | Forged service identity | Malicious message injection | Missing correlation | Overexposed events (tenant mix) | Queue flooding | Overbroad service scopes |
| Data Stores | Stolen credentials | SQL injection, blob tampering | No audit trails | Unencrypted data at rest | Hot partition overload | Misconfigured RLS, schema promotion abuse |
| CI/CD & Supply Chain | Build agent spoofing | Artifact tampering | Build logs altered | Secrets leakage in logs | Malicious PR floods pipeline | Malicious dependency injection |
| Tenant-Facing UIs | Phishing via iframe injection | DOM/XSS | Lack of audit for admin actions | Misconfigured CORS | Brute-force auth | Unscoped RBAC flaws |
| AI Agents | Prompt injection | Malicious PR diffs | No AI provenance logs | Leakage of sensitive configs | Resource abuse (cost spike) | Agent bypassing guardrails |
Zero Trust Principles¶
- Authenticate everything: OAuth2/OIDC at edge; mTLS for inter-service; workload identities instead of static secrets.
- Authorize explicitly: RBAC/ABAC checks enforced per operation; deny by default.
- Audit everywhere: Immutable logs, linked to traceId, tenantId, actor.
- Segment aggressively: per-service network policies, per-tenant data isolation (RLS/DB).
- Assume breach: Red team simulation, chaos-security drills, SEV-0 for detected cross-tenant leakage.
Secrets & Key Management¶
- Azure Key Vault (default) for secrets, keys, certs.
- Rotation policies:
  - Tokens/keys ≤ 90 days
  - Certificates ≤ 1 year (auto-rotate with Key Vault Certificates)
- Workload identity: Replace connection strings with AAD Managed Identity or Kubernetes Workload Identity.
- Zero secrets in repo: Pre-commit hooks + CI scanners enforce.
Mutual TLS & Service Mesh Posture¶
- Service-to-service: All gRPC/HTTP calls secured with mTLS via service mesh (e.g., Linkerd, Istio) or YARP with TLS termination.
- Certificates: Issued and rotated automatically by mesh/Key Vault integration.
- Trust store: Centralized CA; only platform-issued certs accepted.
- Fallback: If mesh unavailable, services must still enforce TLS 1.2/1.3 with pinned certs.
Input Validation & Hardening¶
- Gateway: global request validation (size, schema, rate).
- Services: contract validation (OpenAPI, JSON Schema), input sanitization.
- Databases: use parameterized queries (NHibernate/EF Core), enforce RLS.
- Containers: minimal base images, read-only root FS, drop Linux capabilities, seccomp/AppArmor.
- Kubernetes: pod security baseline; deny privileged containers.
Supply Chain Security¶
- SBOM: generated per build (CycloneDX/Syft).
- Dependency scanning: SCA in CI (Dependabot/Renovate + OSS Review Toolkit).
- Artifact signing: cosign for container images; verify at deploy.
- Provenance: SLSA Level 3 baseline: attestations for build, source, and dependencies.
- Policy gates: block deploy if unsigned or vulnerable artifact.
Security Profile per Component¶
| Component | AuthN/AuthZ | Data Security | Audit | Hardening |
|---|---|---|---|---|
| API Gateway | OAuth2/OIDC, JWT validation, DPoP (optional) | TLS termination | Full request/response logs | WAF rules, DoS protection |
| Identity Service | OIDC/OAuth2, MFA for admin | SQL TDE, hashed passwords (Argon2id) | Token issuance logs | SCIM support, federation |
| Tenant Service | Scoped to tenantId | RLS enforced | Tenant lifecycle logs | Residency enforcement |
| Billing | Signed callbacks, PCI DSS zone | Ledger immutability, encryption | Invoice/payment audit | Saga compensations logged |
| Config Service | RBAC (admin only mutations) | Encrypt secrets/flags | Config change history | Kill switch validation |
| Usage Service | Signed ingestion events | Partitioned by tenant | Usage adjustment ledger | Throttled ingestion |
| Notifications | Signed webhook delivery | Encrypt templates (at rest) | Delivery logs | Provider ACL |
| Audit Service | Append-only | Immutable schema, retention policy | Non-repudiation | Export controls |
| AI Orchestration | RBAC + agent identity | Redaction in prompts | AI provenance logs | Prompt-injection filters |
CI/CD Security Gates¶
- Static Analysis (SAST): Roslyn analyzers, SonarQube.
- Secrets scanning: GitLeaks/TruffleHog in pipeline.
- Dependency scanning (SCA): alerts + fail on critical.
- Container scan: Anchore/Grype/Trivy in CI.
- Infra as Code scan: Checkov/OPA on Bicep/Pulumi.
- PR checks: ADR presence, SBOM generated, cosign verification.
Observability & SLOs (Security)¶
- Signals: token issuance latency, failed login rate, RBAC policy evaluation, key rotation lag, SBOM freshness.
- SLO targets:
  - Token issuance success ≥ 99.95%
  - Cross-tenant access violations = 0
  - Secrets exposure in repo = 0
  - Key/cert expiry incidents = 0
  - Critical CVEs unpatched ≤ 7 days
Failure Modes & Mitigations¶
| Failure | Symptom | Mitigation |
|---|---|---|
| Tenant data leakage | Cross-tenant records in query/event | Block at DAL/middleware; automated test harness; SEV-0 incident response |
| Expired certs | Service call failures | Auto-rotation with Key Vault; monitor expiry; staged rotation tests |
| Supply chain attack | Dependency injection of malware | SBOM + provenance; strict registry mirror; signed artifacts only |
| Prompt injection (AI) | Malicious PR diffs or secret exfiltration attempts | Input sanitizers; denylist filters; tool-call allowlist |
| Stolen secrets | Credential replay | Rotate keys; enforce workload identity; block static credentials |
Solution Architect Notes¶
- Treat security violations as SEV-0—no tolerance for cross-tenant leakage or unsigned artifacts.
- Bake security gates into templates so generated services are secure-by-default.
- Prefer short-lived credentials + workload identity over all else.
- Keep attack surface minimal: fewer protocols, minimal images, tight RBAC.
- Run quarterly threat model reviews and refresh STRIDE table as system evolves.
Resilience & Reliability Policies¶
Purpose¶
Establish a resilience-first posture to ensure services remain predictable, recoverable, and observable under failures. Standardize timeouts, retries, circuit breakers, bulkheads, and fallbacks across all services, supported by chaos experiments and steady-state SLO validation.
Core Reliability Patterns¶
| Pattern | Application | Notes |
|---|---|---|
| Timeouts | Every external call (HTTP/gRPC/DB/cache/message broker) | Default 2–5s; domain-specific overrides; no unbounded waits |
| Retries | Transient failures (network, 429, 5xx) | Exponential backoff + jitter; max attempts ≤ 5; idempotent only |
| Circuit Breakers | Downstream repeated failures | Half-open after cooldown; reject early to prevent cascades |
| Bulkheads | Resource partitioning | Thread pool isolation per dependency; partition tenants if noisy |
| Fallbacks | Non-critical paths (config, cache, search) | Return defaults/stale data; never bypass auth/billing/audit |
| Graceful Degradation | Feature toggles | Disable non-essential modules to preserve core flows |
| Idempotency | API + event handlers | Safe retries, deduplication; enforced by outbox/inbox |
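The circuit breaker row can be illustrated with a small state machine. This Python sketch simplifies in two ways that are assumptions of the example rather than platform policy: it counts consecutive failures instead of failures within a time window, and it lets any request through while half-open instead of a single trial call. In the .NET service templates a resilience library would normally provide this behavior.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            return "half-open"  # cooldown elapsed: allow a trial
        return "open"

    def allow_request(self):
        """Reject early while open to prevent cascading failures."""
        return self.state != "open"

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        if self.state == "half-open":
            self._opened_at = self._clock()  # trial failed: re-open and restart the cooldown
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

Callers check `allow_request()` before the downstream call and report the outcome with `record_success()` or `record_failure()`.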
Policy Matrix¶
| Dependency | Timeout | Retry Policy | Circuit Breaker | Bulkhead | Fallback |
|---|---|---|---|---|---|
| API Gateway → Service | 3s | 3x exponential (100ms → 2s) | Open after 5 failures / 30s | Route pool partitioning | Serve cached config/error envelope |
| Service → DB (SQL) | 5s | 2x linear (1s) | Open after 3 failures / 10s | Connection pool isolation per tenant | None; fail-fast |
| Service → Cache (Redis) | 2s | 3x exponential (50ms → 1s) | Open after 5 failures / 15s | Separate connection pools | Stale read (optional) |
| Service → Service Bus | 5s | 5x exponential (100ms → 5s) | Open after 10 failures / 60s | Consumer concurrency limits | Store in outbox for retry |
| Service → External Provider (Payments, Email) | 10s | 4x exponential (500ms → 30s) | Open after 5 failures / 60s | Thread pool partition | Retry + DLQ; tenant notified |
| Jobs (background) | Job-level SLA (e.g., 60s) | Retry up to 3x | Abort if circuit open | Queue partitioning | Reschedule; quarantine tenant batch |
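The retry column of the matrix can be read as "N retries with exponentially growing, capped delays". The Python sketch below shows one way to compute such a schedule with full jitter and apply it; `TransientError`, the function names, and the choice of full jitter are assumptions for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (network error, 429, 5xx)."""


def backoff_delays(retries, base_seconds, cap_seconds, rng=random.random):
    """Full jitter: each delay is uniform in [0, min(cap, base * 2**n))."""
    delays = []
    for attempt in range(retries):
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays


def call_with_retries(operation, delays, sleep=time.sleep):
    """Run an idempotent operation, sleeping between retries on transient failure."""
    for delay in delays:
        try:
            return operation()
        except TransientError:
            sleep(delay)
    return operation()  # final attempt: any exception propagates to the caller
```

Under these assumptions, the "API Gateway → Service" row (3x exponential, 100ms → 2s) corresponds to `backoff_delays(3, 0.1, 2.0)`. Only idempotent operations should be wrapped this way.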
Chaos Engineering & Failure Injection¶
Goals
- Validate that resilience patterns protect SLOs under fault injection.
- Prove steady-state system remains within error budget even under chaos.
Scenarios
- Network latency: inject 500 ms–2 s delays between services.
- Dependency crash: kill DB/read replica; service bus outage.
- Message flood: burst 10× normal tenant traffic.
- Certificate expiry: simulate expired TLS/mTLS certs.
- Cache poisoning: inject stale config flags.
- AI agent misuse: agent suggests invalid scaffolding PRs.
Execution
- Use chaos mesh/litmus in AKS, or Azure Chaos Studio.
- Run during off-peak; abort if critical SLO breach > 15 min.
- Record metrics: error rate, p95 latency, recovery time.
Resilience Testing & Steady-State SLOs¶
Steady-State Hypothesis: “The system continues to meet its defined SLOs under injected failures.”
Test Harness
- Synthetic checks (login, tenant onboarding, subscription activation, config flag evaluation).
- Run continuously; inject chaos in staging and periodically in prod (with guardrails).
SLO Verification
- Auth token issuance: ≥ 99.95% success under chaos.
- Tenant onboarding: p95 ≤ 60s with one DB node offline.
- Config evaluation: p95 ≤ 10ms even if Redis down (fallback applies).
- Billing saga: compensates gracefully within 2 min if provider unreachable.
Observability Integration¶
- OTel spans mark retries, fallback paths, circuit breaker open/half-open.
- Metrics: retry_attempts_total, circuit_open_total, fallback_requests_total, bulkhead_rejections_total.
- Dashboards: reliability view per service, error budget burn rate, chaos experiment outcomes.
- Alerts:
  - Circuit breaker open rate > 5% (P1)
  - Retry storm > 1000/min (P1)
  - Error budget burn > 20% in 24h (P0 freeze new releases)
Failure Modes & Mitigations¶
| Failure | Symptom | Mitigation |
|---|---|---|
| Retry storm → overload | High CPU, cascading failure | Add jitter; exponential backoff; max retry cap |
| Circuit breakers too aggressive | False positives; degraded UX | Tune thresholds; log open/close events |
| Bulkhead misconfig | Resource starvation across tenants | Separate thread pools; enforce quotas |
| Fallback leakage | Sensitive data exposed in defaults | Redact PII; fallback only to safe defaults |
| Chaos experiment causes outage | Real users impacted | Run in staging; prod chaos with guardrails and kill switch |
Solution Architect Notes¶
- Bake resilience policies into templates—engineers shouldn’t reinvent timeouts or retries.
- Ensure compensations are explicit, observable, and auditable.
- Treat chaos engineering as validation of design assumptions, not an afterthought.
- Tie resilience metrics into error budget policies to guide release decisions.
Jobs & Scheduling¶
Purpose¶
Standardize background processing across the platform to handle recurring, delayed, or one-off workloads that do not fit into synchronous request/response or event-driven flows. Ensure jobs are idempotent, observable, and auditable, with UTC-based scheduling and clear operational runbooks for reliability.
Job Frameworks¶
| Type | Default Framework | Notes |
|---|---|---|
| Short-lived, ad hoc background work | KEDA + Service Bus Queue Consumers | Elastic scaling; event-driven jobs |
| Recurring & delayed jobs | Hangfire (SQL/Redis backend) | Cron-based recurring jobs; dashboards |
| Heavy/batch jobs | KEDA scaling with containerized workers | Scale based on queue depth/metrics |
| Maintenance jobs | Kubernetes CronJobs | Low-frequency (e.g., nightly cleanup, backups) |
Principle: Jobs are treated as first-class services with the same tenancy, security, and observability constraints as APIs.
Scheduling Strategy¶
- Time Standard: All schedules defined in UTC (no local timezones to avoid drift).
- Recurrence:
  - Hangfire CRON expressions stored under jobs/schedules/config.
  - Critical jobs documented with frequency, duration SLA, and recovery playbook.
- One-off jobs: Triggered via API or CLI, persisted in job store with jobId, tenantId, and status.
- Drift prevention: Time sync enforced across nodes (NTP).
Idempotency & Safety¶
- Idempotency key: every job carries a unique jobKey=<jobName>:<scope>:<timestamp> (e.g., invoice:tenant123:2025-10-01).
- Retries: capped exponential backoff (max attempts configurable); retries logged with correlation IDs.
- Poison jobs: moved to DLQ table with full error context and manual replay tooling.
- Concurrency guard: distributed locks (e.g., SQL/Redis) to prevent double-execution of same job key.
- Cancellation: jobs respond to cancellation tokens; long-running jobs checkpoint progress.
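The idempotency key and concurrency guard above can be sketched together. The Python below uses an in-memory guard as a stand-in for the SQL/Redis distributed lock; the class and method names are hypothetical, and a real lock would also need an expiry so a crashed worker does not hold the key forever.

```python
import threading


def job_key(job_name, scope, period):
    """Build the jobKey in the <jobName>:<scope>:<timestamp> form described above."""
    return f"{job_name}:{scope}:{period}"


class InMemoryJobGuard:
    """Single-process stand-in for a distributed lock plus completion record."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = set()
        self._completed = set()

    def try_start(self, key):
        """Return True only for the first caller; duplicates and re-runs are refused."""
        with self._lock:
            if key in self._running or key in self._completed:
                return False
            self._running.add(key)
            return True

    def finish(self, key, succeeded):
        """Release the key; a failed run may be retried, a successful one may not."""
        with self._lock:
            self._running.discard(key)
            if succeeded:
                self._completed.add(key)
```

A worker would call `try_start(job_key("invoice", "tenant123", "2025-10-01"))` and skip the job when it returns False.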
Job Categories (examples)¶
| Category | Examples | Notes |
|---|---|---|
| Operational | Tenant promotion, DB migrations, cache warm-up | High-priority, operator-triggered |
| Business | Invoicing, subscription renewal, quota resets | Time-critical, audited |
| Data | Projections rebuild, DLQ replay, report generation | Idempotent; resumable |
| Notifications | Email/SMS campaigns, retries | Staggered sends to avoid spikes |
| Maintenance | Cleanup expired sessions, rotate keys, purge logs | Non-urgent; must not impact SLOs |
Observability & Telemetry¶
Traces
- job.schedule (scheduled at time X)
- job.execute (span per execution, includes jobId, jobKey, tenantId)
- job.retry and job.fail events
Metrics
- job_duration_seconds (histogram per job type)
- job_success_total, job_failure_total
- job_retries_total
- job_scheduled_next{jobName} (gauge, Prometheus style)
- dlq_jobs_total, dlq_jobs_age_seconds
Dashboards
- Job health view: success/failure rates, retry counts, p95 execution times.
- DLQ board: job type, failure reason distribution, replay outcomes.
Operational Runbooks (baseline)¶
- Job failed repeatedly
  - Inspect DLQ table entry.
  - Review trace/log context.
  - Fix underlying cause (data, downstream).
  - Trigger replay via job API (POST /api/jobs/{id}:replay).
- Job backlog growing
  - Check KEDA scaling triggers.
  - Scale out workers (manual override if needed).
  - Verify no partition hot-spot (tenant/queue).
  - Clear DLQ separately.
- Misconfigured CRON
  - Validate against jobs/schedules/config.
  - Ensure UTC alignment.
  - Correct CRON expression; redeploy.
- Stuck job
  - Cancel job via API.
  - Mark as failed with checkpoint.
  - Restart from last checkpoint.
Failure Modes & Mitigations¶
| Failure | Symptom | Mitigation |
|---|---|---|
| Duplicate execution | Two workers run same job | Use distributed lock; idempotency key |
| DLQ growth | Too many poison jobs | Replay tooling; operator alerts |
| Clock drift | Jobs run at wrong time | Enforce UTC; NTP sync across nodes |
| Long-running jobs hang | High latency, blocking resources | Cancellation tokens; checkpoints; watchdog alerts |
| Thundering herd on retries | Spikes after outage | Retry with jitter/backoff; stagger scheduling |
Solution Architect Notes¶
- No hidden jobs: every job must be defined in config and visible on dashboards.
- Always idempotent: jobs must tolerate retries and safe re-execution.
- Runbooks mandatory: each recurring job must include operational steps in /ops/runbooks/jobs/.
- Chaos test jobs: periodically inject job failures to validate retries and DLQ handling.
Configuration, Feature Flags & Edition Overrides¶
Purpose¶
Standardize configuration, feature flagging, and edition-specific overrides across the SaaS platform using external configuration/flag microservices (not embedded in the platform itself). Ensure safe rollouts, per-edition capability gating, kill-switches, and integration with Azure App Configuration and Key Vault for secrets.
Principles¶
- Externalization: all runtime configuration, flags, and edition overlays are retrieved from Config/Feature microservices, not embedded in code.
- Separation of concerns:
  - Secrets → Key Vault
  - Application settings → Azure AppConfig (or external Config service)
  - Feature flags → Feature Flag Service (edition- and tenant-aware)
- Progressive rollout: gradual exposure of new features via percentage, tenant, or edition filters.
- Kill-switches: rapid disablement of features at runtime for stability or security.
- Edition overlays: baseline features defined in Metadata/Entitlements, overridden dynamically by edition or tenant flags.
Integration Pattern¶
Service startup flow
- Service authenticates via workload identity.
- Fetches configuration values (non-secret) from AppConfig or Config microservice.
- Fetches secrets from Key Vault (connection strings, API keys).
- Subscribes to config change notifications (events).
- Evaluates feature flags via Flag Evaluation API with context: { tenantId, edition, userId, environment }.
- Uses cached flag values with TTL; invalidates on config.flag.updated events.
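The last two startup steps (cached flag values with a TTL, invalidated by config.flag.updated events) can be sketched as a small cache. This Python version is illustrative only: the class name, the 30-second default TTL, and the `fetch` callback standing in for the Flag Evaluation API are assumptions.

```python
import time


class FlagCache:
    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}   # (flag_key, tenant_id) -> (value, expires_at)

    def get(self, flag_key, tenant_id, fetch):
        """Return the cached value, calling fetch(flag_key, tenant_id) on miss or expiry."""
        entry = self._entries.get((flag_key, tenant_id))
        if entry is not None and entry[1] > self._clock():
            return entry[0]
        value = fetch(flag_key, tenant_id)
        self._entries[(flag_key, tenant_id)] = (value, self._clock() + self._ttl)
        return value

    def on_flag_updated(self, flag_key):
        """Handle a config.flag.updated event: drop every tenant's cached value for the flag."""
        for key in [k for k in self._entries if k[0] == flag_key]:
            del self._entries[key]
```

Keying entries by tenant keeps one tenant's override from being served to another, while the event handler keeps kill-switch latency below the TTL.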
Feature Flag Evaluation¶
Flag structure
{
"flagKey": "betaFeatureX",
"default": false,
"rules": [
{ "target": "edition:Enterprise", "value": true },
{ "target": "tenant:t-123", "value": true },
{ "target": "percentage:10", "value": true }
],
"killSwitch": false,
"updatedAt": "2025-09-29T11:00:00Z"
}
Evaluation precedence
- Kill-switch (if true → disabled globally).
- Tenant-specific override.
- Edition-specific override.
- Progressive rollout filters (e.g., percentage, region).
- Default value.
Example API (Config Service external)
GET /api/flags/{flagKey}?tenantId=t-123&edition=Standard
200 OK
{
"flagKey": "betaFeatureX",
"value": true,
"reason": "tenant-override"
}
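The evaluation precedence can be expressed directly in code. The Python sketch below works on the flag structure shown earlier; the hash-based percentage bucketing and every reason string other than "tenant-override" are assumptions of this example, not confirmed behavior of the Config Service.

```python
import hashlib


def _in_percentage(flag_key, tenant_id, percent):
    """Stable bucket in [0, 100) per flag and tenant, so rollout membership does not flap."""
    digest = hashlib.sha256(f"{flag_key}:{tenant_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


def evaluate(flag, tenant_id, edition):
    """Return (value, reason) following the precedence order listed above."""
    if flag.get("killSwitch"):
        return False, "kill-switch"
    rules = flag.get("rules", [])
    # Tenant overrides win over edition overrides.
    for kind, wanted, reason in (("tenant", tenant_id, "tenant-override"),
                                 ("edition", edition, "edition-override")):
        for rule in rules:
            if rule["target"] == f"{kind}:{wanted}":
                return rule["value"], reason
    # Progressive rollout filters come next (only percentage is sketched here).
    for rule in rules:
        kind, _, arg = rule["target"].partition(":")
        if kind == "percentage" and _in_percentage(flag["flagKey"], tenant_id, int(arg)):
            return rule["value"], "percentage-rollout"
    return flag["default"], "default"
```

With the betaFeatureX definition above, `evaluate(flag, "t-123", "Standard")` yields `(True, "tenant-override")`, matching the example API response.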
Edition Overlays¶
- Baseline: Metadata API defines products, editions, and entitlements.
- Overlay: Config/Flag service applies dynamic rules on top (enable/disable features at runtime).
- Example:
  - Edition: Standard includes Feature A.
  - Config overlay disables Feature A for tenant:t-999 (due to contractual restriction).
  - Config overlay enables Feature B early for tenant:t-123 as part of beta.
Kill-Switches¶
- Definition: global flags (kill:<featureKey>) that force-disable functionality.
- Usage: emergency disablement of faulty/new features.
- Enforcement: evaluated in gateway and service middleware; takes precedence over tenant/edition overrides.
- Auditability: all kill-switch flips logged in Audit service (audit.action.logged.v1).
Rollout Flows¶
Progressive rollout (percentage-based)
- Flag created as default=false.
- Rule applied: percentage=5% for target feature.
- Gradually increased to 25%, 50%, 100%.
- Rollout metrics tracked: error rates, adoption, SLO impact.
- Kill-switch available at every stage.
Targeted rollout (tenant or edition-based)
- Enterprise tenants get early access to new feature.
- Beta tenants opt-in by contract.
- Free/Standard remain unaffected until GA.
Operational Runbook
- Document rollout strategy in /ops/runbooks/flags/<flagKey>.md.
- Define: target tenants, monitoring metrics, rollback steps.
Observability¶
- Spans: config.fetch, config.evaluate, flag.evaluate.
- Metrics:
  - config_fetch_latency_seconds (p95 ≤ 50ms)
  - flag_eval_latency_seconds (p95 ≤ 5ms, Redis cached)
  - config_update_events_total
  - flag_killswitch_activated_total
- Dashboards: flag evaluation success rate, cache hit ratio, rollout coverage %.
- Alerts: kill-switch activation triggers P1 audit/notification; flag evaluation failures > 0.1% → P1 incident.
Security & Tenancy¶
- Secrets: only pulled from Key Vault with workload identity; rotated automatically.
- Config/Flags: always evaluated with tenantId; anonymous calls disallowed.
- Multi-tenant enforcement:
  - RLS at config service.
  - Tenant isolation in flag definitions.
  - Edition scoping enforced at query layer.
Failure Modes & Mitigations¶
| Failure | Symptom | Mitigation |
|---|---|---|
| Config service outage | Features unavailable / mis-evaluated | Local cache with TTL; stale-while-revalidate |
| Kill-switch delay | Feature not disabled fast enough | Push invalidation events; force cache bust |
| Overlapping overrides | Confusing feature state | Precedence rules enforced; audit trail required |
| Flag drift between envs | Inconsistent tenant experience | Config sync job; environment drift monitor |
| Secrets leakage | Secrets or PII stored in plain config | CI scans; Key Vault only; deny secrets in AppConfig |
Solution Architect Notes¶
- Treat Config & Feature Flags as external microservices—our SaaS platform integrates with them, but does not own them.
- AppConfig + Key Vault are integration points for base values/secrets, but dynamic logic lives in Config Service.
- Bake flag evaluation middleware into service templates so every new service is “flag-aware.”
- Document rollout/rollback strategies per flag—never release without a defined kill-switch.
API Design Standards & Versioning Policy¶
Purpose¶
Provide factory-wide API conventions so every generated product exposes a predictable, secure, and evolvable interface. External APIs are REST-first; internal service-to-service can additionally use gRPC where efficiency or streaming is beneficial. Contracts are spec-first (OpenAPI/Protobuf) with strong governance, multi-tenant invariants, and clear deprecation windows.
Protocol Posture¶
| Audience | Protocol | Usage |
|---|---|---|
| External (public) | REST/HTTP | CRUD and task-oriented endpoints; JSON over HTTPS; HATEOAS not required; Problem+JSON. |
| Internal (services) | gRPC (opt) | High-throughput, low-latency calls; streaming; strictly inside the mesh with mTLS. |
| Async | Events/Webhooks | Event-driven facts via Service Bus; signed webhooks for tenant integrations. |
Rule: prefer async events over synchronous cross-context calls. Use gRPC only when you need tight request/response between trusted services.
Versioning Strategy¶
Resource/API versions
- URI versioning for REST (public): /v1/..., /v2/....
- Content negotiation (optional): Accept: application/vnd.connectsoft.v1+json.
- gRPC: package versioning (package billing.v1;) and additive field numbering.
Compatibility rules
- Minor, additive changes (new fields) do not bump major.
- Breaking changes → new major (v2) and new route set.
- Sunset policy: minimum 12 months support for the previous major after v{N+1} GA (Enterprise can contract longer).
- Deprecation headers: Deprecation, Sunset, Link: <changelog>; rel="deprecation" emitted by Gateway for deprecated versions.
Resource Modeling & Naming¶
- Plural nouns for collections: /tenants, /subscriptions, /flags.
- Hierarchy only where ownership is strict: /tenants/{tenantId}/subscriptions/{id}.
- Actions-as-subresources (no verbs in paths):
  - POST /subscriptions/{id}:cancel
  - POST /tenants/{id}:promote
- Idempotency for unsafe operations: clients send Idempotency-Key (UUID). Server stores outcome for at least 24h.
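The Idempotency-Key rule can be sketched as a store that replays the first outcome for repeated keys within the retention window. This Python version is an in-memory illustration with hypothetical names; a production store would be shared across instances, scope keys per tenant and route, and guard against concurrent first requests.

```python
import time


class IdempotencyStore:
    def __init__(self, retention_seconds=24 * 3600, clock=time.time):
        self._retention = retention_seconds
        self._clock = clock    # injectable so tests can control time
        self._outcomes = {}    # key -> (outcome, expires_at)

    def execute(self, key, operation):
        """Run operation() once per key; return (outcome, replayed)."""
        now = self._clock()
        cached = self._outcomes.get(key)
        if cached is not None and cached[1] > now:
            return cached[0], True  # same key within retention: replay the stored outcome
        outcome = operation()
        self._outcomes[key] = (outcome, now + self._retention)
        return outcome, False
```

A handler for `POST /v1/subscriptions` would wrap its create logic in `execute(idempotency_key, ...)` so a client retry returns the original result instead of creating a second subscription.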
Pagination, Filtering, Sorting¶
Pagination
- Cursor-based by default: ?cursor=<opaque>&limit=50
- Response:
{
"items": [ /*...*/ ],
"pageInfo": { "nextCursor": "eyJhIjoi...", "hasNextPage": true, "count": 50 }
}
Filtering
- ?filter=field1:eq:value1,field2:lt:10 (simple RSQL-like), or discrete params for common fields.
- Disallow filtering on sensitive/PII fields.
Sorting
- ?sort=+createdAt,-name (stable secondary sort by id).
Total counts
- Avoid for hot paths; expose separate HEAD or /stats if needed.
Standard Request & Response Conventions¶
Headers (required/standardized)
- Inbound: `Authorization: Bearer <JWT>`, `X-Tenant-Id` (edge injects if not provided), `X-Request-Id` / `Traceparent`.
- Rate limits: `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`.
- Idempotency: `Idempotency-Key` (unsafe methods).
- Caching: `ETag`, `If-None-Match`, `Cache-Control` for GETs where safe.
Error envelope (Problem+JSON)
{
"type": "https://docs.connectsoft.cloud/problems/tenant-ambiguous",
"title": "Tenant context is ambiguous",
"status": 400,
"traceId": "00-7e0d...-01",
"tenantId": "t-123",
"detail": "X-Tenant-Id header conflicts with token claim",
"errors": [ { "code": "TENANT_MISMATCH", "path": "$.headers.x-tenant-id" } ]
}
- Always include `traceId`, and when present, `tenantId`.
- Avoid PII in `detail`.
Validation
- Use JSON Schema derived from OpenAPI components; reject unknown fields when the `strict=true` flag is enabled (default for internal/gRPC).
Tenancy & Security Invariants¶
- Every authenticated request must resolve a single tenant; ambiguous → `400`.
- Authorization: scope + role check at edge; resource-level ABAC in service.
- Data access is tenant-scoped; cross-tenant resources are never addressable by ID alone.
- PII hygiene: redact in logs; never echo secrets.
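The single-tenant invariant could be enforced at the edge roughly as sketched here. The function name `resolve_tenant` and the `TENANT_MISSING` code are hypothetical; the mismatch case reuses the Problem+JSON envelope shown earlier.

```python
def resolve_tenant(header_tenant, token_tenant, trace_id):
    """Return (tenant_id, None) on success, or (None, problem) with status 400.

    header_tenant: value of X-Tenant-Id, or None if absent.
    token_tenant:  tenant claim from the JWT, or None if absent.
    """
    candidates = {t for t in (header_tenant, token_tenant) if t}
    if len(candidates) == 1:
        return candidates.pop(), None

    conflict = len(candidates) > 1
    problem = {
        "type": "https://docs.connectsoft.cloud/problems/tenant-ambiguous",
        "title": "Tenant context is ambiguous",
        "status": 400,
        "traceId": trace_id,
        # Keep detail free of PII: describe the conflict, not the values.
        "detail": (
            "X-Tenant-Id header conflicts with token claim"
            if conflict
            else "No tenant could be resolved from header or token"
        ),
        "errors": [
            {
                # TENANT_MISSING is an assumed code for illustration only.
                "code": "TENANT_MISMATCH" if conflict else "TENANT_MISSING",
                "path": "$.headers.x-tenant-id",
            }
        ],
    }
    return None, problem
```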
REST Resource Examples (concise)¶
Tenants
POST /v1/tenants
GET /v1/tenants/{tenantId}
PATCH /v1/tenants/{tenantId}
POST /v1/tenants/{tenantId}:promote // isolation level upgrade
Billing
POST /v1/subscriptions // Idempotent; requires Idempotency-Key
GET /v1/subscriptions/{id}
POST /v1/invoices/{id}:pay
GET /v1/subscriptions?cursor=&limit=&sort=
Config & Flags
GET /v1/flags/{key}?context=user:... // evaluation; no anonymous
POST /v1/flags // admin only
POST /v1/overrides // tenant/edition overrides
gRPC Internal Standards (optional)¶
- Service boundaries mirror REST resources, but optimized for chatty or streaming ops (e.g., usage ingest):
    - Package: `usage.v1`, Service: `Ingest` with `rpc Append(stream Record) returns (AppendAck)`
- Auth: mTLS + per-RPC auth via `authorization` metadata; include `x-tenant-id`.
- Backwards compatibility: fields never renumbered; only add new optional fields.
Caching & Conditional Requests¶
- Safe GETs return `ETag`; clients may send `If-None-Match` → `304` for unchanged resources.
- TTL hints via `Cache-Control: private, max-age=30` for non-sensitive tenant reads.
- Do not cache mutable or sensitive resources (billing actions, identity).
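A conditional GET along these lines might look like the sketch below. Deriving the ETag from a hash of the canonical JSON body is an assumption (a row version would work equally well), and the comparison is simplified to a single strong tag rather than the full `If-None-Match` list syntax.

```python
import hashlib
import json


def compute_etag(resource: dict) -> str:
    """Derive a quoted ETag from the canonical JSON form of the resource."""
    body = json.dumps(resource, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def conditional_get(resource: dict, if_none_match=None):
    """Return (status, headers, body); body is None on a 304."""
    etag = compute_etag(resource)
    headers = {"ETag": etag, "Cache-Control": "private, max-age=30"}
    if if_none_match is not None and if_none_match == etag:
        return 304, headers, None
    return 200, headers, resource
```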
Rate Limiting & Quotas¶
- Enforced at gateway; mirrored by services to prevent bypass.
- Headers communicate quota state (see above).
- 429 includes `Retry-After` and `X-RateLimit-Policy` (edition/plan identifier).
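The header behaviour can be illustrated with a fixed-window counter. The algorithm choice is an assumption (the design does not prescribe one), the class name is hypothetical, and state is kept in memory only for the sake of the sketch.

```python
import math


class FixedWindowLimiter:
    """Per-tenant fixed-window limiter that also produces the quota headers."""

    def __init__(self, limit: int, window_seconds: int, policy: str):
        self.limit = limit
        self.window = window_seconds
        self.policy = policy          # edition/plan identifier
        self._windows = {}            # tenant_id -> (window_start, count)

    def check(self, tenant_id: str, now: float):
        """Return (status, headers) for one request arriving at time `now`."""
        start = now - (now % self.window)
        win_start, count = self._windows.get(tenant_id, (start, 0))
        if win_start != start:        # a new window has begun
            win_start, count = start, 0
        reset = int(win_start + self.window)
        allowed = count < self.limit
        if allowed:
            count += 1
        self._windows[tenant_id] = (win_start, count)
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(self.limit - count),
            "X-RateLimit-Reset": str(reset),
        }
        if not allowed:
            headers["Retry-After"] = str(max(1, math.ceil(reset - now)))
            headers["X-RateLimit-Policy"] = self.policy
            return 429, headers
        return 200, headers
```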
Idempotency & Concurrency¶
- Unsafe methods (`POST`, `PATCH`, `DELETE`) accept `Idempotency-Key`; server guarantees exactly-once effect within 24h window.
- Optimistic concurrency: `If-Match: <ETag>` required for updates; stale tag → `412 Precondition Failed`.
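Both rules are sketched below under simplifying assumptions: outcomes live in an in-memory dict scoped by tenant and key, and ETags are derived from an integer `version` field. The names are hypothetical, and a real service would need an atomic insert (for example a unique constraint) to keep the exactly-once guarantee under concurrent retries.

```python
class IdempotencyStore:
    """Remember the outcome of an unsafe operation for a 24h window."""

    TTL_SECONDS = 24 * 60 * 60

    def __init__(self):
        self._outcomes = {}  # (tenant_id, key) -> (stored_at, result)

    def execute(self, tenant_id: str, key: str, operation, now: float):
        """Run `operation` once per (tenant, key); replay the stored result after."""
        slot = (tenant_id, key)
        cached = self._outcomes.get(slot)
        if cached is not None and now - cached[0] < self.TTL_SECONDS:
            return cached[1]
        result = operation()
        self._outcomes[slot] = (now, result)
        return result


def update_with_if_match(resource: dict, if_match: str, changes: dict):
    """Apply `changes` only when If-Match equals the current ETag.

    Returns (status, resource); a stale tag yields 412 and the unchanged resource.
    """
    if if_match != resource["etag"]:
        return 412, resource
    updated = {**resource, **changes, "version": resource["version"] + 1}
    updated["etag"] = f'"v{updated["version"]}"'
    return 200, updated
```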
OpenAPI & SDKs¶
- Source of truth: `contracts/openapi/<service>-v{N}.yaml`.
- Linting rules:
    - Must document auth requirements, tenant context, error responses (`4xx`, `5xx`).
    - Pagination schema reusable (`PageInfo`).
    - Examples for each operation with `tenantId` and `traceId`.
- SDK generation: TypeScript/.NET/Python as needed; versioned per API major.
Deprecation & Sunset Workflow¶
- Mark endpoints/fields as deprecated in OpenAPI (`deprecated: true`) and docs.
- Notify via tenant/admin changelog events and status page.
- Emit deprecation headers from edge for affected routes.
- Observe usage via logs/metrics per version.
- Remove only after window closes and usage < threshold (e.g., <1% over 30 days).
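The final removal gate can be expressed as a small predicate. This is only a sketch: the function name is hypothetical, and the call counts are assumed to come from the per-version metrics aggregated over the observation period (30 days in the example above).

```python
from datetime import date


def removal_allowed(
    sunset: date,
    today: date,
    deprecated_calls: int,
    total_calls: int,
    threshold: float = 0.01,
) -> bool:
    """True when the sunset window has closed and usage is below the threshold."""
    if today < sunset:
        return False          # support window still open
    if total_calls == 0:
        return True           # no traffic at all in the observation period
    return deprecated_calls / total_calls < threshold
```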
Observability & SLOs¶
Traces
- Span names: `http <VERB> <route>`; attributes: `tenantId`, `edition`, `routeId`, `version`, `idempotencyKey` (when present).
Metrics (per route & version)
- `http_requests_total`, `http_server_duration_seconds` (histogram), `http_4xx/5xx_total`, `rate_limit_hits_total`.
Baselines
- p95 read ≤ 200 ms, write ≤ 350 ms; success ≥ 99.9%; schema/contract lint pass 100% in CI.
Risks & Trade-offs¶
| Risk/Decision | Trade-off | Mitigation |
|---|---|---|
| URI versioning proliferation | More routes to maintain | Contract governance, codegen, and routing templates |
| Cursor pagination complexity | Harder for simple clients | Provide helper SDKs and compatibility with offset |
| Strict idempotency semantics | Server storage for keys/outcomes | TTL window; compact storage; GC jobs |
| Long deprecation windows | Slower platform evolution | Telemetry-driven removal; enterprise opt-in support |
| Dual REST/gRPC posture | Two stacks to maintain internally | Use gRPC selectively; concentrate shared interceptors |
Solution Architect Notes¶
- Treat contracts as code: PRs that change APIs must update OpenAPI/Protobuf, examples, and consumer contract tests.
- Keep tenant context explicit; never infer from payloads alone.
- Prefer event notifications (webhooks/events) to polling—then design REST for discovery & control, not data streaming.
- Document SLOs and error budgets at the operation level for critical routes (auth, onboarding, billing).
Webhooks & Extensibility¶
Purpose¶
Provide a generic, reusable webhook capability for all SaaS solutions produced by the factory. Webhooks extend the platform to tenant systems and partner integrations without tight coupling, using signed deliveries, durable retries with ageing, idempotency, replay, and a self-service management API. The design is multi-tenant, edition-aware, and can be embedded as a shared microservice in other platforms.
Architecture Overview¶
Components
- Webhook Manager (API): Tenant-scoped CRUD for subscriptions, secrets, filters, delivery policies, and replay requests.
- Webhook Dispatcher (Worker): Consumes domain events, applies routing rules, signs payloads, and performs resilient delivery with backoff and DLQ.
- Subscription Store: Tenant-scoped definitions (endpoint URL, secret, filters, version, status).
- Delivery Store: Durable log of attempts/outcomes for replay, auditing, and idempotency.
- Schema Catalog: Canonical schemas (by event type/version) exposed for consumer validation/generation.
Event Flow
- Domain services publish events to the event bus.
- Dispatcher projects them against active subscriptions (per tenant) and enqueues delivery jobs.
- Delivery attempts use HMAC-signed HTTP calls; failures retry with exponential backoff + jitter until the age limit is reached.
- After max age/attempts, jobs are moved to DLQ; operators or tenants can replay.
flowchart LR
E["Domain Events (Service Bus)"] --> D[Webhook Dispatcher]
D -->|match filters| Q[Delivery Queue]
Q -->|http POST| R[Receiver Endpoint]
R -->|2xx| DS[Delivery Store]
R -->|>=400| Q
D -->|exhausted| DLQ[Dead-Letter Queue]
M[Webhook Manager API] --> S[Subscription Store]
M --> DS
M -->|replay| Q
Delivery Contract¶
HTTP Method: POST
Content-Type: application/json (UTF-8)
Headers (required):
- `Webhook-Id`: stable UUID for the subscription (not the tenant)
- `Webhook-Event-Id`: unique delivery id (UUID)
- `Webhook-Event-Type`: canonical type (e.g., `tenant.created.v1`)
- `Webhook-Event-Time`: RFC3339 timestamp
- `Webhook-Signature`: `sha256=<hex>` (HMAC over `{timestamp}.{body}`)
- `Webhook-Timestamp`: UNIX seconds used in signature
- `Webhook-Retry-Count`: attempt number
- `Webhook-Tenant-Id`: source tenant
- `Traceparent`: W3C trace context for end-to-end correlation
Payload (envelope):
{
"id": "9f4a2d64-7c8a-4a25-9a0a-0f0e9b5dcb31",
"type": "tenant.created.v1",
"occurredAt": "2025-09-29T09:30:00Z",
"tenantId": "t-123",
"schemaVersion": "1",
"data": {
"name": "Acme",
"edition": "Standard"
}
}
Security
- Signature secret is per subscription and rotates without downtime (see Rotation).
- Receivers must validate timestamp freshness (e.g., ≤ 5 minutes drift) and recompute HMAC.
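Signing and receiver-side verification, as described by the headers above, can be sketched with the standard library. The function names are illustrative, and the 300-second tolerance mirrors the 5-minute example rather than a fixed requirement.

```python
import hashlib
import hmac


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Build the Webhook-Signature value: HMAC-SHA256 over '{timestamp}.{body}'."""
    message = str(timestamp).encode("ascii") + b"." + body
    return "sha256=" + hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, headers: dict, body: bytes, now: float, max_skew: int = 300) -> bool:
    """Check timestamp freshness, then recompute and compare the signature.

    `body` must be the raw request bytes, exactly as received.
    """
    try:
        timestamp = int(headers["Webhook-Timestamp"])
    except (KeyError, ValueError):
        return False
    if abs(now - timestamp) > max_skew:
        return False  # stale or future-dated delivery; possible replay
    expected = sign(secret, timestamp, body)
    received = headers.get("Webhook-Signature", "")
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

Verifying against the raw bytes matters: re-serializing the parsed JSON before hashing can change whitespace or key order and produce a false mismatch.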
Subscription Model (multi-tenant)¶
| Field | Description |
|---|---|
| `subscriptionId` | Stable UUID |
| `tenantId` | Owner tenant (scopes visibility & quota) |
| `endpointUrl` | HTTPS only; optional mTLS |
| `secret` | HMAC secret (active) |
| `nextSecret` | Staged secret for rotation |
| `enabled` | true/false |
| `eventTypes` | Allow-list (supports wildcards: `billing.*`) |
| `filters` | Attribute predicates (e.g., `edition in ['Enterprise']`) |
| `deliveryPolicy` | Retries/backoff, timeouts, concurrency, ageing window |
| `rateLimit` | Per-subscription throttle (req/min) |
| `version` | Schema version preference (default: latest compatible) |
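How the Dispatcher might match an event against a subscription's `eventTypes` allow-list is sketched below. Shell-style matching via `fnmatchcase` is an assumption about the wildcard semantics, and attribute `filters` are left out because their grammar is not specified here.

```python
from fnmatch import fnmatchcase


def matches(subscription: dict, event: dict) -> bool:
    """True when an enabled subscription of the same tenant allows the event type."""
    if not subscription.get("enabled", True):
        return False
    if subscription["tenantId"] != event["tenantId"]:
        return False  # subscriptions never see another tenant's events
    return any(
        fnmatchcase(event["type"], pattern)
        for pattern in subscription["eventTypes"]
    )
```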
Delivery Semantics¶
- At-least-once. Receivers must be idempotent.
- Idempotency key: `Webhook-Event-Id` is provided; consumers should persist outcomes keyed by it.
- Retries: Exponential backoff with jitter (e.g., 1s, 4s, 16s, 64s, … up to policy cap).
- Ageing: Stop retrying when age limit reached (default 24h); event moved to DLQ.
- Replay: Tenants/operators can request targeted replays by `eventId`, time window, or filter (respects age & legal constraints).
- Ordering: Not guaranteed across topics; best-effort per subscription if the receiver returns promptly. For strict order, recommend event-sourced consumer pattern or per-aggregate subscriptions.
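The retry and ageing rules can be illustrated with a schedule generator. The growth factor of 4 follows the 1s, 4s, 16s, 64s example; the one-hour cap, the "equal jitter" variant (half fixed, half random), and the parameter names are assumptions.

```python
import random


def retry_schedule(
    base: float = 1.0,
    factor: float = 4.0,
    cap: float = 3600.0,
    max_attempts: int = 10,
    max_age: float = 86400.0,
    rng=random.random,
):
    """Return the delays (seconds) before each retry of one delivery.

    The first attempt is immediate, so at most max_attempts - 1 delays are
    produced. Scheduling stops early once the next retry would exceed
    max_age, at which point the job belongs in the DLQ.
    """
    delays, elapsed = [], 0.0
    for n in range(max_attempts - 1):
        nominal = min(cap, base * factor ** n)
        # Equal jitter: keep half the nominal delay, randomize the other half.
        delay = nominal / 2 + rng() * nominal / 2
        if elapsed + delay > max_age:
            break
        delays.append(delay)
        elapsed += delay
    return delays
```

Jitter spreads retries from many failed deliveries so a recovering receiver is not hit by synchronized bursts.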
Webhook Management API (examples)¶
Create subscription
POST /v1/webhooks/subscriptions
{
"endpointUrl": "https://acme.example.com/hooks",
"eventTypes": ["tenant.created.v1", "billing.subscription.*"],
"filters": ["edition in ['Enterprise','Standard']"],
"deliveryPolicy": { "timeoutMs": 5000, "maxAttempts": 10, "maxAge": "24h" },
"rateLimit": { "rpm": 600 }
}
Rotate secret (staged)
Replay
POST /v1/webhooks/deliveries:replay
{
"subscriptionId": "…",
"from": "2025-09-28T00:00:00Z",
"to": "2025-09-28T23:59:59Z",
"filters": ["type like 'billing.%'"]
}
List deliveries
Security Model¶
- HTTPS required, optional mTLS for high-trust partners.
- HMAC-SHA256 signatures; body canonicalized as raw bytes.
- Secret Rotation: dual-secret window (active + next). `Webhook-Signature` includes key id when dual-valid to disambiguate.
- IP allow-lists (optional) and DNS pinning (resolve at delivery) to reduce SSRF risks.
- Payload hygiene: No PII unless explicitly enabled by tenant policy; sensitive fields can be tokenized.
Edition & Quota Integration¶
- Free: limited subscriptions (e.g., 1), low RPM, short retention.
- Standard: moderate RPM, standard retention, replay allowed.
- Enterprise: higher RPM/concurrency, longer retention, mTLS support, dedicated IPs.
- Custom: configurable caps per product/contract.
Quota enforcement happens in Webhook Manager and Dispatcher; gateway surfaces headers: `X-WebhookRate-Limit`, `X-WebhookRate-Remaining`, `X-WebhookRate-Reset`.
Observability¶
- Tracing: Each delivery attempt creates a span: `webhook.deliver`; attributes: `tenantId`, `subscriptionId`, `endpointHost`, `attempt`, `statusCode`, `latencyMs`.
- Metrics:
    - `webhook_deliveries_total{status=2xx/4xx/5xx}`
    - `webhook_retry_attempts_total`
    - `webhook_latency_ms` (histogram)
    - `webhook_signature_failures_total`
    - `webhook_replay_requests_total`
- Logs: Structured, with redacted URLs/secrets, and correlation to source domain event via `sourceEventId`.
Failure Modes & Playbooks¶
| Failure | Symptom | Action |
|---|---|---|
| Signature mismatch | 401/403 at receiver | Verify timestamp skew, secret, canonicalization; rotate secret if suspected leak |
| Persistent 5xx | DLQ growth | Backoff increase, notify tenant, open incident with receiver owner |
| Slow receiver | Timeouts, high latency | Apply per-subscription timeout; recommend receiver queueing |
| Endpoint drift | NXDOMAIN/SSL error | Suspend subscription; alert admin; require endpoint verification |
| Replay storm | Burst load on receiver | Rate limit replays; allow windowed replay; communicate schedule |
Consumer Guidance (best practices)¶
- Verify signatures and timestamp freshness before processing.
- Idempotent handlers using `Webhook-Event-Id`; store processed IDs for TTL ≥ ageing window.
- Respond fast (≤ 2s) and offload heavy work to your own queue.
- Return 2xx only when processing is safely enqueued; otherwise 4xx/5xx to trigger retry.
- Keep endpoints stable; use staged rotation flows for secrets and URL changes.
Extensibility & Reusability¶
- The service is product-agnostic: event routing is driven by type, filters, and tenant policy, not by hardcoded domains.
- Supports multiple schema versions per event type; consumers pick via subscription preference.
- Pluggable signing providers (HMAC default; optional asymmetric signing for advanced partners).
- Integrates easily into other platforms: expose Manager API, Dispatcher wired to their bus, and reuse Schema Catalog.
Solution Architect Notes¶
- Keep the Dispatcher stateless; rely on Delivery Store for idempotency and state.
- Treat webhooks as untrusted egress: validate destinations, cap concurrency, and isolate network egress where possible.
- Always surface tenant and subscription context in traces and logs to debug quickly.
- Encourage partners to adopt schema validation using the exposed Schema Catalog, and provide SDKs where adoption is strategic.
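Destination validation for untrusted egress could start from a check like the one below. It is a sketch under assumptions: DNS resolution happens elsewhere and the resolved addresses are passed in (which also fits the resolve-at-delivery note in the Security Model), and "not globally routable" is used as the rejection rule.

```python
import ipaddress
from urllib.parse import urlsplit


def is_safe_destination(endpoint_url: str, resolved_ips) -> bool:
    """Accept only HTTPS endpoints whose resolved addresses are all public."""
    parts = urlsplit(endpoint_url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    if not resolved_ips:
        return False  # nothing resolved, nothing to deliver to
    for ip in resolved_ips:
        try:
            address = ipaddress.ip_address(ip)
        except ValueError:
            return False
        # Rejects loopback, RFC 1918, link-local (incl. metadata endpoints), etc.
        if not address.is_global:
            return False
    return True
```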
Admin Console & Tenant Portal¶
Purpose¶
Offer a reusable, multi-tenant UI that any generated SaaS product can embed or extend. The Admin Console and Tenant Portal provide self-service administration, usage/metering visibility, feature toggle management, audit exploration, and embedded observability. They enforce RBAC/ABAC, respect edition entitlements, and present a consistent UX across products.
Audience & UX Principles¶
- Tenant Administrator: configure identity/SSO, manage users/roles, set flags/overrides, view invoices and audit logs.
- Billing & Finance Operator: review subscriptions, invoices, usage reports.
- Support Agent: read-only diagnostics, tenant health, and audit traces (PII-redacted).
- End User: minimal profile/settings; app-specific modules only when permitted.
- Platform Operator (SRE): operator console views (environment health, tenant routing, migration status).
UX principles: least privilege, edition-aware UI (hide/disable), zero-trust by default (no local admin bypass), explainability (why a control is disabled), fast path to observability (one click to traces/logs filtered by tenant).
UI Architecture & Embedding¶
Composition approach
- Shell + Modules architecture (microfrontends optional). Shared shell provides auth, navigation, theming, telemetry, and locale.
- Module types:
    - Core Modules (always present): Tenant Management, Users & Roles, Usage & Billing, Feature Flags & Overrides, Audit Explorer, Webhooks, Security & SSO, Observability.
    - Product Modules (pluggable): domain-specific pages contributed by each bounded context (e.g., Projects, Catalog).
Embedding patterns
- Standalone (hosted by the platform) or Embedded (iframe/Web Component) inside product portals.
- RBAC-aware Navigation: menu items and routes emitted by module manifests and filtered by role/entitlements.
flowchart LR
Shell[Console Shell] --> Auth[OIDC Auth]
Shell --> Nav[RBAC Navigation]
Shell --> Telemetry[OTel Web SDK]
Shell --> Core[Core Modules]
Shell --> Product[Product Modules]
Core --> Tenants[Tenant Management]
Core --> Users[Users & Roles]
Core --> Flags[Feature Flags & Overrides]
Core --> Billing[Usage & Billing]
Core --> Audit[Audit Explorer]
Core --> Hooks[Webhooks]
Core --> SSO[Security & SSO]
Core --> Obs[Embedded Observability]
RBAC Mapping (illustrative)¶
| Area / Page | tenant_admin | billing_admin | support_agent | member | operator |
|---|---|---|---|---|---|
| Tenant Profile & Residency | ✓ | – | ✓ (read) | – | ✓ |
| Users & Roles | ✓ | – | – | – | ✓ (read) |
| Feature Flags & Overrides | ✓ | – | – | – | ✓ (read) |
| Usage & Billing | ✓ (read) | ✓ | ✓ (read) | – | ✓ (read) |
| Audit Explorer | ✓ (PII-masked) | – | ✓ (PII-masked) | – | ✓ |
| Webhooks | ✓ | – | – | – | ✓ (read) |
| Security & SSO | ✓ | – | – | – | ✓ (read) |
| Embedded Observability | ✓ | – | ✓ (read) | – | ✓ |
| Product Modules (domain) | ✓ | – | read if granted | read if granted | ✓ (read) |
ABAC examples: region constraint for PII views (tenant.region == user.region), data-class gates for audit payloads.
Key Modules & Responsibilities¶
Tenant Management
- View/update tenant profile, residency, isolation level (with guardrails).
- Lifecycle actions: suspend/resume, export, scheduled deletion workflow.
- Connection/partition info (read-only), migration status.
Users & Roles
- Invite users, assign roles, JIT provisioning visibility (SCIM for Enterprise).
- Session overview, forced logout, MFA settings (when supported).
Feature Flags & Overrides
- Browse flags by context; evaluate flag in a what-if panel (tenant/user/edition).
- Create tenant overrides with audit reason; kill-switches surfaced prominently.
- Safe rollout helpers (percentage/targeting where applicable).
Usage & Billing
- Usage charts (requests, storage, compute) with cursor pagination for raw events.
- Subscription plan, invoice history, payment method (link-out to provider portal via ACL).
- Quotas, rate limits; “close to limit” alerts and upgrade CTAs (edition-aware).
Audit Explorer
- Search by actor, action, resource, traceId, and date window.
- PII-aware rendering (tokenized values; reveal requires extra privilege).
- Export signed bundles (JSON/CSV) with chain of custody metadata.
Webhooks
- CRUD subscriptions, rotate secrets (dual-key), replay deliveries (windowed).
- Delivery attempts table, filter by status code, trace to source event.
Security & SSO
- Configure SSO (OIDC/SAML), SCIM endpoints, conditional access toggles (where provided via federation).
- API keys/PATs (if allowed), with strict scopes and expirations.
Embedded Observability
- Embed pre-filtered dashboards (tenantId, edition).
- Trace search (“Follow a request”) and log tail with PII redaction.
- Error budget widgets aligned with tenant edition.
Data & Integration Flows¶
sequenceDiagram
participant UI as Admin Console
participant GW as API Gateway
participant TEN as Tenant Service
participant CONF as Config/Flags
participant BILL as Billing
participant AUD as Audit
participant OBS as Observability
UI->>GW: GET /v1/tenants/{id}
GW->>TEN: AuthZ + fetch tenant
TEN-->>GW: tenant profile (+residency, isolation)
GW-->>UI: profile
UI->>GW: GET /v1/flags?tenantId=...
GW->>CONF: list flags + overrides
CONF-->>GW: flags
GW-->>UI: flags
UI->>GW: GET /v1/usage?window=30d
GW->>BILL: aggregate usage
BILL-->>GW: usage series
GW-->>UI: charts
UI->>GW: GET /v1/audit?traceId=...
GW->>AUD: query
AUD-->>GW: results (PII-masked)
GW-->>UI: render
UI->>OBS: iframe embed with signed view token (tenant filter)
Tenancy, Security & Privacy¶
- Tenant context mandatory: every UI call carries `X-Tenant-Id`; ambiguous → blocked at edge.
- Edition awareness: modules/features hidden or disabled when not entitled; tooltips explain required edition.
- PII hygiene: redaction by default; View Sensitive Data requires elevated role + just-in-time approval with audit.
- Audit everything: admin actions, flag changes, replay requests, SSO changes—all emit audit events with actor and reason.
- Session security: short-lived access tokens with silent refresh; session fixation and clickjacking protections; CSP locked to allowed origins.
Observability & SLOs¶
- Front-end telemetry: OTel Web SDK emits `ui.action` and `ui.view` spans; correlates with backend via `traceparent`.
- User-centric SLOs
- Console availability ≥ 99.9%
- p95 nav-to-data render ≤ 500 ms (cached) / 1500 ms (cold)
- Feature toggle apply-to-effective latency ≤ 10 s
- Dashboards: per-tenant/module latency, error rates, UI Web Vitals (LCP/CLS), and admin risky-action counters.
Accessibility, Internationalization & Theming¶
- Accessibility: WCAG 2.1 AA baseline; keyboard-only navigation; screen-reader labels on critical controls.
- i18n: message catalogs; locale from user profile; number/date formatting per locale.
- Theming: light/dark + tenant branding (logo/accent); ensure contrast ratios remain compliant.
Performance & Resilience¶
- Code splitting per module; prefetch on likely nav paths.
- Client-side caching with ETags; optimistic updates for non-critical settings.
- Graceful degradation: if observability embedding is down, show cached summaries with a banner.
Extension Points¶
- Module Manifest: JSON that declares routes, required scopes, and nav entries; the shell loads and gates them.
- Action Hooks: before/after save for flags, webhooks, and SSO to inject product-specific validation.
- Deep Links: shareable URLs encoding tenant context and filters (`/audit?tenant=t-123&trace=...`).
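The Module Manifest gating can be sketched as a pure function the shell runs when building navigation. The manifest shape (`module`, `routes`, `requiredScopes`, `label`, `path`) is hypothetical, since the actual schema is not defined in this section. The result only shapes the UI; the API still enforces the same scopes.

```python
def visible_nav(manifests, granted_scopes, entitled_modules):
    """Return nav entries the current user may see.

    A route is shown only when its module is entitled for the tenant's
    edition and the user holds every scope the route requires.
    """
    granted = set(granted_scopes)
    nav = []
    for manifest in manifests:
        if manifest["module"] not in entitled_modules:
            continue
        for route in manifest["routes"]:
            if set(route["requiredScopes"]) <= granted:
                nav.append({"label": route["label"], "path": route["path"]})
    return nav
```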
Solution Architect Notes¶
- Keep policy enforcement at API boundaries; the UI only reflects decisions, it doesn’t make them.
- Treat Feature Flags as UX surface for controlled rollout; every toggle change must be observable and reversible.
- Make embedded observability tenant-safe: scoped tokens, short TTL, and server-side filtering.
Data Lifecycle, Privacy & Compliance¶
Purpose¶
Define how data is classified, minimized, protected, retained, and erased across the platform. Establish privacy-by-design controls, tenant-aware retention/TTL, encryption standards, data-subject request (DSR) flows, and auditability. Provide compliance overlays (e.g., GDPR, HIPAA, SOC 2) that products can enable per tenant/edition or per industry pack—without code changes.
Data Classification¶
| Class | Examples | Handling Rules | Storage Defaults | Access |
|---|---|---|---|---|
| Public | Docs, status messages | Cacheable; no PII | Blob (public-read via signed URLs if needed) | Unrestricted |
| Internal | Feature metadata, templates | No external exposure; redact in logs | SQL/Blob | Staff with role gating |
| Confidential | Tenant config, usage metrics | TLS in transit, TDE at rest; masked in logs | SQL + Redis cache (TTL) | Tenant-scoped; RBAC/ABAC |
| PII | Name, email, IP, user identifiers | Minimize, pseudonymize where possible; strict audit | SQL (column-level protection) | Least privilege, purpose-bound |
| Sensitive PII | SSN, passport, health data | Tokenize or encrypt at field level; strong purpose limits | SQL with field-level encryption | Restricted roles; extra approvals |
| Secrets | API keys, webhook secrets | Never in logs; KV only; rotate ≤ 90d | Key Vault, no app DB | Break-glass only |
| Audit/Legal | Audit events, access logs | Append-only; WORM retention; exportable | SQL/Blob (immutability options) | Operators/legal with controls |
Principles: collect the minimum, retain the minimum, process for stated purposes only, never log raw PII or secrets.
Data Minimization & Privacy Patterns¶
- Pseudonymization: replace direct identifiers with stable tokens; keep mapping in a protected table or vault.
- Selective collection: edition/pack toggles gate collection of optional attributes (e.g., precise location).
- PII scrubbing in telemetry: centralized log processors reject events containing PII markers; drop or tokenize.
- Scoped queries: APIs expose aggregated/filtered views by default; raw exports require elevated role + audit reason.
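Pseudonymization and telemetry scrubbing might be combined as in the sketch below. It uses a keyed hash (HMAC-SHA256) to derive stable tokens, which is a substitute for the mapping-table approach and is one-way: where reversal is needed, the token-to-value mapping would still have to live in a protected table or vault. The function names, the `tok_` prefix, and the email pattern are illustrative assumptions.

```python
import hashlib
import hmac
import re

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(tenant_id: str, value: str, key: bytes) -> str:
    """Map an identifier to a stable, tenant-scoped token (not reversible)."""
    normalized = value.strip().lower()
    message = f"{tenant_id}:{normalized}".encode("utf-8")
    return "tok_" + hmac.new(key, message, hashlib.sha256).hexdigest()[:32]


def scrub_emails(log_line: str, tenant_id: str, key: bytes) -> str:
    """Replace every email address in a log line with its token."""
    return EMAIL_PATTERN.sub(
        lambda match: pseudonymize(tenant_id, match.group(0), key),
        log_line,
    )
```

Scoping the token by tenant keeps the same person from being correlated across tenants through shared log sinks.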
Retention & TTL Policies (defaults)¶
| Dataset | Default TTL/Retention | Disposal Method | Notes |
|---|---|---|---|
| Tenant profile/config | Active + 90 days after deletion | Hard-delete after grace | Soft-delete window for recovery |
| Usage raw events | 90 days | Roll-up + delete raw | Keep aggregates 13 months |
| Audit log | 7 years (configurable) | WORM archival then purge | Legal hold overrides TTL |
| Webhook deliveries | 30 days | Purge | Enterprise can extend (e.g., 90 days) |
| Job histories | 30 days | Purge | Metrics retained separately |
| Backups | 35 days rolling | Expire snapshots | Encrypted backups only |
Retention can be overridden by tenant contract (e.g., healthcare pack) and by regulatory overlays. All overrides are traceable.
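A purge-date calculation that honours overrides and legal hold is sketched here. The dataset keys are hypothetical names for rows of the table above, and only the datasets with a plain day-count TTL are included.

```python
from datetime import datetime, timedelta, timezone

# Defaults taken from the retention table; the keys are illustrative names.
DEFAULT_RETENTION_DAYS = {
    "usage_raw": 90,
    "webhook_deliveries": 30,
    "job_histories": 30,
    "backups": 35,
}


def purge_after(dataset: str, created_at: datetime, overrides=None, legal_hold: bool = False):
    """Return the instant after which a record may be purged, or None if held."""
    if legal_hold:
        return None  # legal hold overrides any TTL
    days = (overrides or {}).get(dataset, DEFAULT_RETENTION_DAYS[dataset])
    return created_at + timedelta(days=days)
```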
Encryption & Key Management¶
- In transit: TLS 1.3 everywhere; mTLS inside the cluster/mesh.
- At rest: TDE for SQL and service-side encryption for Blob; CMEK (customer-managed) optional per tenant/region.
- Field-level: Sensitive PII encrypted with per-field keys or envelope encryption; keys in Key Vault.
- Rotation: keys/secrets rotated ≤ 90d; certs tracked with expiry alerts; dual-key windows during rotation.
- No plaintext secrets in app configs or logs; secrets read at runtime via workload identity.
DSR (Data Subject Requests) & Erasure Flows¶
Supported requests: access/export, rectification, deletion (erasure), restriction (hold), portability.
sequenceDiagram
participant R as Requestor (Tenant Admin/User)
participant AC as Admin Console
participant GW as Gateway
participant DSR as DSR Orchestrator
participant SVC as Domain Services
participant AUD as Audit
R->>AC: Submit DSR (access/erasure)
AC->>GW: POST /v1/dsr
GW->>DSR: create_case(tenantId, subjectId, type)
DSR->>SVC: fanout(workflow tasks per context)
SVC-->>DSR: status/proofs (export bundle, erasure markers)
DSR-->>AC: progress events
DSR->>AUD: record actions (who/what/when/why)
AC-->>R: Completion & evidence package
Erasure rules
- Soft-delete first, then scheduled hard-delete after grace window unless legal hold.
- Propagate to caches, search indexes, and derived stores; rebuild affected aggregates.
- Audit trail remains (immutable), but direct identifiers are tokenized or replaced with non-reversible references.
Compliance Overlays (enable per tenant/product)¶
| Overlay | Scope | Key Controls |
|---|---|---|
| GDPR | EEA/EU residents | DSR flows, consent records, data minimization, cross-border transfer safeguards |
| HIPAA | PHI in US healthcare | BAAs, audit controls, access logs, encryption, breach notifications |
| SOC 2 | Trust Services Criteria | Change management, access control, incident response, monitoring |
| PCI DSS | Payments | No PAN storage (tokenize instead); provider ACL; quarterly scans |
| Data Residency | Contractual | Region-locked storage and processing, DR within region pair only |
Overlays are policy bundles (configs + CI checks + runtime gates) and can be attached by edition/industry pack.
Policy Enforcement Points¶
| Layer | Enforcement |
|---|---|
| Edge/Gateway | Consent headers, tenant residency checks, geo-fencing, edition policy hints |
| Service APIs | RBAC/ABAC checks for data classes; PII-safe serializers; explicit export endpoints |
| Domain Logic | Purpose-bound processing; blockers for collecting unnecessary attributes |
| Repositories | RLS/tenant guards; column protection; query allow-lists |
| Pipelines/CI | Static scans for PII in code/tests; schema linters; policy-as-code |
| Observability | PII detectors in logs; trace attributes marking data class; audit correlations |
Auditability & Evidence¶
- Immutable audit: append-only events for access, changes, exports, erasures, consent updates (actor, reason, traceId).
- Evidence bundles: machine-readable (JSON) + human-readable (PDF/CSV) export for DSR/compliance audits.
- Chain of custody: signed exports with hash manifest; storage with retention & legal hold.
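A hash manifest for an evidence bundle could be produced as sketched below. HMAC-SHA256 stands in for the signature purely to keep the example within the standard library; a real chain of custody would more likely use an asymmetric signature so verifiers do not need the signing key. Function and field names are hypothetical.

```python
import hashlib
import hmac
import json


def build_manifest(files: dict, signing_key: bytes) -> dict:
    """Hash every file (name -> bytes) and sign the canonical hash list."""
    entries = {
        name: hashlib.sha256(data).hexdigest()
        for name, data in sorted(files.items())
    }
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(signing_key, canonical, hashlib.sha256).hexdigest()
    return {"algorithm": "sha256", "files": entries, "signature": signature}


def verify_manifest(files: dict, manifest: dict, signing_key: bytes) -> bool:
    """Recompute hashes and signature; any altered file or key fails the check."""
    expected = build_manifest(files, signing_key)
    return expected["files"] == manifest["files"] and hmac.compare_digest(
        expected["signature"], manifest["signature"]
    )
```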
Consent & Transparency¶
- Consent records: versioned, tenant-scoped; changes audited with reason.
- Notices: data use and tracking disclosures in product UIs; link to privacy policy versions.
- Telemetry consent: opt-in/opt-out flags where legally required; defaults by region.
Observability & SLOs (privacy-centric)¶
| Signal | Target | Notes |
|---|---|---|
| PII-in-logs detections | 0 | Fail CI and alert in prod |
| DSR completion time (access/erase) | ≤ 30 days (legal max), target ≤ 7 days | Per tenant |
| Key/cert rotation on time | 100% within policy | Alerts at 15/7/3 days pre-expiry |
| Residency violations | 0 | Block at edge; SEV-1 if detected |
| Audit write success | 99.99% | Buffer & backpressure during spikes |
Failure Modes & Mitigations¶
- Leak of PII in logs → Real-time detector trips; block sink; open incident; purge/redact; root-cause and tests added.
- Erasure partial → Re-run workflow on failed contexts; maintain compensation tasks for reindexing/caches.
- Residency drift (cross-region write) → Gate at edge; quarantine records; migrate back and audit.
- Key rotation failure → Fall back to previous valid key in dual window; open incident; rotate manually.
Solution Architect Notes¶
- Treat classification tags as part of the domain schema; they travel with events and telemetry.
- Default all products to minimal collection; require explicit ADR for any expansion in data scope.
- Keep privacy enforcement in code (not just policy docs): dedicated serializers, DTOs without PII by default, and safe logging wrappers.
- Build DSR orchestration once and reuse across all contexts; success depends on consistent IDs and comprehensive lineage of derived data.
Quality Strategy: Testing & Verification¶
Purpose¶
Ensure that every SaaS solution generated by the factory meets a consistent quality bar before release. Define a layered testing strategy (unit, integration, contract, end-to-end, testing-in-prod), establish standards for fixtures and synthetic checks, and enforce quality gates in CI/CD pipelines. Quality must be measurable, automated, and edition/tenant-aware.
Testing Layers¶
Unit Tests
- Scope: individual classes, functions, helpers.
- Isolation: no I/O; mocks/stubs for dependencies.
- Coverage: critical business logic, edge cases, input validation.
- Tools: xUnit/MSTest + mocking libraries.
Integration Tests
- Scope: service + real dependencies (DB, message bus, cache).
- Run in ephemeral test environment or with containerized dependencies (Docker Compose).
- Verify data persistence, event publishing, cache consistency.
- CI: run after unit tests, on every PR.
Contract Tests
- Consumer-Driven Contracts (CDC) for REST/gRPC/event contracts.
- Validate request/response schema, headers (`tenantId`, `traceId`, `edition`), and error envelopes.
- Ensure backward compatibility when evolving APIs.
- Tools: Pact or custom CDC harness for events.
End-to-End (E2E) Tests
- Full user flows (sign-up → tenant onboarding → subscription → flag evaluation).
- Run against preview environments per PR and nightly in staging.
- Browser/UI tests for tenant portal & admin console (Playwright/Selenium).
- Must verify multi-tenancy invariants (no cross-tenant leaks).
Testing in Production (TiP)
- Synthetic checks: simulate user sign-in, onboarding, and critical APIs continuously from multiple regions.
- Canary releases verified by synthetic flows before general rollout.
- Guardrails: PII-free test tenants with synthetic data only.
Synthetic Checks & Monitoring¶
- Synthetic tenants: seeded with representative data (usage, billing, flags, notifications).
- Continuous checks:
- Auth & token issuance
- Tenant onboarding latency
- Flag evaluation correctness
- Billing usage metering
- Webhook delivery (to test sink)
- Alerting: failures trigger incident; linked to error budgets.
Test Data & Fixtures¶
- Fixture library with tenant archetypes (SMB, Mid-market, Enterprise).
- Default seeds: product editions (Free/Standard/Enterprise), feature flags, subscriptions.
- Synthetic PII: generated datasets for GDPR-safe testing (never real customer data).
- Replayable scenarios: standard JSON/SQL seeds for onboarding, billing events, AI orchestration.
- Secrets: fake keys/tokens for tests; real secrets only in prod via Key Vault.
Quality Gates in CI/CD¶
- Unit test coverage: ≥ 80% for critical domains, reported in PR.
- Integration/contract tests: must pass on PR merge; CDC verified against consumers.
- Static analysis: code quality, security scans (PII detection, secret scanning).
- E2E smoke tests: run on preview env per PR; block merge if failures.
- Performance regression checks: latency budgets per context enforced in load tests.
- Observability smoke tests: ensure traces/logs/metrics present in preview env.
Observability of Quality¶
- Dashboards show test pass rate trends, time-to-fix broken builds, and coverage by bounded context.
- Trace synthetic flows with special tenant IDs (`synthetic-*`) for easy filtering.
- Alerts if coverage drops below baseline or if repeated flaky tests exceed threshold.
Risks & Mitigations¶
| Risk | Impact | Mitigation |
|---|---|---|
| Flaky E2E tests | Erode trust, block CI | Quarantine flaky tests; retry-once policy; root-cause analysis required |
| Synthetic checks masking real issues | Blind spots | Expand test tenant archetypes; rotate scenarios regularly |
| Contract drift | Breaking downstream consumers | Enforce CDC tests in CI; require ADR for breaking changes |
| PII in test data | Compliance risk | Always use synthetic/fake data; automated scanners in CI |
Solution Architect Notes¶
- Build one reusable test harness per layer (unit, integration, CDC, E2E) and apply it across all generated services.
- Treat synthetic tenants as production monitoring assets: they validate real infrastructure and SLOs continuously.
- Quality is enforced by automation, not heroics; broken builds must be fixed before merge.
- Integrate quality dashboards into the Admin Console for operator visibility (test status, SLO compliance, error budgets).
CI/CD & Release Strategy¶
Purpose¶
Define a standardized, reusable CI/CD blueprint for all SaaS products generated by the factory. The strategy ensures consistent build → test → promote → release flows with policy gates, preview environments, artifact signing, and progressive delivery patterns. It provides ready-to-use pipeline templates in Azure DevOps, extendable to GitHub Actions or other CI/CD platforms.
Branching & Source Strategy¶
- Main Branch (`main`): Always releasable; mirrors production environment.
- Feature Branches (`feature/*`): Used for development; PRs required for merge; auto-preview environment deployment.
- Release Branches (`release/*`): Optional; for staged rollouts, LTS or regulated products.
- Hotfix Branches (`hotfix/*`): Patch critical issues in production; merge back to `main` and `develop`.
Policy: All merges to main require passing quality gates (tests, security scans, coverage, lint).
Pipeline Topology¶
Stages
- Build
  - Restore dependencies, compile, lint, unit test, coverage report.
  - Build container images (multi-arch if needed).
  - Sign artifacts and produce SBOM (Software Bill of Materials).
- Test
  - Integration + contract tests against containerized dependencies (SQL, Service Bus, Redis).
  - Security scans (SAST, dependency CVE checks, IaC validation).
  - Quality gates: block on failing tests, PII detectors, or critical vulnerabilities.
- Package & Publish
  - Push signed images to Azure Container Registry (ACR).
  - Publish Helm charts, IaC bundles (Bicep/Pulumi), and API contracts (OpenAPI/Protobuf).
- Deploy (Preview/Stage/Prod)
  - Deploy via Helm to ACA/AKS depending on target.
  - Preview environments per PR (auto-provisioned; ephemeral).
  - Stage mirrors production topology for final validation.
- Release
  - Progressive delivery patterns (blue/green, canary).
  - Feature flag toggles for dark launches.
  - Automatic rollback if error budget breached.
Artifact Flow¶
flowchart LR
Code[Source Repo] -->|PR| Build
Build --> Test
Test --> Package
Package --> ACR[(Azure Container Registry)]
Package --> Helm[Helm Charts Repo]
ACR --> StageEnv[Staging Env]
Helm --> StageEnv
StageEnv --> ProdEnv[Production Env]
ProdEnv --> Users[Tenants]
- Immutable artifacts: all builds produce signed, versioned images (`app:1.2.3+commit.sha`).
- SBOM & provenance: stored in artifact registry; verifiable against policy gates.
Preview Environments¶
- Every PR spins up an ephemeral namespace with isolated ingress (`pr-123.factory.example.com`).
- Uses feature branch images + seeded synthetic tenants.
- Destroyed automatically on PR close/merge.
- Includes OTel telemetry, synthetic flows, and seeded test data.
Policy Gates¶
Enforced in pipelines and at merge time
- Linting & Static Analysis: must pass with no critical issues.
- Unit/Integration Coverage: ≥ 80% for critical services.
- SAST & Dependency Scans: block on CVEs above CVSS 7.
- Contract Tests: must pass against consumer suites.
- Compliance Checks: PII/secret scanning, license compliance.
- Artifact Signing: unsigned artifacts rejected downstream.
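The gates above can be read as one pure function over a build report. The sketch below is illustrative only: `BuildReport` and its field names are hypothetical, and the thresholds are copied from the list (80% coverage, CVSS 7). A real pipeline would collect these values from its own tooling.

```python
from dataclasses import dataclass, field

# Thresholds taken from the gate list; names are illustrative assumptions.
MIN_COVERAGE = 0.80
MAX_CVSS = 7.0


@dataclass
class BuildReport:
    coverage: float                       # coverage of a critical service, 0..1
    cvss_scores: list[float] = field(default_factory=list)
    lint_critical: int = 0                # count of critical lint/static-analysis issues
    contract_tests_passed: bool = True
    artifact_signed: bool = True


def evaluate_gates(report: BuildReport) -> list[str]:
    """Return gate violations; an empty list means the merge may proceed."""
    violations = []
    if report.lint_critical > 0:
        violations.append("lint: critical issues present")
    if report.coverage < MIN_COVERAGE:
        violations.append(f"coverage: {report.coverage:.0%} < {MIN_COVERAGE:.0%}")
    if any(score > MAX_CVSS for score in report.cvss_scores):
        violations.append("security: CVE above CVSS 7")
    if not report.contract_tests_passed:
        violations.append("contracts: consumer suite failed")
    if not report.artifact_signed:
        violations.append("signing: unsigned artifact")
    return violations
```

Returning every violation instead of failing on the first keeps the PR feedback complete in a single run.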
Progressive Release Patterns¶
- Blue/Green Deploy new version alongside old, switch traffic via Gateway route. Rollback = switch back.
- Canary Route fraction of traffic (e.g., 5%/10%/25%) to new version; auto-promote or rollback based on SLOs.
- Dark Launch Release behind feature flag; only test tenants see feature until flag flipped.
- Ring-Based Rollout by tenant tier (internal → Free tenants → Standard → Enterprise).
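As a rough illustration of the canary pattern, the decision at each step can be reduced to a small function. The 5/10/25 steps come from the text; the final 100% step, the function name, and the single error-rate SLO input are assumptions made for this sketch.

```python
# 5/10/25 are from the text; 100 is assumed as the final promotion step.
CANARY_STEPS = [5, 10, 25, 100]


def next_canary_action(current_percent: int, error_rate: float,
                       slo_error_rate: float) -> tuple[str, int]:
    """Decide the next canary step.

    Returns ("rollback", 0) when the observed error rate breaches the SLO,
    ("promote", next_percent) while steps remain, and ("done", 100) at full traffic.
    """
    if error_rate > slo_error_rate:
        return ("rollback", 0)
    if current_percent >= CANARY_STEPS[-1]:
        return ("done", 100)
    next_percent = next(p for p in CANARY_STEPS if p > current_percent)
    return ("promote", next_percent)
```

A ring-based rollout could reuse the same shape by replacing the traffic percentages with an ordered list of tenant tiers.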
Environment Topology¶
- Dev: local + CI ephemeral test containers.
- Preview: per PR; ephemeral; seeded with synthetic tenants.
- Stage: mirrors Prod, stable integration tests; nightly E2E.
- Prod: multi-region active/active for critical services, edition-aware quotas.
Observability in CI/CD¶
- Pipelines emit telemetry (duration, pass/fail, flaky tests).
- Deployment traces tagged with commit SHA, PR ID, and tenant context for preview.
- Release dashboards: SLO compliance, error budgets, rollout progress.
Risks & Mitigations¶
| Risk | Impact | Mitigation |
|---|---|---|
| Long-lived preview envs | Cost, stale tenants | TTL auto-cleanup, manual extend flag |
| Flaky tests blocking merges | Slows releases | Retry-once policy, flaky test quarantine |
| Progressive release misconfig | Partial outage | Automated rollback, rollback runbooks |
| Supply chain attack | Compromise of dependencies | Artifact signing + SBOM + dependency scanning |
Solution Architect Notes¶
- Treat pipelines as product templates: every generated SaaS inherits the same CI/CD skeleton.
- Keep artifact promotion immutable: rebuilds are not allowed between Stage and Prod.
- Bake error budgets into progressive rollout logic (auto-stop rollout if violated).
- Preview environments are not optional—every PR must deploy to ensure shift-left validation.
Infrastructure & Platform Runtime¶
Purpose¶
Define a portable, Azure-first runtime topology for the SaaS Factory, covering compute (AKS/ACA), messaging (Service Bus), data (Azure SQL + optional Mongo/Redis/Blob), identity (Managed Identities, Key Vault), networking (private endpoints, WAF), and observability (OTel → Prometheus/Grafana/Logs). This baseline is multi-tenant aware, enforces zero trust, and is designed to be materialized via IaC (Pulumi/Bicep).
Resource Topology (high-level)¶
flowchart TB
subgraph Edge
AFD[Azure Front Door + WAF]
end
subgraph App
AKS[(AKS)]:::core
ACA[(Azure Container Apps)]:::core
GATE["API Gateway (YARP)"]
SVC[Core Microservices]
JOBS[Jobs/KEDA/Hangfire]
end
subgraph Data
SQL[(Azure SQL PaaS)]
MONGO[(MongoDB Atlas/AKS opt)]
REDIS[(Azure Cache for Redis)]
SB[(Azure Service Bus)]
BLOB[(Azure Blob Storage)]
KV[(Azure Key Vault)]
end
subgraph Observability
OTL[OTel Collectors]
LOGS[Log Analytics / Storage]
GRAF[Prometheus/Grafana]
end
AFD -->|TLS 1.3| GATE
GATE -->|mTLS| SVC
SVC -->|AMQP| SB
SVC -->|ADO.NET| SQL
SVC -->|Redis| REDIS
SVC -->|Blob SDK| BLOB
SVC -.-> MONGO
JOBS --> SB
AKS --> OTL
ACA --> OTL
OTL --> LOGS
OTL --> GRAF
SVC --> KV
GATE --> KV
classDef core fill:#0b6,stroke:#094,color:#fff;
Compute Runtime¶
AKS (Kubernetes)
- Best for high scale, service mesh (mTLS), advanced network policies, sidecars (e.g., OTel, authz), and custom autoscaling.
- Use managed identities for pods, CNI with Calico (or Azure CNI), and PodSecurity standards (baseline/restricted).
ACA (Azure Container Apps)
- Simpler ops; KEDA-native autoscaling on HTTP request load, CPU, or Service Bus lag.
- Ideal for jobs, event processors, and smaller footprints.
- Can run alongside AKS (hybrid) for jobs plane and burst capacity.
Baseline rule: same container images and contracts run on both; runtime choice is an operational concern per product/environment.
Networking & Identity¶
- Perimeter: Azure Front Door + WAF terminates public TLS; only the Gateway is internet-exposed.
- Private Networking: services/data in private subnets; deny by default.
- Private Endpoints: Azure SQL, Service Bus, Storage, and Key Vault via Private Link; no public exposure.
- mTLS Everywhere: Gateway↔Services and Service↔Service; cert rotation automated.
- Workload Identity: AKS/ACA workloads use Managed Identities; no static secrets in pods.
- Egress Control: user-defined routes + firewall for outbound; webhook egress allow-lists.
Data Layer¶
- Azure SQL (Primary Store)
  - Pooled tenancy by default (RLS or repository guards).
  - Geo-replication, automated backups, PITR; TDE + optional CMEK.
  - Per-context schemas with strict ownership; migrations gated in CI.
- MongoDB (Optional)
  - For document-heavy payloads (e.g., notification templates).
  - Private peering or self-hosted on AKS with operator; encryption at rest.
- Redis
  - Low-latency flag evaluation, token/claims cache, idempotency windows.
  - TLS-only; tenant-prefixed keys; eviction policies controlled per module.
- Blob Storage
  - Artifacts, exports, evidence bundles, AI outputs.
  - Container per context; immutability policies for audit exports.
Messaging Layer¶
- Azure Service Bus
  - Topics/queues per bounded context; DLQs enabled; MaxDeliveryCount tuned per workload.
  - MassTransit implements outbox/inbox, retries with jitter, and saga orchestration.
  - Namespace partitioning (prod vs non-prod; per region).
  - Private Endpoints; role assignments to workloads via managed identity.
Observability Stack¶
- OpenTelemetry Collectors as DaemonSet/sidecar (AKS) or as ACA app for ingestion.
- Metrics to Prometheus (managed or self-hosted), dashboards in Grafana.
- Logs to Log Analytics + long-term archive to Blob.
- Trace context propagated via W3C Trace Context (`traceparent`), with mandatory attributes (`tenantId`, `edition`).
- Alerting: SLO-based alerts wired to incident channels; error budgets shown in dashboards and Admin Console.
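For readers unfamiliar with the `traceparent` header, the following sketch builds and parses version `00` of the W3C format (`version-traceid-parentid-flags`) and checks the two mandatory attributes named above. The helper names are hypothetical. In practice an OpenTelemetry SDK would normally handle propagation, so this only illustrates the wire format.

```python
import re
import secrets

# version 00: 32 hex chars of trace-id, 16 of parent-id, 2 of flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")
REQUIRED_ATTRIBUTES = ("tenantId", "edition")  # mandated by the text above


def new_traceparent(sampled: bool = True) -> str:
    """Create a fresh traceparent value for a root span."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str):
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are invalid per the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id,
            "sampled": bool(int(flags, 16) & 1)}


def missing_attributes(span_attributes: dict) -> list[str]:
    """List mandatory attributes that a span failed to set."""
    return [name for name in REQUIRED_ATTRIBUTES if name not in span_attributes]
```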
Security & Compliance Baselines¶
- Zero Trust: no implicit trust; all traffic authenticated and authorized.
- Key Vault for secrets, keys, and certs; automated rotation; Just-In-Time access for operators.
- Container security: image signing/SBOM; policy gates deny unsigned or CVE-violating images.
- Data residency: per-tenant region enforced at edge; DR within paired region only.
- Audit: append-only audit store; WORM policies for exports.
Environments & Promotion¶
- Dev/Preview: ephemeral namespaces per PR; seeded synthetic tenants.
- Stage: mirrors prod; shadow/canary experiments; contract & load testing.
- Prod: multi-zone AKS/ACA; optional active-active across regions for critical services.
- Promotion: immutable artifacts from ACR; Helm/OPA policies guard releases.
Autoscaling & Resilience¶
- HTTP services: HPA (AKS) on CPU/RPS/p95 latency; ACA on RPS/queue lag.
- Workers: KEDA triggers on Service Bus depth/lag; DLQ replayers isolated.
- Resilience policies: timeouts, retries (idempotent), circuit breakers, bulkheads defined per dependency.
- Chaos: scheduled chaos experiments in non-prod; steady-state SLOs verified.
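The retry policy mentioned above (retries with jitter, applied only to idempotent operations) can be sketched independently of any framework. This is not MassTransit's API. The function names, the `TimeoutError` trigger, and the base/cap values are all assumptions chosen for illustration.

```python
import random
import time


def backoff_delays(count, base=0.2, cap=5.0, rng=random.random):
    """Yield `count` full-jitter delays: uniform in [0, min(cap, base * 2**n)]."""
    for n in range(count):
        yield rng() * min(cap, base * (2 ** n))


def call_with_retry(operation, attempts=4, sleep=time.sleep):
    """Call an idempotent operation, retrying transient timeouts with jittered backoff."""
    delays = list(backoff_delays(attempts - 1))
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted; surface the failure
            sleep(delays[attempt])
```

Jitter spreads retries from many callers over time, which helps avoid synchronized retry storms against a recovering dependency.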
Resource Graph (baseline components)¶
| Area | Azure Resource |
|---|---|
| Edge | Front Door Standard/Premium, WAF Policy, Public DNS |
| Compute | AKS (node pools per workload class), ACA Environment |
| Networking | VNets, Subnets, Private DNS Zones, Firewall/UDR |
| Identity | Managed Identities, Key Vault |
| Messaging | Service Bus Namespace (Topics/Queues, DLQ) |
| Data | Azure SQL (Geo-repl), Redis, Blob Storage, optional Mongo |
| Observability | Log Analytics, Managed Prometheus/Grafana, OTel Collectors |
| Security | Defender for Cloud policies, Microsoft Entra ID Conditional Access (for ops) |
IaC Strategy (Pulumi-first; Bicep optional)¶
- Stacks per environment: `dev`, `stage`, `prod`, with region parameters and SKU right-sizing.
- Composition: core platform stack (network, identity, observability) + workload stacks (bus, data, compute).
- Policy-as-code: OPA/Conftest in CI to validate IaC against security/compliance baselines.
- Outputs: connection endpoints via Private Link, DNS zones, MI principal IDs, and signed URLs for bootstrap.
Failure Modes & Recovery¶
- Regional Service Bus degradation → autoscale consumers, buffer at producers with backpressure; consider cross-namespace failover for enterprise-grade products.
- SQL hotspot → promote tenant to schema/DB; enable read replicas for heavy reads; cache hot lookups in Redis.
- Key Vault throttling → application-side caching with TTL; staggered rotation windows.
- Front Door anomaly → failover to secondary profile; serve maintenance page; keep Admin APIs internal-only until edge recovers.
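The Key Vault mitigation (application-side caching with a TTL) might look like the sketch below. `TtlCache` and its `loader` callback are hypothetical. The loader stands in for whatever client call actually fetches the secret, and the 300-second default is an arbitrary placeholder.

```python
import time


class TtlCache:
    """Cache loaded values for a fixed TTL to reduce calls to a throttled backend."""

    def __init__(self, loader, ttl_seconds=300.0, clock=time.monotonic):
        self._loader = loader        # callable(name) -> value; assumed to hit the vault
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # name -> (value, expires_at)

    def get(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]          # still fresh: no backend call
        value = self._loader(name)
        self._entries[name] = (value, now + self._ttl)
        return value
```

The TTL should stay shorter than the rotation overlap window so that a rotated secret is picked up before the old version is revoked.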
Solution Architect Notes¶
- Start small with ACA where possible; graduate to AKS for mesh, sidecars, or complex multi-tenant scaling.
- Keep Private Link ubiquitous—do not allow public endpoints on data plane resources.
- Make observability a dependency: deploy OTel & dashboards before tenant-facing services.
- Enforce workload identity end-to-end; any secret-based exception must be time-bound with an explicit waiver and monitoring.
FinOps & Cost Governance¶
Purpose¶
Ensure the SaaS Factory produces solutions that are cost-efficient, scalable, and predictable across tenants and editions. This requires embedding quotas, autoscaling rules, budget alerts, and cost dashboards into the baseline platform. By enforcing edition-aware policies, the factory guarantees fairness, prevents noisy-neighbor risks, and aligns operating costs with business revenue models.
FinOps Principles¶
- Cost visibility by tenant/edition: every workload and resource tagged with `tenantId`, `edition`, `productId`.
- Guardrails over gates: developers can ship features quickly but cannot bypass budget alerts or quota checks.
- Edition-aware scaling: Enterprise tenants receive higher quota ceilings; Free tenants constrained by default.
- Performance and cost together: every feature must have a defined capacity model and cost impact.
Capacity & Scaling Model¶
- Compute (AKS/ACA)
  - Autoscale on CPU, RPS, and queue depth (KEDA).
  - Quotas mapped to edition (e.g., Free = 1 vCPU cap, Enterprise = 8 vCPU burst).
- Messaging (Service Bus)
  - Namespace messaging units (Premium tier) allocated by tenant tier.
  - DLQ growth alerts trigger cost/performance investigation.
- Data (SQL, Blob, Redis)
  - Tenant-level caps (rows, storage GB) with upgrade paths.
  - Elastic pools for pooled tenants; isolated instances for Enterprise.
- Observability
  - Log retention tiered: Free = 7d, Standard = 30d, Enterprise = 90d+.
  - High-cost metrics (e.g., detailed tracing) reserved for Enterprise or by opt-in flag.
Edition-Aware Quotas¶
| Resource | Free | Standard | Enterprise |
|---|---|---|---|
| API Rate Limit (req/min) | 60 | 600 | 3000 |
| Storage (GB) | 1 | 50 | 500+ |
| Concurrent Jobs | 5 | 50 | 200 |
| Webhooks | 1 sub, 7d logs | 5 subs, 30d logs | 20 subs, 90d logs |
| Observability Retention | 7 days | 30 days | 90 days |
Quotas are enforced at gateway, service API guards, and DB-level policies.
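One possible encoding of the quota table is a lookup keyed by edition, with a guard that a gateway or service could call before admitting work. The dictionary keys and `check_quota` are illustrative names. Enterprise storage is listed as "500+", so 500 is treated here as an assumed default ceiling.

```python
# Values mirror the quota table; key names are illustrative assumptions.
QUOTAS = {
    "Free":       {"api_rate_per_min": 60,   "storage_gb": 1,   "concurrent_jobs": 5,
                   "webhook_subs": 1,  "retention_days": 7},
    "Standard":   {"api_rate_per_min": 600,  "storage_gb": 50,  "concurrent_jobs": 50,
                   "webhook_subs": 5,  "retention_days": 30},
    "Enterprise": {"api_rate_per_min": 3000, "storage_gb": 500, "concurrent_jobs": 200,
                   "webhook_subs": 20, "retention_days": 90},
}


def check_quota(edition: str, resource: str, used: float, requested: float = 1):
    """Return (allowed, consumed_percent) for a request against an edition quota."""
    limit = QUOTAS[edition][resource]
    allowed = used + requested <= limit
    consumed_percent = round(100 * used / limit, 1)
    return allowed, consumed_percent
```

The second return value could feed a consumption metric, so the same call serves both enforcement and alerting.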
Cost Dashboards & Tagging¶
- Tagging policy: all resources tagged with `env`, `product`, `tenantId`, `edition`.
- Dashboards: Azure Cost Management + Grafana integration, filtered by product/tenant/edition.
- KPIs tracked:
  - Cost per tenant per month
  - Cost per active user (CPU hours / requests)
  - Cost of observability (log/metric ingestion rates)
  - Idle resource cost vs. active usage
Budget Alerts & Guardrails¶
- Budgets set per environment (Dev/Stage/Prod) and per product.
- Alerts triggered at 50/75/90/100% thresholds.
- Automated Slack/Teams notifications to product teams.
- Stopgap policies: throttle non-critical workloads if spend breaches thresholds in Free/Trial environments.
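To avoid repeating the same notification on every evaluation, an alerting job can report only the thresholds newly crossed since the previous reading. The sketch assumes spend and budget are in the same currency unit. The function name is hypothetical.

```python
# Alert thresholds (percent of budget) from the list above.
THRESHOLDS = (50, 75, 90, 100)


def crossed_thresholds(previous_spend: float, current_spend: float,
                       budget: float) -> list[int]:
    """Return thresholds crossed between two consecutive spend readings."""
    prev_pct = 100 * previous_spend / budget
    curr_pct = 100 * current_spend / budget
    return [t for t in THRESHOLDS if prev_pct < t <= curr_pct]
```

For example, a jump from 40% to 80% of budget reports both the 50% and 75% thresholds once, and later readings between 80% and 89% report nothing.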
Performance Budgets¶
- Each service defines baseline throughput, latency, and cost envelope.
- Perf tests run as part of release process:
  - Tenant onboarding latency ≤ 60s (p95)
  - API p95 latency ≤ 200 ms (reads) / 350 ms (writes)
  - Cost per 1000 requests within target band (< $0.05 for pooled tenants)
- Perf regression gates in CI/CD: load tests in stage environment with synthetic tenants; fail build if >10% degradation.
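A regression gate of this kind compares a candidate build with a stored baseline and with the absolute budget. The sketch below assumes p95 latency in milliseconds as the compared metric and uses the 10% tolerance from the text. `perf_gate` is a hypothetical helper.

```python
def perf_gate(baseline_p95_ms: float, candidate_p95_ms: float,
              budget_ms: float, max_regression: float = 0.10) -> list[str]:
    """Return reasons to fail the build; an empty list means the gate passes."""
    failures = []
    if candidate_p95_ms > budget_ms:
        failures.append(f"p95 {candidate_p95_ms} ms exceeds budget {budget_ms} ms")
    regression = (candidate_p95_ms - baseline_p95_ms) / baseline_p95_ms
    if regression > max_regression:
        failures.append(f"p95 regressed {regression:.1%} vs baseline")
    return failures
```

Checking both conditions matters: a service can stay inside its absolute budget while still drifting more than 10% per release.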
Observability for FinOps¶
- Traces include `tenantId` + cost-relevant metrics (CPU time, DB queries, cache hits).
- Metrics: `cost_per_tenant`, `quota_consumed_percent`, `autoscale_events`.
- Alerts:
  - `quota_consumed_percent > 90%` for Standard/Enterprise → notify tenant admin via portal.
  - `autoscale_events > threshold` → investigate cost/perf drift.
Risks & Mitigations¶
| Risk | Impact | Mitigation |
|---|---|---|
| Noisy neighbors in pooled DB | Performance degradation | Quotas, RLS, hot-tenant promotion to schema/db |
| Cost overrun in observability | High OPEX | Tiered retention, sampling, enterprise opt-in |
| Under-provisioned Enterprise tenant | SLA breach | Autoscale headroom, proactive capacity tests |
| Free tenants gaming system | Abuse of resources | Strict quotas, rate limits, throttling policies |
Solution Architect Notes¶
- Bake FinOps into templates: every service scaffold includes quota configs, cost tags, and perf test harness.
- Expose cost visibility to tenants: portal shows per-tenant usage and cost breakdowns (transparency + upsell driver).
- Use synthetic perf tenants in stage to continuously model cost-per-tenant at scale.
- Cost + performance are first-class SLOs: treat regressions in either as blockers to release.
Governance, ADRs & Change Management¶
Purpose¶
Establish a structured governance model for architectural decisions, API contracts, and platform evolution. The goal is to ensure that every SaaS product generated by the factory is traceable, reviewable, and predictable in its decision-making, while allowing for safe innovation through structured change management.
ADR Process & Repository¶
- Format: log4brains ADRs in Markdown (`docs/adr/`).
- Structure:
  - Title & ID (incremental, `0001-title.md`)
  - Status: Proposed / Accepted / Superseded / Rejected
  - Context → Decision → Consequences → Alternatives → References
- Lifecycle:
  - Draft ADR opened with PR.
  - Peer review via Architecture Guild or Review Board.
  - Accepted ADR merged → published in ADR site.
  - If superseded, new ADR links back with rationale.
- Templates: ADR template provided in `docs/templates/adr-template.md`.
Decision Lifecycle¶
- Trigger: new requirement, tech evaluation, incident, compliance need.
- Proposal: ADR draft with context/problem/alternatives.
- Review: async discussion in PR; optional Architecture Guild sync.
- Decision: maintainers approve/merge ADR.
- Implementation: tracked via linked Azure DevOps epics/features.
- Sunset: decision reviewed after 12–18 months, or upon major platform change.
Principles:
- One decision per ADR.
- ADRs are immutable once accepted (except metadata).
- Supersession, not deletion.
Review Boards¶
- Architecture Guild: senior engineers, product architects, security/privacy officers.
- Review cadence: weekly triage; quarterly backlog review.
- Charter:
- Maintain decision traceability.
- Balance innovation with stability.
- Escalate major trade-offs (cost, compliance, security).
- Voting model: consensus preferred, fallback majority.
Contract Versioning & Deprecation¶
API Contracts (REST/gRPC/events)
- Source of truth in `contracts/` folder (OpenAPI/Protobuf/JSON Schemas).
- Versioning:
  - REST: `/v1/...`, `/v2/...`
  - gRPC: `package service.v1`
  - Events: `eventName.v1`, `eventName.v2`
- Compatibility: additive changes allowed in same version; breaking changes → new major.
Deprecation & Sunset Policy
- Announcement: contract flagged as `deprecated` in spec + docs.
- Headers: `Deprecation`, `Sunset`, and `Link` to changelog returned in responses.
- Timeline:
  - Minimum 12-month overlap for deprecated APIs.
  - Enterprise tenants can negotiate extended support.
  - Telemetry monitors usage of deprecated versions.
- Removal: only after telemetry shows <1% usage + published sunset date passed.
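The removal rule combines two conditions, which the sketch below makes explicit. It also shows one way to render a `Sunset` header value as an HTTP-date, the format used by RFC 8594. The function names are hypothetical, and the usage counts are assumed to come from the telemetry mentioned above.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def sunset_header(sunset_at: datetime) -> str:
    """Format the Sunset header value as an HTTP-date in GMT."""
    return format_datetime(sunset_at.astimezone(timezone.utc), usegmt=True)


def can_remove(deprecated_calls: int, total_calls: int,
               sunset_at: datetime, now: datetime) -> bool:
    """Removal requires usage below 1% AND a sunset date already in the past."""
    usage = deprecated_calls / total_calls if total_calls else 0.0
    return usage < 0.01 and now > sunset_at
```

Both inputs to `can_remove` should be timezone-aware so that the comparison does not depend on the server's local clock settings.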
Governance & Change Management¶
- Config changes: go through Config Change Review (CCR) process, with audit trail in Git + change tickets.
- Schema changes: must be backward compatible; validated in CI via migration tests.
- Secrets/keys: rotated via automation; ADR required if new secret store introduced.
- Edition policies: changes to edition entitlements logged as ADRs + documented in release notes.
- Emergency changes: handled via expedited ADR (lightweight doc, ratify post-incident).
Observability of Decisions¶
- ADR site (log4brains + MkDocs Material) embedded in developer portal.
- Decision dashboards: show active ADRs, deprecated ones, and upcoming sunsets.
- Traceability: every epic/feature in Azure DevOps links to one or more ADR IDs.
Risks & Mitigations¶
| Risk | Impact | Mitigation |
|---|---|---|
| Decision sprawl | Inconsistent approaches across services | Central ADR repo + guild oversight |
| Slow reviews | Blocks delivery | Async PR reviews + time-boxed discussions |
| Deprecated APIs in use | Security/compliance gaps | Telemetry-driven enforcement; forced sunset dates |
| Lack of traceability | Regulatory audit failures | ADR portal + linked DevOps work items |
Solution Architect Notes¶
- Treat ADRs as first-class artifacts: they evolve the platform as much as code.
- Encourage lightweight but frequent ADRs — better many small scoped decisions than one bloated doc.
- Deprecation without telemetry is a blind spot: enforce contract usage dashboards.
- Change governance should be templatized in the factory so every new product inherits the same discipline.
Risk Register, Security Exceptions & Waivers¶
Purpose¶
Establish a centralized, continuously updated risk register for the SaaS Factory and its generated products. Define a matrix-based scoring model (likelihood × impact), provide structured mitigation plans, and standardize the handling of security exceptions and waivers. This ensures risks are visible, time-bound, and owned, while maintaining regulatory and audit readiness.
Risk Matrix Model¶
Scoring Dimensions
- Likelihood: Rare (1), Unlikely (2), Possible (3), Likely (4), Almost Certain (5)
- Impact: Negligible (1), Minor (2), Moderate (3), Major (4), Critical (5)
Risk Level = Likelihood × Impact
| Score | Level | Response |
|---|---|---|
| 1–4 | Low | Accept; monitor |
| 5–9 | Medium | Mitigate or accept with waiver |
| 10–16 | High | Mitigation required, track in DevOps |
| 17–25 | Critical | Immediate remediation, exec visibility |
Example Risk Register¶
| ID | Risk Description | Likelihood | Impact | Score | Owner | Mitigation | Status |
|---|---|---|---|---|---|---|---|
| R-001 | PII leakage in logs | 3 | 5 | 15 (High) | Security Eng | OTel PII scrubbing, CI scanners, redaction lib | Mitigation in progress |
| R-002 | Multi-tenant data bleed (RLS misconfig) | 2 | 5 | 10 (High) | DB Architect | Row-level security tests, tenantId invariant, contract tests | Open |
| R-003 | Stale TLS certs (expired) | 3 | 3 | 9 (Medium) | Ops Lead | Automated cert rotation, alerts 30/7/3 days | Closed |
| R-004 | Dependency CVEs in OSS libs | 4 | 4 | 16 (High) | Eng Enablement | Renovate/Dependabot, CVE scanning in CI | Ongoing |
| R-005 | Excess observability costs (log storm) | 4 | 2 | 8 (Medium) | FinOps | Sampling, quotas, alerting | Open |
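The scoring model is simple enough to encode directly, which helps keep register entries consistent across reviewers. The band boundaries below are copied from the matrix table. The function name is illustrative.

```python
def risk_level(likelihood: int, impact: int) -> tuple[int, str]:
    """Return (score, level) using the Likelihood x Impact matrix."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be between 1 and 5")
    score = likelihood * impact
    if score <= 4:
        level = "Low"
    elif score <= 9:
        level = "Medium"
    elif score <= 16:
        level = "High"
    else:
        level = "Critical"
    return score, level
```

Applied to the example register, R-001 (3 × 5) yields 15 (High) and R-005 (4 × 2) yields 8 (Medium), matching the table.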
Security Exceptions & Waivers¶
Workflow
- Request: Team submits exception form with context, rationale, alternatives considered.
- Review: Security Board evaluates risk level and business justification.
- Approval: If granted, exception logged with expiry date and remediation plan.
- Tracking: Exception ID linked to DevOps work item; progress reviewed monthly.
- Expiry: Automatic reminders 30/7/1 days before expiration; must renew or close.
Waiver Metadata (template)
- Waiver ID: `SEC-WVR-2025-001`
- Requested by: Team/Owner
- Scope: service/context affected
- Risk reference(s): R-001, R-004
- Justification: why no immediate mitigation possible
- Expiry date: e.g., 90 days
- Mitigation plan: concrete steps + timeline
- Approver: Security Officer / Architecture Guild
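A waiver record with a built-in expiry makes the "must renew or close" rule mechanically checkable. The dataclass below is a sketch. Field names loosely follow the template, the 90-day default comes from the example expiry, and the reminder offsets are the 30/7/1 days given in the workflow.

```python
from dataclasses import dataclass
from datetime import date, timedelta

REMINDER_DAYS = (30, 7, 1)  # days before expiry at which reminders go out


@dataclass
class Waiver:
    waiver_id: str
    requested_by: str
    scope: str
    risk_refs: tuple
    approved_on: date
    valid_days: int = 90

    @property
    def expires_on(self) -> date:
        return self.approved_on + timedelta(days=self.valid_days)

    def reminder_due(self, today: date) -> bool:
        """True on the days when an automatic reminder should be sent."""
        return (self.expires_on - today).days in REMINDER_DAYS

    def is_expired(self, today: date) -> bool:
        return today >= self.expires_on
```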
Exception Categories¶
- Technical Debt: Legacy library pending replacement.
- Operational Constraint: Vendor service does not yet support mTLS.
- Regulatory Delay: Awaiting legal guidance before applying stricter residency policy.
- Business Urgency: Feature launch requires temporary compromise (time-boxed).
Governance & Review¶
- Security & Architecture Board owns risk register reviews.
- Monthly: review new risks, open waivers, and expired waivers.
- Quarterly: risk posture reassessment; report to exec sponsors.
- Audit: risk register and waiver log stored in Git (`docs/governance/risks.md`, `docs/governance/waivers.md`).
Observability of Risk Posture¶
- Dashboards track:
  - Open risks by severity.
  - Exceptions expiring in next 30 days.
  - Distribution of risk categories (security, compliance, cost, operational).
- Alerts: Slack/Teams notification for critical risk creation or waiver expiry.
Risks & Mitigations (meta-level)¶
| Meta Risk | Mitigation |
|---|---|
| “Paper-only” register (unused) | Integrate with DevOps epics; risk must have linked tasks |
| Perpetual waivers | Force expiry; auto-archive stale exceptions |
| Inconsistent scoring | Use standard matrix; train reviewers; cross-review |
| Lack of visibility | Dashboards & ADR references in portal |
Solution Architect Notes¶
- Waivers are not forever: each must be tied to a remediation timeline and sunset.
- The risk register should evolve with ADRs—a major architectural decision should consider risk impact explicitly.
- Treat the register as live operational telemetry, not a static doc.
- Encourage engineers to proactively raise risks; the cost of under-reporting is far higher than managing a larger register.
Rollout & Tenant Onboarding / Cutover¶
Purpose¶
Provide a standardized, low-risk rollout framework for new products, editions, and tenants. Define phased rollout strategies, onboarding workflows, migration/cutover playbooks, backout strategies, and communication templates. Ensure tenant experience is consistent, compliant, and reversible in case of failures.
Rollout Strategy¶
Phased Rollouts
- Internal First: internal tenants (synthetic + staff tenants) validate production infrastructure.
- Pilot Tenants: 3–5 selected customers (Enterprise or early adopters).
- Regional Expansion: enable per geography or residency zone.
- General Availability: open onboarding to all editions.
Progressive Controls
- Feature flags gate new capabilities.
- Canary deployments split traffic by tenant cohort or edition tier.
- Automated rollback if SLOs violated during rollout.
Tenant Onboarding Workflows¶
New Tenant Onboarding
- Tenant Admin signs up (via portal or API).
- Tenant entry created in Tenant Service (pooled by default).
- Isolation model applied (pooled/schema/db).
- Default edition + entitlements assigned.
- Baseline config + feature flags seeded.
- Synthetic smoke tests run under new tenant ID.
- Tenant marked “active” in registry.
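The onboarding steps above can be modeled as an ordered pipeline in which the tenant only becomes active once every step succeeds. The step names and the `handlers` mapping are hypothetical. Each handler stands in for a call to the relevant service and returns `True` on success.

```python
# Step names are illustrative labels for the workflow described above.
ONBOARDING_STEPS = [
    "create_tenant_entry",
    "apply_isolation_model",
    "assign_edition",
    "seed_config_and_flags",
    "run_smoke_tests",
]


def onboard(tenant_id: str, handlers: dict) -> dict:
    """Run each step in order; mark the tenant active only if all succeed."""
    completed = []
    for step in ONBOARDING_STEPS:
        if not handlers[step](tenant_id):
            return {"tenant_id": tenant_id, "status": "failed",
                    "failed_step": step, "completed": completed}
        completed.append(step)
    return {"tenant_id": tenant_id, "status": "active", "completed": completed}
```

Recording the completed steps gives a retry or compensation routine a clear starting point when a later step fails.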
Migration / Cutover Flow
- Trigger: product upgrade, edition change, residency migration.
- Steps:
  - Freeze writes (short downtime window if needed).
  - Export + transform data (schema/edition-aware).
  - Import into new store (schema/db).
  - Validate with synthetic checks (data count, checksum, smoke tests).
  - Switch traffic at gateway (new edition endpoints).
  - Monitor SLOs; keep old infra on standby for rollback window.
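The validation step (data count plus checksum) can be approximated with an order-independent fingerprint per table. This sketch hashes Python `repr` output, which is only a stand-in. A real migration would need a canonical row serialization agreed between source and target stores.

```python
import hashlib


def table_fingerprint(rows) -> tuple[int, str]:
    """Return (row_count, checksum) that does not depend on row order."""
    digests = sorted(
        hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows
    )
    combined = hashlib.sha256("".join(digests).encode("ascii")).hexdigest()
    return len(digests), combined


def validate_cutover(source_rows, target_rows) -> bool:
    """True when both stores hold the same rows, regardless of ordering."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)
```

Sorting the per-row digests before combining them means the check tolerates different physical ordering after import while still catching missing or altered rows.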
sequenceDiagram
participant Admin as Tenant Admin
participant Portal as Portal/API
participant TenantSvc as Tenant Service
participant Config as Config/Flags
participant Billing as Billing Service
participant Audit as Audit Log
participant Ops as Ops/SRE
Admin->>Portal: Sign-up / Edition change
Portal->>TenantSvc: CreateTenant/Upgrade
TenantSvc->>Config: Seed baseline flags
TenantSvc->>Billing: Assign subscription plan
TenantSvc->>Audit: Record onboarding/migration
Ops->>TenantSvc: Run synthetic tests
TenantSvc-->>Portal: Tenant ready (status=Active)
Migration & Cutover Runbooks¶
- Pre-Migration Checklist
  - Validate backups + PITR windows.
  - Confirm tenant metadata (ID, edition, residency).
  - Schedule migration window with comms.
  - Run dry-run migration in staging.
- Cutover Checklist
  - Freeze tenant write operations.
  - Take backup/snapshot.
  - Run export → import pipeline.
  - Validate data integrity + run smoke tests.
  - Switch DNS/gateway route.
  - Monitor telemetry dashboards for anomalies.
- Backout Plan
  - Rollback DNS/gateway to old infra.
  - Restore data snapshot.
  - Notify tenant of rollback and ETA for retry.
Communication Templates¶
- Pre-Onboarding Email: “Your tenant environment is being provisioned. Expect ~15 minutes before services are active. You will receive confirmation when onboarding completes.”
- Migration Notice: “We are upgrading your tenant to a new edition. A brief read-only window will occur from 02:00–02:30 UTC. Data integrity and continuity are guaranteed. In case of issues, we will revert within 30 minutes.”
- Completion Notice: “Migration successful. All services are active. Please validate your workflows. If you encounter any issues, contact support with your trace ID.”
- Rollback Notice: “We reverted your tenant migration due to anomalies detected. Services remain available on the prior edition. We will reschedule and notify you once stability is confirmed.”
Observability & Readiness¶
- Readiness checks: synthetic tenant login, flag evaluation, billing call, event publishing, audit log write.
- Telemetry dashboards: show onboarding duration, failure rate, rollback events.
- SLOs:
  - Onboarding p95 ≤ 10 minutes.
  - Migration success ≥ 99.5% without manual intervention.
  - Rollback readiness ≤ 30 minutes.
Risks & Mitigations¶
| Risk | Impact | Mitigation |
|---|---|---|
| Data mismatch post-migration | Tenant disruption | Checksums + synthetic tests before traffic cutover |
| Rollback failure | Extended outage | Immutable backups, DNS-based rollback, ops runbooks |
| Communication gaps | Customer dissatisfaction | Standard comms templates; automated notifications |
| Edition misassignment | Wrong entitlements | Config seeding automation; validation against edition matrix |
Solution Architect Notes¶
- Treat onboarding as code: workflows defined in automation (Pipelines/Functions), not manual ops.
- Every migration must have a rollback plan validated in non-prod.
- Prefer progressive cutover (tenant cohorts) over “big bang” migrations.
- Expose self-service onboarding APIs but enforce strict tenancy validation and audit logging.
Operations, DR & BCP¶
Purpose¶
Define the operational framework for running the SaaS platform in production, including incident response models, disaster recovery (DR) objectives, business continuity planning (BCP), and continuous improvement loops. Ensure resilience by codifying RTO/RPO targets, runbooks, escalation policies, and post-incident learning practices.
Incident Response Model¶
Principles
- Always-on detection: observability signals drive automated alerting.
- Severity-driven triage: classify incidents by impact scope (tenant, edition, platform-wide).
- Clear ownership: each bounded context has an on-call roster with escalation paths.
- Blameless culture: focus on systemic fixes, not individual blame.
Severity Levels
| Sev | Example | Response Target |
|---|---|---|
| 1 (Critical) | Multi-tenant outage, data breach | Respond ≤ 5 min, resolve ≤ 2h |
| 2 (High) | Single-tenant critical service loss | Respond ≤ 15 min, resolve ≤ 4h |
| 3 (Medium) | Performance degradation, delayed jobs | Respond ≤ 1h, resolve ≤ 24h |
| 4 (Low) | Cosmetic UI issue, minor bug | Triage in backlog |
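The response targets in the table can be checked automatically once incident timestamps are available. The sketch assumes each incident provides a time-to-respond and time-to-resolve as `timedelta` values. Sev 4 has no timed target because it is triaged in the backlog.

```python
from datetime import timedelta

# (respond_within, resolve_within) per severity, from the table above.
TARGETS = {
    1: (timedelta(minutes=5), timedelta(hours=2)),
    2: (timedelta(minutes=15), timedelta(hours=4)),
    3: (timedelta(hours=1), timedelta(hours=24)),
}


def sla_breaches(severity: int, time_to_respond: timedelta,
                 time_to_resolve: timedelta) -> list[str]:
    """Return which targets ("respond", "resolve") an incident missed."""
    targets = TARGETS.get(severity)
    if targets is None:
        return []  # Sev 4: no timed target
    respond_within, resolve_within = targets
    breaches = []
    if time_to_respond > respond_within:
        breaches.append("respond")
    if time_to_resolve > resolve_within:
        breaches.append("resolve")
    return breaches
```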
Escalation
- PagerDuty/Teams alerts → On-call engineer → Escalation to service lead → SRE/Architecture Guild if systemic.
- Stakeholder comms via status page + tenant emails.
DR/BCP Targets¶
Disaster Recovery Objectives
- RTO (Recovery Time Objective): ≤ 1 hour for core services, ≤ 4 hours for non-critical.
- RPO (Recovery Point Objective): ≤ 15 minutes for transactional data (SQL, Service Bus), ≤ 1 hour for logs/telemetry.
Continuity Scenarios
- Regional outage: failover to paired Azure region (e.g., West Europe ↔ North Europe).
- Service degradation: reroute workloads to ACA fallback if AKS cluster impaired.
- Data corruption: restore from PITR backups with integrity checks.
- Identity outage: cached tokens + reduced-mode features until IdP restored.
Runbooks & Playbooks¶
Examples
-
Outage Playbook:
- Identify impacted tenants/editions.
- Trigger failover runbook (DNS cutover, service redeploy).
- Notify stakeholders.
- Verify SLO restoration.
-
Data Corruption:
- Halt writes.
- Restore PITR snapshot.
- Re-run integrity checks.
- Replay DLQ events.
- Service Bus Saturation:
- Scale consumers via KEDA.
- Move DLQ messages to a holding store so they can be replayed later.
- Tune retry backoff.
- Tenant Migration Failure:
- Roll back to the previous DB/schema.
- Reassign tenant routing.
- Notify tenant with comms template.
Runbooks are stored in docs/runbooks/*.md; each includes steps, required roles, estimated timings, and rollback procedures.
Post-Incident Review (PIR)¶
- Held within 72 hours of any Sev-1 or Sev-2 incident.
- Format: What happened? What was the impact? What went well? What failed? What do we improve?
- Outputs:
- Incident report in two versions: a redacted version for tenants and the public, and a detailed internal version.
- Linked DevOps items for remediation.
- SLA/SLO impact recorded for trend tracking.
Blameless retrospectives are mandatory; they reinforce the culture of fixing systems rather than blaming individuals.
Operational KPIs & Continuous Improvement¶
Key Metrics
- MTTR (Mean Time to Recovery)
- MTTA (Mean Time to Acknowledge)
- SLA compliance (%)
- Error budget burn rate
- Change failure rate (CFR)
- Incident recurrence rate
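As an illustration of how the time-based metrics and CFR might be derived from raw records, here is a minimal sketch. The `Incident` record is hypothetical, and MTTR is measured here from detection to resolution, which is one common convention rather than a definition stated in this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Incident:
    detected: datetime
    acknowledged: datetime
    resolved: datetime


def mtta(incidents: Sequence[Incident]) -> timedelta:
    """Mean time from detection to acknowledgement (needs at least one incident)."""
    seconds = [(i.acknowledged - i.detected).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents: Sequence[Incident]) -> timedelta:
    """Mean time from detection to resolution (assumed convention)."""
    seconds = [(i.resolved - i.detected).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Share of deployments that caused a failure needing remediation."""
    if total_deployments == 0:
        return 0.0
    return failed_deployments / total_deployments
```

Tracking these per service and per tenant tier, rather than only platform-wide, keeps the numbers aligned with the tenant-aware dashboards described elsewhere in this section.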
Continuous Improvement Loop
- Collect KPIs + PIR findings.
- Update ADRs or runbooks as needed.
- Feed back into test automation (synthetic checks for past failures).
- Share learnings across product teams in guild sessions.
Observability & Automation¶
- SLO dashboards per service, tenant-aware.
- Error budget alerts tied to rollout progression (halt if exceeded).
- Runbook automation: scripts for DNS cutover, database restore, tenant reroute.
- Chaos engineering: scheduled DR drills to validate RTO/RPO in practice.
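The error budget gate on rollouts can be expressed as a burn-rate check: the observed error rate divided by the error rate the SLO allows. The sketch below is an assumption about how such a gate might look; the function names and the default threshold of 1.0 are illustrative and not values specified by this design.

```python
def burn_rate(failed_requests: int, total_requests: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly at the sustainable
    pace; higher values exhaust it early.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    if total_requests == 0:
        return 0.0
    error_budget = 1.0 - slo
    return (failed_requests / total_requests) / error_budget


def should_halt_rollout(failed_requests: int, total_requests: int,
                        slo: float, max_burn_rate: float = 1.0) -> bool:
    """Hypothetical gate: stop progressing a rollout when the budget burns too fast."""
    return burn_rate(failed_requests, total_requests, slo) > max_burn_rate
```

With a 99.9% SLO, 5 failures in 1,000 requests gives a burn rate of about 5, so the gate would halt the rollout; against a 99% SLO the same traffic burns at about 0.5 and the rollout would continue.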
Risks & Mitigations¶
| Risk | Impact | Mitigation |
|---|---|---|
| Unclear on-call ownership | Delayed response | On-call schedules automated; service owner registry |
| DR drills skipped | False sense of security | Mandatory quarterly tests with reports |
| PIRs not actioned | Repeat incidents | Link PIR action items to Azure DevOps work items with a remediation SLA |
| Overloaded on-call team | Burnout | Rotations, SLO budgets, escalation to guilds |
Solution Architect Notes¶
- Treat DR/BCP as code: IaC templates for failover infra, automated DNS cutovers, scripted PITR restores.
- Ensure multi-tenant context in all incident workflows (know who’s impacted immediately).
- Regularly test rollback and failover—don’t assume Azure SLAs replace DR.
- Continuous improvement is the differentiator: every incident should make the platform stronger.
Real-Life Example SaaS Products¶
Purpose¶
Ground the high-level design in practical, scenario-based examples that demonstrate how the factory can generate distinct SaaS offerings using its reusable patterns. These examples highlight flexibility across industries, compliance requirements, and AI-first capabilities.
Example 1: SaaS CRM Lite¶
- Target Tenants: SMEs and startups.
- Core Contexts Used: Identity, Tenant Management, Config, Notifications.
- Data Model: Pooled DB (multi-tenant shared schema with RLS).
- Editions:
- Free → 5 users, basic CRM (contacts, tasks).
- Standard → unlimited users, reporting dashboards.
- Enterprise → advanced workflows, SSO, API access.
- Extensibility:
- Webhooks for customer lifecycle events.
- Slack and Microsoft Teams integration via outbound events.
- REST APIs for partner add-ons.
Why it works: A simple product leverages the factory’s core scaffolding (identity, multi-tenancy, config) with minimal customization.
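The edition tiers above could be modelled as data that services consult for entitlement checks. This is a sketch under stated assumptions: the feature keys are invented for illustration, and it assumes each higher edition includes everything in the one below it, which the list implies but does not state outright.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Edition:
    name: str
    max_users: Optional[int]  # None means unlimited
    features: FrozenSet[str]


# Feature keys are hypothetical; user limits follow the edition list above.
FREE = Edition("Free", 5, frozenset({"contacts", "tasks"}))
STANDARD = Edition("Standard", None, FREE.features | {"reporting_dashboards"})
ENTERPRISE = Edition(
    "Enterprise", None,
    STANDARD.features | {"advanced_workflows", "sso", "api_access"},
)


def can_add_user(edition: Edition, current_users: int) -> bool:
    """True if the tenant may add one more user under its edition."""
    return edition.max_users is None or current_users < edition.max_users


def has_feature(edition: Edition, feature: str) -> bool:
    return feature in edition.features
```

A Free tenant with five users would be refused a sixth, while `has_feature(ENTERPRISE, "sso")` returns `True`.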
Example 2: Healthcare SaaS (HIPAA Overlay)¶
- Target Tenants: Clinics and medical practices.
- Overlays Applied: HIPAA compliance, immutable audit trails, PHI encryption, data residency enforcement.
- Editions:
- Standard → support for multiple clinics under one tenant.
- Enterprise → per-facility isolation (schema-per-tenant or DB-per-tenant).
- Integrations:
- FHIR APIs to connect with EHR systems.
- Secure patient notifications (SMS/email with consent).
- Billing and insurance workflows.
- Data Controls:
- Tokenization of PHI in audit logs.
- WORM storage policies for compliance exports.
Why it works: Demonstrates the policy overlays and regulatory guardrails built into the factory (GDPR/HIPAA packs).
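One way to tokenize PHI before it reaches an audit log is keyed hashing, sketched below with Python's standard `hmac` module. This is an illustrative technique and not necessarily what the HIPAA overlay uses (a vault-backed token service is another option); the field names in `PHI_FIELDS` are hypothetical, and in practice the key would come from a managed secret store.

```python
import hashlib
import hmac

# Hypothetical set of audit-event fields treated as PHI.
PHI_FIELDS = frozenset({"patient_name", "ssn", "date_of_birth"})


def tokenize(value: str, key: bytes) -> str:
    """Replace a value with a deterministic, non-reversible keyed token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:32]


def redact_audit_event(event: dict, key: bytes) -> dict:
    """Return a copy of the event with PHI fields tokenized, others unchanged."""
    return {
        name: tokenize(str(value), key) if name in PHI_FIELDS else value
        for name, value in event.items()
    }
```

Because the token is deterministic for a given key, auditors can still correlate events about the same patient without seeing the underlying value; using a separate key per tenant would prevent correlation across tenants.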
Example 3: AI Knowledge SaaS¶
- Target Tenants: Universities, training providers, research institutions.
- Core Feature: AI-first orchestration via Semantic Kernel.
- Capabilities:
- AI tutors (multi-agent orchestration).
- Knowledge search across uploaded materials.
- Context-aware Q&A with tenant-isolated embeddings.
- Edition Features:
- Free → limited AI queries (e.g., 50/month).
- Standard → shared embeddings, course recommendations.
- Enterprise → private embeddings, multi-agent orchestration, ingestion of tenant-specific datasets.
- Observability:
- Per-tenant AI usage dashboards.
- Quotas to prevent overuse and ensure fair cost allocation.
Why it works: Showcases factory support for AI workloads (semantic orchestration, tenant-aware embeddings, quotas) and highlights the ability to build future-ready products.
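Per-tenant quotas of this kind can be sketched as a counter keyed by tenant and billing month. The example below is a simplified in-memory illustration: `QuotaTracker` is a hypothetical name, the 50-query Free limit comes from the edition list above, and the unlimited Standard and Enterprise values are placeholders because the document does not specify them.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


class QuotaTracker:
    """In-memory monthly AI query counter per tenant (illustrative only)."""

    def __init__(self, monthly_limits: Dict[str, Optional[int]]):
        self._limits = monthly_limits  # edition -> limit, None means unlimited
        self._used: Dict[Tuple[str, str], int] = defaultdict(int)

    def try_consume(self, tenant_id: str, edition: str, month: str) -> bool:
        """Record one query if the tenant is under its limit for the month."""
        limit = self._limits[edition]  # unknown editions raise KeyError
        key = (tenant_id, month)
        if limit is not None and self._used[key] >= limit:
            return False
        self._used[key] += 1
        return True

    def usage(self, tenant_id: str, month: str) -> int:
        return self._used[(tenant_id, month)]


# Free limit from the edition list; the other values are placeholders.
LIMITS = {"free": 50, "standard": None, "enterprise": None}
```

A production version would persist counters in a shared store and emit the usage figures to the per-tenant dashboards; the control flow would stay the same.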
Key Takeaways¶
- The factory is industry-agnostic: from lightweight CRM to regulated healthcare to AI-first products.
- Edition and overlay packs adapt the same baseline platform for different compliance and market needs.
- Multi-tenancy and extensibility patterns (webhooks, APIs, events) repeat across all products.
- The platform’s design ensures reusability, speed-to-market, and compliance by default.
Conclusion & Summary¶
Why the Factory Exists¶
The ConnectSoft SaaS Factory was conceived to dramatically reduce time-to-market for SaaS solutions while embedding non-negotiable guardrails for security, compliance, and reliability. By abstracting away repeated engineering concerns — identity, tenancy, observability, compliance, CI/CD, FinOps — the factory allows product teams to focus on differentiated business value instead of reinventing the same platform foundations.
Core Pillars¶
The design presented in this HLD is anchored on four cross-cutting pillars:
- Security by Design — OAuth2/OIDC, workload identities, encryption everywhere, data classification, policy enforcement.
- Multi-Tenancy & Editions — pooled vs. isolated tenancy, per-tenant config/flags, edition overlays, migration paths.
- Observability & Resilience — OTel-first telemetry, traces/logs/metrics, error budgets, SLO dashboards, chaos testing.
- AI-First Orchestration — Semantic Kernel and Microsoft.Extensions.AI enable agentic workflows, intelligent assistants, and future-ready product capabilities.
Together, these pillars ensure every generated SaaS product is compliant, scalable, and adaptable.
Blueprint Completeness¶
Across the sections, the document has defined:
- Vision & Personas → what problems we solve, for whom.
- Bounded Contexts & Architecture → how services decompose, integrate, and evolve.
- Non-Functional Guarantees → SLOs, privacy, compliance, resilience.
- Operational Excellence → CI/CD pipelines, FinOps guardrails, risk registers, DR/BCP.
- Practical Examples → how real SaaS products (CRM, Healthcare, AI Knowledge) emerge from the same templates.
This represents a complete high-level blueprint for both the factory platform and the SaaS solutions it produces.
Path Forward¶
This HLD is not static. It is the foundation for implementation, and will evolve through:
- ADRs: Architecture Decision Records documenting trade-offs and changes.
- Epics & Features: Work planned and executed in Azure DevOps.
- Continuous Improvement: Post-incident reviews, PIR-driven ADRs, and refinements to templates and runbooks.
Every new SaaS product built with the factory contributes learnings and enhancements back into the core templates, making the platform stronger with each iteration.
Final Note¶
The ConnectSoft SaaS Factory provides a repeatable, governed, AI-driven model for delivering SaaS at scale. It balances innovation velocity with enterprise-grade guardrails, ensuring that each generated solution is secure, observable, multi-tenant aware, and ready for production from day one.
This document closes the High-Level Design phase and marks the transition to detailed design and implementation.