Resiliency in ConnectSoft Microservice Template¶
Purpose & Overview¶
Resiliency is the ability of a system to recover gracefully and continue operating despite encountering failures, unexpected loads, or system disruptions. In the ConnectSoft Microservice Template, resiliency is a fundamental design principle that ensures microservices can handle transient faults, network issues, and service outages without compromising user experience or data integrity.
Resiliency encompasses:
- Transient Fault Handling: Automatic retry of failed operations with exponential backoff
- Circuit Breaking: Preventing cascading failures by stopping requests to failing services
- Timeout Management: Ensuring operations complete within acceptable timeframes
- Fallback Mechanisms: Providing alternative responses when primary operations fail
- Resource Isolation: Preventing failures in one area from affecting others
- Rate Limiting: Protecting services from being overwhelmed by traffic
- Load Balancing: Distributing traffic across multiple service instances
Resiliency Philosophy
Resiliency is not about preventing failures—it's about designing systems that gracefully handle failures when they occur. The template implements proven patterns and best practices to ensure services remain available, responsive, and reliable even under adverse conditions. Every external dependency interaction should be protected with appropriate resilience strategies.
Architecture Overview¶
Resilience Layers¶
```
Application Layer
├── HTTP Client Resilience
│   ├── Retry Policies
│   ├── Circuit Breakers
│   ├── Timeouts
│   └── Fallback Responses
├── Messaging Resilience
│   ├── MassTransit Retry Policies
│   ├── NServiceBus Recoverability
│   └── Dead Letter Queues
├── Background Job Resilience
│   ├── Hangfire Automatic Retry
│   └── Job Failure Handling
└── Database Resilience
    ├── Connection Retry
    ├── Query Timeouts
    └── Transaction Handling
```
Resilience Pattern Flow¶
```
Request/Operation
    ↓
Timeout Check
    ↓
Retry Policy (if transient failure)
    ↓
Circuit Breaker Check
    ↓
Bulkhead Isolation
    ↓
Execute Operation
    ↓
Success or Failure Handling
    ↓
Fallback (if configured)
```
Resilience Patterns¶
Retry Pattern¶
The Retry pattern automatically retries failed operations based on a predefined policy. It's particularly useful for handling transient faults such as network timeouts, temporary service unavailability, or database connection issues.
Characteristics¶
- Transient Fault Handling: Only retries operations that might succeed on retry
- Exponential Backoff: Increases delay between retries to avoid overwhelming services
- Jitter: Adds randomness to prevent thundering herd problems
- Idempotency: Ensures operations are safe to retry
Implementation Example (Polly)¶
```csharp
// Retry policy for HTTP clients
private static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
    return Policy<HttpResponseMessage>
        .Handle<HttpRequestException>()
        .OrResult(r => (int)r.StatusCode >= 500)
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: retryAttempt =>
                TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)), // Exponential backoff: 2s, 4s, 8s
            onRetry: (outcome, timespan, retryCount, context) =>
            {
                // GetLogger() is a custom extension that reads an ILogger previously
                // stored in the Polly Context; it is not part of Polly itself
                var logger = context.GetLogger();
                logger?.LogWarning(
                    "Retry {RetryCount} after {Delay}s",
                    retryCount,
                    timespan.TotalSeconds);
            });
}

// Usage
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(GetRetryPolicy());
```
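The characteristics above also call for jitter, which the basic policy does not add. A minimal sketch of the same retry with randomized delays, assuming plain Polly (Polly.Contrib.WaitAndRetry offers ready-made jittered backoff generators if the template references it):

```csharp
// Retry with exponential backoff plus jitter to avoid synchronized retry storms
private static IAsyncPolicy<HttpResponseMessage> GetJitteredRetryPolicy()
{
    var jitterer = new Random();

    return Policy<HttpResponseMessage>
        .Handle<HttpRequestException>()
        .OrResult(r => (int)r.StatusCode >= 500)
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: retryAttempt =>
                TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))        // 2s, 4s, 8s
                + TimeSpan.FromMilliseconds(jitterer.Next(0, 1000)));  // plus up to 1s of jitter
}
```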
Best Practices¶
- Use Exponential Backoff: Prevents overwhelming failing services
- Limit Retry Count: Avoid infinite retry loops
- Retry Only Transient Errors: Don't retry 4xx client errors
- Ensure Idempotency: Operations must be safe to retry
- Log Retry Attempts: For observability and debugging
Circuit Breaker Pattern¶
The Circuit Breaker pattern prevents repeated attempts to call a failing service, allowing it time to recover. It operates in three states: Closed (normal operation), Open (blocking requests), and Half-Open (testing recovery).
Characteristics¶
- Failure Threshold: Opens after a certain number of failures
- Cooldown Period: Blocks requests for a configured duration
- Recovery Testing: Periodically tests if service has recovered
- Fast Failure: Immediately rejects requests when circuit is open
Implementation Example (Polly)¶
```csharp
// Circuit breaker policy
// 'logger' is assumed to be an ILogger reachable here (e.g., a bootstrap/static logger);
// the policy is created once so every request through the client shares the circuit state
private static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
{
    return Policy<HttpResponseMessage>
        .Handle<HttpRequestException>()
        .OrResult(r => (int)r.StatusCode >= 500)
        .CircuitBreakerAsync(
            handledEventsAllowedBeforeBreaking: 5,
            durationOfBreak: TimeSpan.FromSeconds(30),
            onBreak: (result, duration) =>
            {
                // Circuit breaker opened
                logger.LogWarning(
                    "Circuit breaker opened for {Duration}s",
                    duration.TotalSeconds);
            },
            onReset: () =>
            {
                // Circuit breaker closed again
                logger.LogInformation("Circuit breaker reset");
            },
            onHalfOpen: () =>
            {
                // Circuit breaker allowing trial requests
                logger.LogInformation("Circuit breaker half-open");
            });
}

// Usage
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(GetCircuitBreakerPolicy());
```
Circuit Breaker States¶
| State | Description | Behavior |
|---|---|---|
| Closed | Normal operation | Requests flow through normally |
| Open | Service is failing | Requests are immediately rejected |
| Half-Open | Testing recovery | Limited requests allowed to test if service recovered |
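The policy above trips after a fixed number of consecutive handled failures. The Resilience configuration shown later also includes a MinimumThroughput setting, which lines up with Polly's advanced circuit breaker that trips on a failure rate measured over a sampling window; a hedged sketch with illustrative thresholds:

```csharp
// Advanced circuit breaker: break when at least 50% of calls fail within a
// 30-second window, but only if at least 10 calls were made in that window
private static IAsyncPolicy<HttpResponseMessage> GetAdvancedCircuitBreakerPolicy()
{
    return Policy<HttpResponseMessage>
        .Handle<HttpRequestException>()
        .OrResult(r => (int)r.StatusCode >= 500)
        .AdvancedCircuitBreakerAsync(
            failureThreshold: 0.5,                       // proportion of failures that trips the breaker
            samplingDuration: TimeSpan.FromSeconds(30),  // window over which failures are measured
            minimumThroughput: 10,                       // ignore windows with too few calls
            durationOfBreak: TimeSpan.FromSeconds(30));
}
```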
Best Practices¶
- Set Appropriate Thresholds: Balance between fast failure and giving services time to recover
- Monitor Circuit State: Log state changes for observability
- Combine with Retry: Use retry before circuit breaker
- Use Different Circuits: Separate circuits for different services
- Configure Cooldown Period: Allow services time to recover
Timeout Pattern¶
The Timeout pattern ensures that operations don't wait indefinitely for responses. It sets maximum time limits for operations, aborting them if the limit is exceeded.
Characteristics¶
- Prevents Resource Exhaustion: Avoids threads/connections waiting indefinitely
- Improves Responsiveness: Fails fast when services are slow
- Configurable Per Operation: Different timeouts for different operations
- Cancellation Support: Uses CancellationToken for proper cancellation
Implementation Example (Polly)¶
```csharp
// Timeout policy
// Pessimistic times out even delegates that ignore the CancellationToken;
// Optimistic (the default) relies on the delegate honouring co-operative cancellation
var timeoutPolicy = Policy.TimeoutAsync<HttpResponseMessage>(
    TimeSpan.FromSeconds(10),
    TimeoutStrategy.Pessimistic);

// Usage
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(timeoutPolicy);
```
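When the limit is exceeded, Polly surfaces a TimeoutRejectedException. A minimal sketch of handling it at the call site (the endpoint path is illustrative):

```csharp
using Polly.Timeout;

try
{
    var response = await httpClient.GetAsync("/api/payments/123", cancellationToken);
    response.EnsureSuccessStatusCode();
}
catch (TimeoutRejectedException ex)
{
    // Surface a meaningful message instead of letting the raw timeout bubble up
    logger.LogWarning(ex, "Payment lookup timed out");
    throw;
}
```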
Best Practices¶
- Set Realistic Timeouts: Balance between user experience and service capabilities
- Use Different Timeouts: Different operations may need different timeouts
- Handle TimeoutExceptions: Provide meaningful error messages
- Combine with Retry: Retry after timeout (if appropriate)
- Monitor Timeout Rates: High timeout rates indicate performance issues
Fallback Pattern¶
The Fallback pattern provides alternative responses or behaviors when primary operations fail. It ensures applications can degrade gracefully instead of crashing.
Characteristics¶
- Graceful Degradation: Provides default responses when services fail
- User Experience: Maintains functionality even when dependencies fail
- Caching: Can use cached data as fallback
- Default Values: Provides sensible defaults when data unavailable
Implementation Example (Polly)¶
```csharp
// Fallback policy
// 'defaultResponse' is a sensible default DTO defined by the calling code
var fallbackPolicy = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => !r.IsSuccessStatusCode)
    .FallbackAsync(async (cancellationToken) =>
    {
        // Return a cached or default response instead of failing
        return new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent(JsonSerializer.Serialize(defaultResponse))
        };
    });

// Usage
services.AddHttpClient<UserServiceClient>()
    .AddPolicyHandler(fallbackPolicy);
```
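Fallbacks frequently return the last known good value rather than a static default. A minimal sketch using IMemoryCache, assuming a cache entry is written on every successful call; the cache key and status handling are illustrative:

```csharp
// Fallback that serves the last successfully retrieved payload from IMemoryCache
var cachedFallbackPolicy = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => !r.IsSuccessStatusCode)
    .FallbackAsync(cancellationToken =>
    {
        HttpResponseMessage fallback;
        if (memoryCache.TryGetValue("user-profile:last-good", out string cachedJson))
        {
            // Serve stale-but-usable data from the cache
            fallback = new HttpResponseMessage(HttpStatusCode.OK)
            {
                Content = new StringContent(cachedJson)
            };
        }
        else
        {
            // Nothing cached yet: signal degraded mode explicitly
            fallback = new HttpResponseMessage(HttpStatusCode.ServiceUnavailable);
        }

        return Task.FromResult(fallback);
    });
```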
Best Practices¶
- Provide Meaningful Fallbacks: Return useful default data
- Use Cached Data: Fallback to cached data when available
- Log Fallback Usage: Monitor when fallbacks are triggered
- Test Fallback Paths: Ensure fallbacks work correctly
- Avoid Masking Issues: Don't hide systemic problems with fallbacks
Bulkhead Pattern¶
The Bulkhead pattern isolates resources (threads, connections, memory) to prevent failures in one part of the system from cascading to others.
Characteristics¶
- Resource Isolation: Limits concurrent operations per service
- Failure Containment: Prevents one failing service from consuming all resources
- Priority Protection: Ensures critical operations have dedicated resources
- Configurable Limits: Adjustable based on service capacity
Implementation Example (Polly)¶
```csharp
// Bulkhead policy (generic so it can be attached to an HttpClient pipeline)
var bulkheadPolicy = Policy
    .BulkheadAsync<HttpResponseMessage>(
        maxParallelization: 10,   // at most 10 concurrent calls
        maxQueuingActions: 5,     // at most 5 callers waiting for a slot
        onBulkheadRejectedAsync: context =>
        {
            // 'logger' is assumed to be available at composition time
            logger.LogWarning("Bulkhead rejected request");
            return Task.CompletedTask;
        });

// Usage
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(bulkheadPolicy);
```
Best Practices¶
- Isolate Critical Resources: Protect high-priority operations
- Set Appropriate Limits: Balance between isolation and resource utilization
- Monitor Bulkhead Rejections: Track when requests are rejected
- Use Separate Bulkheads: Different bulkheads for different services
- Combine with Circuit Breaker: Prevent resource exhaustion
Template-Specific Implementations¶
MassTransit Retry Policies¶
MassTransit provides built-in retry policies for message consumers:
```csharp
// MassTransitExtensions.cs
config.UseMessageRetry(retryConfig =>
{
    retryConfig.Interval(
        retryCount: 3,
        interval: TimeSpan.FromSeconds(5));
});
```
Configuration Options:
- Immediate Retry: Retry immediately without delay
- Interval Retry: Retry with fixed interval
- Exponential Retry: Retry with exponential backoff
- Exception Filters: Configure which exceptions trigger retries
Example:
```csharp
config.UseMessageRetry(retryConfig =>
{
    retryConfig.Exponential(
        retryLimit: 5,
        minInterval: TimeSpan.FromSeconds(1),
        maxInterval: TimeSpan.FromSeconds(30),
        intervalDelta: TimeSpan.FromSeconds(2));

    // Don't retry on validation exceptions
    retryConfig.Ignore<ValidationException>();
});
```
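Beyond in-memory retries, MassTransit can also reschedule a failing message for later redelivery before it ends up in the error/dead-letter queue. A hedged sketch combining second-level redelivery with the retry policy above; the queue name and intervals are illustrative, and delayed redelivery requires a message scheduler or a transport with native delay support:

```csharp
config.ReceiveEndpoint("order-events", endpoint =>
{
    // Second-level retry: requeue the message after longer delays
    endpoint.UseDelayedRedelivery(redelivery =>
        redelivery.Intervals(
            TimeSpan.FromMinutes(5),
            TimeSpan.FromMinutes(15),
            TimeSpan.FromMinutes(30)));

    // First-level retry: quick in-memory retries before redelivery kicks in
    endpoint.UseMessageRetry(retry =>
        retry.Interval(3, TimeSpan.FromSeconds(5)));
});
```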
NServiceBus Recoverability Policies¶
NServiceBus provides immediate and delayed retry policies:
```csharp
// NServiceBus configuration
var recoverability = endpointConfiguration.Recoverability();

// Immediate retries (fail fast)
recoverability.Immediate(immediate =>
{
    immediate.NumberOfRetries(3);
});

// Delayed retries (delay grows by TimeIncrease on each attempt)
recoverability.Delayed(delayed =>
{
    delayed.NumberOfRetries(5);
    delayed.TimeIncrease(TimeSpan.FromSeconds(10));
});

// Configure unrecoverable exceptions
recoverability.AddUnrecoverableException<ValidationException>();
```
Recoverability Flow:
- Immediate Retries: Fast retries for transient failures (e.g., database locks)
- Delayed Retries: Slower retries with a delay that increases on each attempt
- Error Queue: Messages that fail all retries are sent to the error queue (configured as shown below)
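The error queue itself is configured on the endpoint. A minimal sketch, using the conventional queue names from the NServiceBus documentation:

```csharp
// Messages that exhaust immediate and delayed retries are moved here
endpointConfiguration.SendFailedMessagesTo("error");

// Optionally audit successfully processed messages as well
endpointConfiguration.AuditProcessedMessagesTo("audit");
```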
Hangfire Automatic Retry¶
Hangfire provides automatic retry for background jobs:
```csharp
// HangFireExtensions.cs
GlobalJobFilters.Filters.Add(new AutomaticRetryAttribute
{
    Attempts = 3
});

// Job-level retry configuration
[AutomaticRetry(Attempts = 3, DelaysInSeconds = new[] { 10, 30, 60 })]
public async Task ProcessScheduledTask()
{
    // Job logic
}
```
Configuration Options:
- Attempts: Maximum number of retry attempts
- DelaysInSeconds: Array of delays between retries
- LogEvents: Whether to log retry events
- OnAttemptsExceeded: Action when all retries are exhausted
Example:
```csharp
[AutomaticRetry(
    Attempts = 5,
    DelaysInSeconds = new[] { 10, 30, 60, 120, 300 },
    LogEvents = true,
    OnAttemptsExceeded = AttemptsExceededAction.Delete)]
public async Task ProcessBatchJob()
{
    // Job logic
}
```
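The architecture overview also lists job failure handling. When every attempt is exhausted, Hangfire elects the Failed state for the job; a hedged sketch of a filter that logs that transition, following the state-filter pattern from Hangfire's documentation (the filter name and logging sink are illustrative):

```csharp
using Hangfire.Common;
using Hangfire.States;

// Logs every job that ends up in the Failed state so it can be alerted on
public class LogFailedJobAttribute : JobFilterAttribute, IElectStateFilter
{
    public void OnStateElection(ElectStateContext context)
    {
        if (context.CandidateState is FailedState failedState)
        {
            Console.WriteLine(
                $"Job {context.BackgroundJob.Id} failed: {failedState.Exception.Message}");
        }
    }
}

// Registered globally alongside the automatic retry filter
GlobalJobFilters.Filters.Add(new LogFailedJobAttribute());
```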
Database Connection Resilience¶
Database connections can be configured with retry and timeout policies:
```csharp
// Orleans connection configuration
connection.ConnectionRetryDelay = TimeSpan.FromSeconds(5);
connection.OpenConnectionTimeout = TimeSpan.FromSeconds(30);
```
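For services that access their store through Entity Framework Core (an assumption; the template may use a different data access layer), the provider's built-in execution strategy covers connection retry and command timeouts. A minimal sketch for SQL Server; the context type and connection string name are illustrative:

```csharp
services.AddDbContext<OrdersDbContext>(options =>
    options.UseSqlServer(
        configuration.GetConnectionString("OrdersDb"),
        sqlOptions =>
        {
            // Retry transient SQL errors with exponential backoff
            sqlOptions.EnableRetryOnFailure(
                maxRetryCount: 5,
                maxRetryDelay: TimeSpan.FromSeconds(30),
                errorNumbersToAdd: null);

            // Fail queries that run longer than 30 seconds
            sqlOptions.CommandTimeout(30);
        }));
```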
Best Practices:
- Connection Retry: Retry database connections with exponential backoff
- Query Timeouts: Set appropriate timeouts for database queries
- Connection Pooling: Use connection pooling to manage resources
- Health Checks: Monitor database connection health
Combining Resilience Patterns¶
Policy Composition¶
Resilience patterns are typically combined to provide comprehensive protection:
```csharp
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(GetRetryPolicy())           // Outer: retry first
    .AddPolicyHandler(GetCircuitBreakerPolicy())  // Middle: circuit breaker
    .AddPolicyHandler(timeoutPolicy)              // Inner: timeout
    .AddHeaderPropagation();
```
Execution Order (outermost to innermost):
- Retry: Retries failed operations
- Circuit Breaker: Stops requests if service is failing
- Timeout: Ensures operations complete within time limit
- Bulkhead: Limits concurrent operations (if configured)
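Outside of IHttpClientFactory (for example when wrapping a direct client call), the same ordering can be expressed explicitly with PolicyWrap, where the first policy passed is the outermost; a minimal sketch reusing the policies defined earlier:

```csharp
// Outermost → innermost: retry, then circuit breaker, then timeout
var resiliencePipeline = Policy.WrapAsync(
    GetRetryPolicy(),
    GetCircuitBreakerPolicy(),
    timeoutPolicy);

// Execute an arbitrary HTTP call through the combined pipeline
var response = await resiliencePipeline.ExecuteAsync(
    ct => httpClient.GetAsync("/api/payments/123", ct),
    CancellationToken.None);
```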
Policy Selection¶
Different services may need different resilience strategies:
```csharp
// Critical service: Aggressive retry + circuit breaker
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(GetAggressiveRetryPolicy())
    .AddPolicyHandler(GetCircuitBreakerPolicy());

// Non-critical service: Light retry + timeout
services.AddHttpClient<NotificationServiceClient>()
    .AddPolicyHandler(GetLightRetryPolicy())
    .AddPolicyHandler(timeoutPolicy);

// High-volume service: Bulkhead + circuit breaker
services.AddHttpClient<SearchServiceClient>()
    .AddPolicyHandler(bulkheadPolicy)
    .AddPolicyHandler(GetCircuitBreakerPolicy());
```
Configuration¶
Resilience Options¶
Resilience policies can be configured via appsettings.json:
```json
{
  "Resilience": {
    "HttpClient": {
      "Retry": {
        "MaxRetries": 3,
        "BaseDelaySeconds": 2,
        "MaxDelaySeconds": 30
      },
      "CircuitBreaker": {
        "FailureThreshold": 5,
        "DurationOfBreakSeconds": 30,
        "MinimumThroughput": 10
      },
      "Timeout": {
        "TimeoutSeconds": 10
      }
    },
    "MassTransit": {
      "Retry": {
        "RetryCount": 3,
        "IntervalSeconds": 5
      }
    },
    "Hangfire": {
      "AutomaticRetries": {
        "MaximumAttempts": 3
      }
    }
  }
}
```
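A hedged sketch of how such settings can be bound and fed into the policies above; the options class name is illustrative rather than the template's actual type, and `configuration` is the application's IConfiguration:

```csharp
// Illustrative options class mirroring the "Resilience:HttpClient:Retry" section
public sealed class HttpRetryOptions
{
    public int MaxRetries { get; set; } = 3;
    public int BaseDelaySeconds { get; set; } = 2;
    public int MaxDelaySeconds { get; set; } = 30;
}

// Bind the section and build the retry policy from it
var retryOptions = configuration
    .GetSection("Resilience:HttpClient:Retry")
    .Get<HttpRetryOptions>() ?? new HttpRetryOptions();

var configuredRetryPolicy = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => (int)r.StatusCode >= 500)
    .WaitAndRetryAsync(
        retryOptions.MaxRetries,
        attempt => TimeSpan.FromSeconds(Math.Min(
            Math.Pow(retryOptions.BaseDelaySeconds, attempt),
            retryOptions.MaxDelaySeconds)));
```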
Environment-Specific Configuration¶
Different environments may require different resilience strategies:
Development:
```json
{
  "Resilience": {
    "HttpClient": {
      "Retry": {
        "MaxRetries": 1
      },
      "CircuitBreaker": {
        "FailureThreshold": 3
      }
    }
  }
}
```
Production:
```json
{
  "Resilience": {
    "HttpClient": {
      "Retry": {
        "MaxRetries": 5
      },
      "CircuitBreaker": {
        "FailureThreshold": 10
      }
    }
  }
}
```
Best Practices¶
Do's¶
- Use Resilience Patterns for All External Dependencies
- Configure Appropriate Timeouts
- Use Exponential Backoff for Retries
- Monitor Resilience Metrics
- Ensure Idempotency

```csharp
// ✅ GOOD - Idempotent operation
public async Task<PaymentResult> ProcessPaymentAsync(PaymentRequest request)
{
    // Check if already processed (idempotency key)
    if (await IsAlreadyProcessed(request.IdempotencyKey))
    {
        return await GetExistingResult(request.IdempotencyKey);
    }

    // Process payment
    return await ProcessNewPayment(request);
}
```

- Combine Patterns Appropriately
Don'ts¶
- Don't Retry Non-Transient Errors
- Don't Use Infinite Retries
- Don't Ignore Circuit Breaker State
- Don't Use Hardcoded Timeouts
- Don't Skip Fallback Mechanisms
Observability and Monitoring¶
Metrics to Monitor¶
| Metric | Description | Alert Threshold |
|---|---|---|
| Retry Rate | Percentage of requests that required retries | > 10% |
| Circuit Breaker Open Events | Number of times circuit breaker opened | > 5/hour |
| Timeout Rate | Percentage of requests that timed out | > 5% |
| Fallback Usage | Number of times fallback was used | > 20/hour |
| Bulkhead Rejections | Number of requests rejected by bulkhead | > 50/hour |
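These counters can be emitted from the policy callbacks with System.Diagnostics.Metrics so they flow into the service's existing metrics pipeline; the meter and instrument names below are illustrative:

```csharp
using System.Diagnostics.Metrics;

// One meter for all resilience instruments in the service
private static readonly Meter ResilienceMeter = new("ConnectSoft.Resilience");

private static readonly Counter<long> RetryCounter =
    ResilienceMeter.CreateCounter<long>("resilience.retries");

private static readonly Counter<long> CircuitOpenCounter =
    ResilienceMeter.CreateCounter<long>("resilience.circuit_breaker.opened");

// Incremented from the policy callbacks, e.g.:
// onRetry: (outcome, delay, attempt, context) =>
//     RetryCounter.Add(1, new KeyValuePair<string, object?>("service", "payment"));
```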
Logging¶
Log resilience events for observability:
```csharp
onRetry: (outcome, timespan, retryCount, context) =>
{
    logger.LogWarning(
        "Retry {RetryCount} after {Delay}s for {Operation}",
        retryCount,
        timespan.TotalSeconds,
        context.OperationKey); // OperationKey identifies the call site via the Polly Context
},
onBreak: (result, duration) =>
{
    logger.LogError(
        "Circuit breaker opened for {Duration}s for {Service}",
        duration.TotalSeconds,
        serviceName); // 'serviceName' is assumed to be captured where the policy is built
},
onReset: () =>
{
    logger.LogInformation(
        "Circuit breaker reset for {Service}",
        serviceName);
}
```
Distributed Tracing¶
Resilience events should be included in distributed traces:
```csharp
// 'ActivitySource' is assumed to be a static field, e.g.
// private static readonly ActivitySource ActivitySource = new("ConnectSoft.HttpClient");
using var activity = ActivitySource.StartActivity("HttpClientRequest");
try
{
    var response = await policy.ExecuteAsync(async () =>
        await httpClient.GetAsync(endpoint));

    activity?.SetTag("http.status_code", (int)response.StatusCode);
    activity?.SetTag("resilience.retry_count", retryCount); // retryCount captured from the retry callback
}
catch (Exception ex)
{
    activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
    throw;
}
```
Troubleshooting¶
Issue: High Retry Rate¶
Symptoms: Large percentage of requests require retries.
Solutions:
1. Check downstream service health
2. Review timeout configurations
3. Investigate network issues
4. Consider increasing timeout values
5. Check if service is overloaded
Issue: Circuit Breaker Frequently Opening¶
Symptoms: Circuit breaker opens repeatedly.
Solutions:
1. Investigate root cause of failures
2. Review failure threshold settings
3. Check service health and capacity
4. Consider increasing cooldown period
5. Review retry policies
Issue: Timeouts Too Frequent¶
Symptoms: Many requests timing out.
Solutions:
1. Increase timeout values
2. Check service performance
3. Review network latency
4. Consider service scaling
5. Review query/operation complexity
Issue: Fallback Triggering Too Often¶
Symptoms: Fallback mechanisms frequently used.
Solutions:
1. Investigate why primary operations fail
2. Review service health
3. Check if fallback data is stale
4. Consider improving primary operation reliability
5. Review circuit breaker settings
Related Documentation¶
- HTTP Client: HTTP client usage with resilience patterns
- Exception Handling: Exception handling with retry policies
- MassTransit: Messaging resilience with MassTransit
- NServiceBus: Messaging resilience with NServiceBus
- Hangfire: Background job resilience with Hangfire
- Rate Limiting: Rate limiting as a resilience pattern
Summary¶
Resiliency in the ConnectSoft Microservice Template provides:
- ✅ Transient Fault Handling: Automatic retry with exponential backoff
- ✅ Circuit Breaking: Prevents cascading failures
- ✅ Timeout Management: Ensures operations complete within acceptable timeframes
- ✅ Fallback Mechanisms: Graceful degradation when services fail
- ✅ Resource Isolation: Prevents failures from cascading
- ✅ Messaging Resilience: Built-in retry policies for MassTransit and NServiceBus
- ✅ Background Job Resilience: Automatic retry for Hangfire jobs
- ✅ Observability: Comprehensive logging and monitoring
By following these patterns, teams can:
- Build Reliable Services: Services continue operating despite failures
- Improve User Experience: Graceful degradation maintains functionality
- Prevent Cascading Failures: Circuit breakers and bulkheads contain failures
- Optimize Resource Usage: Timeouts and bulkheads prevent resource exhaustion
- Enable Observability: Comprehensive logging and monitoring for resilience events
Resiliency is essential for building production-ready microservices that can handle the complexities and challenges of distributed systems while maintaining availability, reliability, and performance.