
Resiliency in ConnectSoft Microservice Template

Purpose & Overview

Resiliency is the ability of a system to recover gracefully and continue operating despite encountering failures, unexpected loads, or system disruptions. In the ConnectSoft Microservice Template, resiliency is a fundamental design principle that ensures microservices can handle transient faults, network issues, and service outages without compromising user experience or data integrity.

Resiliency encompasses:

  • Transient Fault Handling: Automatic retry of failed operations with exponential backoff
  • Circuit Breaking: Preventing cascading failures by stopping requests to failing services
  • Timeout Management: Ensuring operations complete within acceptable timeframes
  • Fallback Mechanisms: Providing alternative responses when primary operations fail
  • Resource Isolation: Preventing failures in one area from affecting others
  • Rate Limiting: Protecting services from being overwhelmed by traffic
  • Load Balancing: Distributing traffic across multiple service instances

Resiliency Philosophy

Resiliency is not about preventing failures—it's about designing systems that gracefully handle failures when they occur. The template implements proven patterns and best practices to ensure services remain available, responsive, and reliable even under adverse conditions. Every external dependency interaction should be protected with appropriate resilience strategies.

Architecture Overview

Resilience Layers

Application Layer
    ├── HTTP Client Resilience
    │   ├── Retry Policies
    │   ├── Circuit Breakers
    │   ├── Timeouts
    │   └── Fallback Responses
    ├── Messaging Resilience
    │   ├── MassTransit Retry Policies
    │   ├── NServiceBus Recoverability
    │   └── Dead Letter Queues
    ├── Background Job Resilience
    │   ├── Hangfire Automatic Retry
    │   └── Job Failure Handling
    └── Database Resilience
        ├── Connection Retry
        ├── Query Timeouts
        └── Transaction Handling

Resilience Pattern Flow

Request/Operation
    → Timeout Check
    → Retry Policy (if transient failure)
    → Circuit Breaker Check
    → Bulkhead Isolation
    → Execute Operation
    → Success or Failure Handling
    → Fallback (if configured)

Resilience Patterns

Retry Pattern

The Retry pattern automatically retries failed operations based on a predefined policy. It's particularly useful for handling transient faults such as network timeouts, temporary service unavailability, or database connection issues.

Characteristics

  • Transient Fault Handling: Only retries operations that might succeed on retry
  • Exponential Backoff: Increases delay between retries to avoid overwhelming services
  • Jitter: Adds randomness to retry delays to prevent thundering herd problems (see the jitter sketch after the implementation example below)
  • Idempotency: Requires that retried operations are safe to execute more than once

Implementation Example (Polly)

// Retry policy for HTTP clients
private static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
    return Policy<HttpResponseMessage>
        .Handle<HttpRequestException>()
        .OrResult(r => (int)r.StatusCode >= 500)
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: retryAttempt => 
                TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)), // Exponential backoff
            onRetry: (outcome, timespan, retryCount, context) =>
            {
                var logger = context.GetLogger(); // GetLogger(): helper assumed to resolve an ILogger stored on the Polly Context
                logger?.LogWarning(
                    "Retry {RetryCount} after {Delay}s",
                    retryCount,
                    timespan.TotalSeconds);
            });
}

// Usage
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(GetRetryPolicy());
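
The policy above uses plain exponential backoff. Where many callers retry at the same time, jitter spreads the retries out. A minimal sketch, assuming plain Polly and .NET 6+ (the formula and counts are illustrative, not the template's defaults):

// Exponential backoff plus random jitter so concurrent callers do not retry in lockstep
private static IAsyncPolicy<HttpResponseMessage> GetRetryPolicyWithJitter()
{
    return Policy<HttpResponseMessage>
        .Handle<HttpRequestException>()
        .OrResult(r => (int)r.StatusCode >= 500)
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: retryAttempt =>
                TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))             // exponential component
                + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 1000)));  // jitter component (Random.Shared, .NET 6+)
}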

Best Practices

  1. Use Exponential Backoff: Prevents overwhelming failing services
  2. Limit Retry Count: Avoid infinite retry loops
  3. Retry Only Transient Errors: Don't retry 4xx client errors
  4. Ensure Idempotency: Operations must be safe to retry
  5. Log Retry Attempts: For observability and debugging

Circuit Breaker Pattern

The Circuit Breaker pattern prevents repeated attempts to call a failing service, allowing it time to recover. It operates in three states: Closed (normal operation), Open (blocking requests), and Half-Open (testing recovery).

Characteristics

  • Failure Threshold: Opens after a certain number of failures
  • Cooldown Period: Blocks requests for a configured duration
  • Recovery Testing: Periodically tests if service has recovered
  • Fast Failure: Immediately rejects requests when circuit is open

Implementation Example (Polly)

// Circuit breaker policy
private static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
{
    return Policy<HttpResponseMessage>
        .Handle<HttpRequestException>()
        .OrResult(r => (int)r.StatusCode >= 500)
        .CircuitBreakerAsync(
            handledEventsAllowedBeforeBreaking: 5,
            durationOfBreak: TimeSpan.FromSeconds(30),
            onBreak: (result, duration) =>
            {
                // Log circuit breaker opened ('logger' is assumed to be an ILogger in scope,
                // e.g. a static logger or one resolved from the Polly Context as in the retry example)
                logger.LogWarning(
                    "Circuit breaker opened for {Duration}s",
                    duration.TotalSeconds);
            },
            onReset: () =>
            {
                // Log circuit breaker reset
                logger.LogInformation("Circuit breaker reset");
            },
            onHalfOpen: () =>
            {
                // Log circuit breaker half-open
                logger.LogInformation("Circuit breaker half-open");
            });
}

// Usage
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(GetCircuitBreakerPolicy());

Circuit Breaker States

State     | Description        | Behavior
----------|--------------------|--------------------------------------------------------------
Closed    | Normal operation   | Requests flow through normally
Open      | Service is failing | Requests are immediately rejected
Half-Open | Testing recovery   | A limited number of requests are allowed through to test whether the service has recovered

Best Practices

  1. Set Appropriate Thresholds: Balance between fast failure and giving services time to recover
  2. Monitor Circuit State: Log state changes for observability
  3. Combine with Retry: Use retry before circuit breaker
  4. Use Different Circuits: Separate circuits for different services
  5. Configure Cooldown Period: Allow services time to recover

Timeout Pattern

The Timeout pattern ensures that operations don't wait indefinitely for responses. It sets maximum time limits for operations, aborting them if the limit is exceeded.

Characteristics

  • Prevents Resource Exhaustion: Avoids threads/connections waiting indefinitely
  • Improves Responsiveness: Fails fast when services are slow
  • Configurable Per Operation: Different timeouts for different operations
  • Cancellation Support: Uses CancellationToken for proper cancellation

Implementation Example (Polly)

// Timeout policy
var timeoutPolicy = Policy.TimeoutAsync<HttpResponseMessage>(
    TimeSpan.FromSeconds(10),
    TimeoutStrategy.Pessimistic); // Pessimistic enforces the timeout even if the delegate does not observe cancellation

// Usage
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(timeoutPolicy);

Best Practices

  1. Set Realistic Timeouts: Balance between user experience and service capabilities
  2. Use Different Timeouts: Different operations may need different timeouts
  3. Handle Timeout Exceptions: Catch Polly's TimeoutRejectedException and surface a meaningful error (see the sketch after this list)
  4. Combine with Retry: Retry after timeout (if appropriate)
  5. Monitor Timeout Rates: High timeout rates indicate performance issues
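
Regarding practice 3, Polly's timeout policy surfaces an exceeded timeout as TimeoutRejectedException (Polly.Timeout namespace). A minimal sketch of translating it into a meaningful error; the endpoint, logger, and ServiceUnavailableException type below are illustrative:

try
{
    var response = await timeoutPolicy.ExecuteAsync(
        ct => httpClient.GetAsync("payments/123", ct),   // illustrative call
        CancellationToken.None);
    // ... use response ...
}
catch (TimeoutRejectedException)
{
    logger.LogError("Payment service did not respond within the configured timeout");
    // Translate into an error that is meaningful to callers (hypothetical exception type)
    throw new ServiceUnavailableException("Payment service timed out");
}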

Fallback Pattern

The Fallback pattern provides alternative responses or behaviors when primary operations fail. It ensures applications can degrade gracefully instead of crashing.

Characteristics

  • Graceful Degradation: Provides default responses when services fail
  • User Experience: Maintains functionality even when dependencies fail
  • Caching: Can use cached data as fallback
  • Default Values: Provides sensible defaults when data unavailable

Implementation Example (Polly)

// Fallback policy
var fallbackPolicy = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => !r.IsSuccessStatusCode)
    .FallbackAsync(async (cancellationToken) =>
    {
        // Return a cached or default response ('defaultResponse' is assumed to be defined by the caller)
        return new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent(JsonSerializer.Serialize(defaultResponse))
        };
    });

// Usage
services.AddHttpClient<UserServiceClient>()
    .AddPolicyHandler(fallbackPolicy);

Best Practices

  1. Provide Meaningful Fallbacks: Return useful default data
  2. Use Cached Data: Fallback to cached data when available
  3. Log Fallback Usage: Monitor when fallbacks are triggered
  4. Test Fallback Paths: Ensure fallbacks work correctly
  5. Avoid Masking Issues: Don't hide systemic problems with fallbacks

Bulkhead Pattern

The Bulkhead pattern isolates resources (threads, connections, memory) to prevent failures in one part of the system from cascading to others.

Characteristics

  • Resource Isolation: Limits concurrent operations per service
  • Failure Containment: Prevents one failing service from consuming all resources
  • Priority Protection: Ensures critical operations have dedicated resources
  • Configurable Limits: Adjustable based on service capacity

Implementation Example (Polly)

// Bulkhead policy (typed for HttpResponseMessage so it can be attached with AddPolicyHandler)
var bulkheadPolicy = Policy.BulkheadAsync<HttpResponseMessage>(
    maxParallelization: 10,
    maxQueuingActions: 5,
    onBulkheadRejectedAsync: context =>
    {
        logger.LogWarning("Bulkhead rejected request");
        return Task.CompletedTask;
    });

// Usage
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(bulkheadPolicy);

Best Practices

  1. Isolate Critical Resources: Protect high-priority operations
  2. Set Appropriate Limits: Balance between isolation and resource utilization
  3. Monitor Bulkhead Rejections: Track when requests are rejected
  4. Use Separate Bulkheads: Different bulkheads for different services
  5. Combine with Circuit Breaker: Prevent resource exhaustion

Template-Specific Implementations

MassTransit Retry Policies

MassTransit provides built-in retry policies for message consumers:

// MassTransitExtensions.cs
config.UseMessageRetry(retryConfig =>
{
    retryConfig.Interval(
        retryCount: 3,
        interval: TimeSpan.FromSeconds(5));
});

Configuration Options:

  • Immediate Retry: Retry immediately without delay
  • Interval Retry: Retry with fixed interval
  • Exponential Retry: Retry with exponential backoff
  • Exception Filters: Configure which exceptions trigger retries

Example:

config.UseMessageRetry(retryConfig =>
{
    retryConfig.Exponential(
        retryLimit: 5,
        minInterval: TimeSpan.FromSeconds(1),
        maxInterval: TimeSpan.FromSeconds(30),
        intervalDelta: TimeSpan.FromSeconds(2));

    // Don't retry on validation exceptions
    retryConfig.Ignore<ValidationException>();
});

NServiceBus Recoverability Policies

NServiceBus provides immediate and delayed retry policies:

// NServiceBus configuration
var recoverability = endpointConfiguration.Recoverability();

// Immediate retries (fail fast)
recoverability.Immediate(immediate =>
{
    immediate.NumberOfRetries(3);
});

// Delayed retries (delay grows on each attempt)
recoverability.Delayed(delayed =>
{
    delayed.NumberOfRetries(5);
    delayed.TimeIncrease(TimeSpan.FromSeconds(10));
});

// Configure unrecoverable exceptions
recoverability.AddUnrecoverableException<ValidationException>();

Recoverability Flow:

  1. Immediate Retries: Fast retries for transient failures (e.g., database locks)
  2. Delayed Retries: Slower retries whose delay increases by the configured TimeIncrease on each attempt
  3. Error Queue: Messages that fail all retries are sent to error queue
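
A minimal sketch of naming that error queue; "error" is NServiceBus's conventional default and is shown here as an assumption about the template's setup:

// Messages that exhaust both immediate and delayed retries are moved to this queue
endpointConfiguration.SendFailedMessagesTo("error");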

Hangfire Automatic Retry

Hangfire provides automatic retry for background jobs:

// HangFireExtensions.cs
GlobalJobFilters.Filters.Add(new AutomaticRetryAttribute 
{ 
    Attempts = 3 
});

// Job-level retry configuration
[AutomaticRetry(Attempts = 3, DelaysInSeconds = new[] { 10, 30, 60 })]
public async Task ProcessScheduledTask()
{
    // Job logic
}

Configuration Options:

  • Attempts: Maximum number of retry attempts
  • DelaysInSeconds: Array of delays between retries
  • LogEvents: Whether to log retry events
  • OnAttemptsExceeded: Action when all retries are exhausted

Example:

[AutomaticRetry(
    Attempts = 5,
    DelaysInSeconds = new[] { 10, 30, 60, 120, 300 },
    LogEvents = true,
    OnAttemptsExceeded = AttemptsExceededAction.Delete)]
public async Task ProcessBatchJob()
{
    // Job logic
}

Database Connection Resilience

Database connections can be configured with retry and timeout policies:

// Orleans connection configuration
connection.ConnectionRetryDelay = TimeSpan.FromSeconds(5);
connection.OpenConnectionTimeout = TimeSpan.FromSeconds(30);

Best Practices:

  1. Connection Retry: Retry database connections with exponential backoff
  2. Query Timeouts: Set appropriate timeouts for database queries
  3. Connection Pooling: Use connection pooling to manage resources
  4. Health Checks: Monitor database connection health
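
As an illustration of practices 1 and 2, if the service accesses SQL Server through EF Core (an assumption; the snippet above shows Orleans connection options), the provider's built-in retrying execution strategy can be enabled like this:

// Hedged sketch: EF Core + SQL Server execution strategy ('OrdersDbContext' and the connection string name are illustrative)
services.AddDbContext<OrdersDbContext>(options =>
    options.UseSqlServer(
        configuration.GetConnectionString("OrdersDb"),
        sql =>
        {
            sql.EnableRetryOnFailure(
                maxRetryCount: 5,
                maxRetryDelay: TimeSpan.FromSeconds(30),
                errorNumbersToAdd: null);   // retry transient SQL errors with backoff
            sql.CommandTimeout(30);         // query timeout in seconds
        }));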

Combining Resilience Patterns

Policy Composition

Resilience patterns are typically combined to provide comprehensive protection:

services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(GetRetryPolicy())           // Outer: Retry first
    .AddPolicyHandler(GetCircuitBreakerPolicy())  // Middle: Circuit breaker
    .AddPolicyHandler(timeoutPolicy)              // Inner: Timeout
    .AddHeaderPropagation();

Execution Order (outermost to innermost):

  1. Retry: Retries failed operations
  2. Circuit Breaker: Stops requests if service is failing
  3. Timeout: Ensures operations complete within time limit
  4. Bulkhead: Limits concurrent operations (if configured)

Policy Selection

Different services may need different resilience strategies:

// Critical service: Aggressive retry + circuit breaker
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(GetAggressiveRetryPolicy())
    .AddPolicyHandler(GetCircuitBreakerPolicy());

// Non-critical service: Light retry + timeout
services.AddHttpClient<NotificationServiceClient>()
    .AddPolicyHandler(GetLightRetryPolicy())
    .AddPolicyHandler(timeoutPolicy);

// High-volume service: Bulkhead + circuit breaker
services.AddHttpClient<SearchServiceClient>()
    .AddPolicyHandler(bulkheadPolicy)
    .AddPolicyHandler(GetCircuitBreakerPolicy());
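
The helper names above (GetAggressiveRetryPolicy, GetLightRetryPolicy) are not defined in this section; as an assumption about their shape, a "light" retry for non-critical, idempotent calls might look like:

// Hedged sketch: a single quick retry for non-critical calls
private static IAsyncPolicy<HttpResponseMessage> GetLightRetryPolicy()
{
    return Policy<HttpResponseMessage>
        .Handle<HttpRequestException>()
        .OrResult(r => (int)r.StatusCode >= 500)
        .WaitAndRetryAsync(
            retryCount: 1,
            sleepDurationProvider: _ => TimeSpan.FromMilliseconds(200));
}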

Configuration

Resilience Options

Resilience policies can be configured via appsettings.json:

{
  "Resilience": {
    "HttpClient": {
      "Retry": {
        "MaxRetries": 3,
        "BaseDelaySeconds": 2,
        "MaxDelaySeconds": 30
      },
      "CircuitBreaker": {
        "FailureThreshold": 5,
        "DurationOfBreakSeconds": 30,
        "MinimumThroughput": 10
      },
      "Timeout": {
        "TimeoutSeconds": 10
      }
    },
    "MassTransit": {
      "Retry": {
        "RetryCount": 3,
        "IntervalSeconds": 5
      }
    },
    "Hangfire": {
      "AutomaticRetries": {
        "MaximumAttempts": 3
      }
    }
  }
}
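
A minimal sketch of feeding these settings into a policy, assuming a hand-rolled options class (the class name and binding code below are illustrative, not part of the template):

// Hypothetical options class mirroring the "Resilience:HttpClient:Retry" section above
public sealed class HttpRetryOptions
{
    public int MaxRetries { get; set; } = 3;
    public int BaseDelaySeconds { get; set; } = 2;
    public int MaxDelaySeconds { get; set; } = 30;
}

// Composition root: bind the section and build the retry policy from it
var retryOptions = configuration
    .GetSection("Resilience:HttpClient:Retry")
    .Get<HttpRetryOptions>() ?? new HttpRetryOptions();

var configuredRetryPolicy = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => (int)r.StatusCode >= 500)
    .WaitAndRetryAsync(
        retryOptions.MaxRetries,
        attempt => TimeSpan.FromSeconds(Math.Min(
            retryOptions.BaseDelaySeconds * Math.Pow(2, attempt - 1),   // exponential backoff from the base delay
            retryOptions.MaxDelaySeconds)));                            // capped at the configured maximum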

Environment-Specific Configuration

Different environments may require different resilience strategies:

Development:

{
  "Resilience": {
    "HttpClient": {
      "Retry": {
        "MaxRetries": 1
      },
      "CircuitBreaker": {
        "FailureThreshold": 3
      }
    }
  }
}

Production:

{
  "Resilience": {
    "HttpClient": {
      "Retry": {
        "MaxRetries": 5
      },
      "CircuitBreaker": {
        "FailureThreshold": 10
      }
    }
  }
}

Best Practices

Do's

  1. Use Resilience Patterns for All External Dependencies

    // ✅ GOOD - Protected HTTP client
    services.AddHttpClient<PaymentServiceClient>()
        .AddPolicyHandler(GetRetryPolicy())
        .AddPolicyHandler(GetCircuitBreakerPolicy());
    

  2. Configure Appropriate Timeouts

    // ✅ GOOD - Realistic timeout
    var timeout = Policy.TimeoutAsync<HttpResponseMessage>(
        TimeSpan.FromSeconds(10));
    

  3. Use Exponential Backoff for Retries

    // ✅ GOOD - Exponential backoff
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: retryAttempt => 
            TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)));
    

  4. Monitor Resilience Metrics

    // ✅ GOOD - Log retry attempts
    onRetry: (outcome, timespan, retryCount, context) =>
    {
        logger.LogWarning("Retry {RetryCount}", retryCount);
    });
    

  5. Ensure Idempotency

    // ✅ GOOD - Idempotent operation
    public async Task ProcessPaymentAsync(PaymentRequest request)
    {
        // Check if already processed (idempotency key)
        if (await IsAlreadyProcessed(request.IdempotencyKey))
        {
            return await GetExistingResult(request.IdempotencyKey);
        }
    
        // Process payment
        return await ProcessNewPayment(request);
    }
    

  6. Combine Patterns Appropriately

    // ✅ GOOD - Layered resilience
    services.AddHttpClient<ServiceClient>()
        .AddPolicyHandler(retryPolicy)      // Outer
        .AddPolicyHandler(circuitBreaker)   // Middle
        .AddPolicyHandler(timeoutPolicy);   // Inner
    

Don'ts

  1. Don't Retry Non-Transient Errors

    // ❌ BAD - Retrying client errors
    .HandleResult(r => (int)r.StatusCode >= 400)  // Includes 4xx errors
    
    // ✅ GOOD - Only retry server errors
    .HandleResult(r => (int)r.StatusCode >= 500)
    

  2. Don't Use Infinite Retries

    // ❌ BAD - No retry limit
    .RetryForever()
    
    // ✅ GOOD - Limited retries
    .WaitAndRetryAsync(retryCount: 3, ...)
    

  3. Don't Ignore Circuit Breaker State

    // ❌ BAD - No circuit breaker monitoring
    .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));
    
    // ✅ GOOD - Monitor state changes
    .CircuitBreakerAsync(
        5, 
        TimeSpan.FromSeconds(30),
        onBreak: (result, duration) => logger.LogWarning(...),
        onReset: () => logger.LogInformation(...));
    

  4. Don't Use Hardcoded Timeouts

    // ❌ BAD - Hardcoded timeout
    var timeout = TimeSpan.FromSeconds(5);
    
    // ✅ GOOD - Configurable timeout
    var timeout = TimeSpan.FromSeconds(
        configuration.GetValue<int>("Resilience:HttpClient:Timeout:TimeoutSeconds"));
    

  5. Don't Skip Fallback Mechanisms

    // ❌ BAD - No fallback
    var response = await httpClient.GetAsync(...);
    
    // ✅ GOOD - Fallback to cached data
    try
    {
        return await httpClient.GetAsync(...);
    }
    catch
    {
        return await GetCachedData();
    }
    

Observability and Monitoring

Metrics to Monitor

Metric                      | Description                                  | Alert Threshold
----------------------------|----------------------------------------------|----------------
Retry Rate                  | Percentage of requests that required retries | > 10%
Circuit Breaker Open Events | Number of times the circuit breaker opened   | > 5/hour
Timeout Rate                | Percentage of requests that timed out        | > 5%
Fallback Usage              | Number of times the fallback was used        | > 20/hour
Bulkhead Rejections         | Number of requests rejected by the bulkhead  | > 50/hour
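
These counters can be emitted directly from the policy callbacks. A minimal sketch using System.Diagnostics.Metrics; the meter and counter names are illustrative, not part of the template:

using System.Diagnostics.Metrics;

// Illustrative metric instruments
internal static class ResilienceMetrics
{
    private static readonly Meter ResilienceMeter = new("ConnectSoft.Resilience");

    public static readonly Counter<long> Retries =
        ResilienceMeter.CreateCounter<long>("resilience.retries");

    public static readonly Counter<long> CircuitBreakerOpened =
        ResilienceMeter.CreateCounter<long>("resilience.circuit_breaker.opened");
}

// Recording events from the existing Polly callbacks
onRetry: (outcome, timespan, retryCount, context) =>
{
    ResilienceMetrics.Retries.Add(
        1, new KeyValuePair<string, object?>("operation", context.OperationKey));
},

onBreak: (result, duration) =>
{
    ResilienceMetrics.CircuitBreakerOpened.Add(1);
}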

Logging

Log resilience events for observability:

onRetry: (outcome, timespan, retryCount, context) =>
{
    logger.LogWarning(
        "Retry {RetryCount} after {Delay}s for {Service}",
        retryCount,
        timespan.TotalSeconds,
        context.ServiceName);
},

onBreak: (result, duration) =>
{
    logger.LogError(
        "Circuit breaker opened for {Duration}s for {Service}",
        duration.TotalSeconds,
        serviceName);
},

onReset: () =>
{
    logger.LogInformation(
        "Circuit breaker reset for {Service}",
        serviceName);
}

Distributed Tracing

Resilience events should be included in distributed traces:

using var activity = ActivitySource.StartActivity("HttpClientRequest");

try
{
    var response = await policy.ExecuteAsync(async () =>
        await httpClient.GetAsync(endpoint));

    activity?.SetTag("http.status_code", (int)response.StatusCode);
    activity?.SetTag("resilience.retry_count", retryCount);
}
catch (Exception ex)
{
    activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
    throw;
}

Troubleshooting

Issue: High Retry Rate

Symptoms: A large percentage of requests require retries.

Solutions:

  1. Check downstream service health
  2. Review timeout configurations
  3. Investigate network issues
  4. Consider increasing timeout values
  5. Check if the service is overloaded

Issue: Circuit Breaker Frequently Opening

Symptoms: Circuit breaker opens repeatedly.

Solutions:

  1. Investigate root cause of failures
  2. Review failure threshold settings
  3. Check service health and capacity
  4. Consider increasing cooldown period
  5. Review retry policies

Issue: Timeouts Too Frequent

Symptoms: Many requests are timing out.

Solutions:

  1. Increase timeout values
  2. Check service performance
  3. Review network latency
  4. Consider service scaling
  5. Review query/operation complexity

Issue: Fallback Triggering Too Often

Symptoms: Fallback mechanisms are frequently used.

Solutions:

  1. Investigate why primary operations fail
  2. Review service health
  3. Check whether fallback data is stale
  4. Consider improving primary operation reliability
  5. Review circuit breaker settings

Summary

Resiliency in the ConnectSoft Microservice Template provides:

  • Transient Fault Handling: Automatic retry with exponential backoff
  • Circuit Breaking: Prevents cascading failures
  • Timeout Management: Ensures operations complete within acceptable timeframes
  • Fallback Mechanisms: Graceful degradation when services fail
  • Resource Isolation: Prevents failures from cascading
  • Messaging Resilience: Built-in retry policies for MassTransit and NServiceBus
  • Background Job Resilience: Automatic retry for Hangfire jobs
  • Observability: Comprehensive logging and monitoring

By following these patterns, teams can:

  • Build Reliable Services: Services continue operating despite failures
  • Improve User Experience: Graceful degradation maintains functionality
  • Prevent Cascading Failures: Circuit breakers and bulkheads contain failures
  • Optimize Resource Usage: Timeouts and bulkheads prevent resource exhaustion
  • Enable Observability: Comprehensive logging and monitoring for resilience events

Resiliency is essential for building production-ready microservices that can handle the complexities and challenges of distributed systems while maintaining availability, reliability, and performance.