Resiliency in ConnectSoft Microservice Template¶
Purpose & Overview¶
Resiliency is the ability of a system to recover gracefully and continue operating despite encountering failures, unexpected loads, or system disruptions. In the ConnectSoft Microservice Template, resiliency is a fundamental design principle that ensures microservices can handle transient faults, network issues, and service outages without compromising user experience or data integrity.
Resiliency encompasses:
- Transient Fault Handling: Automatic retry of failed operations with exponential backoff
- Circuit Breaking: Preventing cascading failures by stopping requests to failing services
- Timeout Management: Ensuring operations complete within acceptable timeframes
- Fallback Mechanisms: Providing alternative responses when primary operations fail
- Resource Isolation: Preventing failures in one area from affecting others
- Rate Limiting: Protecting services from being overwhelmed by traffic
- Load Balancing: Distributing traffic across multiple service instances
Resiliency Philosophy
Resiliency is not about preventing failures—it's about designing systems that gracefully handle failures when they occur. The template implements proven patterns and best practices to ensure services remain available, responsive, and reliable even under adverse conditions. Every external dependency interaction should be protected with appropriate resilience strategies.
Architecture Overview¶
Resilience Layers¶
```
Application Layer
├── HTTP Client Resilience
│   ├── Retry Policies
│   ├── Circuit Breakers
│   ├── Timeouts
│   └── Fallback Responses
├── Messaging Resilience
│   ├── MassTransit Retry Policies
│   ├── NServiceBus Recoverability
│   └── Dead Letter Queues
├── Background Job Resilience
│   ├── Hangfire Automatic Retry
│   └── Job Failure Handling
└── Database Resilience
    ├── Connection Retry
    ├── Query Timeouts
    └── Transaction Handling
```
Resilience Pattern Flow¶
```
Request/Operation
    ↓
Timeout Check
    ↓
Retry Policy (if transient failure)
    ↓
Circuit Breaker Check
    ↓
Bulkhead Isolation
    ↓
Execute Operation
    ↓
Success or Failure Handling
    ↓
Fallback (if configured)
```
Resilience Patterns¶
Retry Pattern¶
The Retry pattern automatically retries failed operations based on a predefined policy. It's particularly useful for handling transient faults such as network timeouts, temporary service unavailability, or database connection issues.
Characteristics¶
- Transient Fault Handling: Only retries operations that might succeed on retry
- Exponential Backoff: Increases delay between retries to avoid overwhelming services
- Jitter: Adds randomness to prevent thundering herd problems
- Idempotency: Ensures operations are safe to retry
Implementation Example (Polly)¶
```csharp
// Retry policy for HTTP clients
private static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
    return Policy<HttpResponseMessage>
        .Handle<HttpRequestException>()
        .OrResult(r => (int)r.StatusCode >= 500)
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: retryAttempt =>
                TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)), // Exponential backoff: 2s, 4s, 8s
            onRetry: (outcome, timespan, retryCount, context) =>
            {
                // GetLogger() is a custom extension that reads an ILogger previously
                // stored in the Polly Context; it is not part of Polly itself
                var logger = context.GetLogger();
                logger?.LogWarning(
                    "Retry {RetryCount} after {Delay}s",
                    retryCount,
                    timespan.TotalSeconds);
            });
}

// Usage
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(GetRetryPolicy());
```
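The characteristics above also call for jitter, which the basic policy does not add. A minimal sketch of the same retry with randomized delays, assuming plain Polly (Polly.Contrib.WaitAndRetry offers ready-made jittered backoff generators if the template references it):

```csharp
// Retry with exponential backoff plus jitter to avoid synchronized retry storms
private static IAsyncPolicy<HttpResponseMessage> GetJitteredRetryPolicy()
{
    var jitterer = new Random();

    return Policy<HttpResponseMessage>
        .Handle<HttpRequestException>()
        .OrResult(r => (int)r.StatusCode >= 500)
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: retryAttempt =>
                TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))        // 2s, 4s, 8s
                + TimeSpan.FromMilliseconds(jitterer.Next(0, 1000)));  // plus up to 1s of jitter
}
```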
Best Practices¶
- Use Exponential Backoff: Prevents overwhelming failing services
- Limit Retry Count: Avoid infinite retry loops
- Retry Only Transient Errors: Don't retry 4xx client errors
- Ensure Idempotency: Operations must be safe to retry
- Log Retry Attempts: For observability and debugging
Circuit Breaker Pattern¶
The Circuit Breaker pattern prevents repeated attempts to call a failing service, allowing it time to recover. It operates in three states: Closed (normal operation), Open (blocking requests), and Half-Open (testing recovery).
Characteristics¶
- Failure Threshold: Opens after a certain number of failures
- Cooldown Period: Blocks requests for a configured duration
- Recovery Testing: Periodically tests if service has recovered
- Fast Failure: Immediately rejects requests when circuit is open
Implementation Example (Polly)¶
```csharp
// Circuit breaker policy
// 'logger' is assumed to be an ILogger reachable here (e.g., a bootstrap/static logger);
// the policy is created once so every request through the client shares the circuit state
private static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
{
    return Policy<HttpResponseMessage>
        .Handle<HttpRequestException>()
        .OrResult(r => (int)r.StatusCode >= 500)
        .CircuitBreakerAsync(
            handledEventsAllowedBeforeBreaking: 5,
            durationOfBreak: TimeSpan.FromSeconds(30),
            onBreak: (result, duration) =>
            {
                // Circuit breaker opened
                logger.LogWarning(
                    "Circuit breaker opened for {Duration}s",
                    duration.TotalSeconds);
            },
            onReset: () =>
            {
                // Circuit breaker closed again
                logger.LogInformation("Circuit breaker reset");
            },
            onHalfOpen: () =>
            {
                // Circuit breaker allowing trial requests
                logger.LogInformation("Circuit breaker half-open");
            });
}

// Usage
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(GetCircuitBreakerPolicy());
```
Circuit Breaker States¶
| State | Description | Behavior |
|---|---|---|
| Closed | Normal operation | Requests flow through normally |
| Open | Service is failing | Requests are immediately rejected |
| Half-Open | Testing recovery | Limited requests allowed to test if service recovered |
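The policy above trips after a fixed number of consecutive handled failures. The Resilience configuration shown later also includes a MinimumThroughput setting, which lines up with Polly's advanced circuit breaker that trips on a failure rate measured over a sampling window; a hedged sketch with illustrative thresholds:

```csharp
// Advanced circuit breaker: break when at least 50% of calls fail within a
// 30-second window, but only if at least 10 calls were made in that window
private static IAsyncPolicy<HttpResponseMessage> GetAdvancedCircuitBreakerPolicy()
{
    return Policy<HttpResponseMessage>
        .Handle<HttpRequestException>()
        .OrResult(r => (int)r.StatusCode >= 500)
        .AdvancedCircuitBreakerAsync(
            failureThreshold: 0.5,                       // proportion of failures that trips the breaker
            samplingDuration: TimeSpan.FromSeconds(30),  // window over which failures are measured
            minimumThroughput: 10,                       // ignore windows with too few calls
            durationOfBreak: TimeSpan.FromSeconds(30));
}
```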
Best Practices¶
- Set Appropriate Thresholds: Balance between fast failure and giving services time to recover
- Monitor Circuit State: Log state changes for observability
- Combine with Retry: Use retry before circuit breaker
- Use Different Circuits: Separate circuits for different services
- Configure Cooldown Period: Allow services time to recover
Timeout Pattern¶
The Timeout pattern ensures that operations don't wait indefinitely for responses. It sets maximum time limits for operations, aborting them if the limit is exceeded.
Characteristics¶
- Prevents Resource Exhaustion: Avoids threads/connections waiting indefinitely
- Improves Responsiveness: Fails fast when services are slow
- Configurable Per Operation: Different timeouts for different operations
- Cancellation Support: Uses CancellationToken for proper cancellation
Implementation Example (Polly)¶
```csharp
// Timeout policy
// Pessimistic times out even delegates that ignore the CancellationToken;
// Optimistic (the default) relies on the delegate honouring co-operative cancellation
var timeoutPolicy = Policy.TimeoutAsync<HttpResponseMessage>(
    TimeSpan.FromSeconds(10),
    TimeoutStrategy.Pessimistic);

// Usage
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(timeoutPolicy);
```
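When the limit is exceeded, Polly surfaces a TimeoutRejectedException. A minimal sketch of handling it at the call site (the endpoint path is illustrative):

```csharp
using Polly.Timeout;

try
{
    var response = await httpClient.GetAsync("/api/payments/123", cancellationToken);
    response.EnsureSuccessStatusCode();
}
catch (TimeoutRejectedException ex)
{
    // Surface a meaningful message instead of letting the raw timeout bubble up
    logger.LogWarning(ex, "Payment lookup timed out");
    throw;
}
```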
Best Practices¶
- Set Realistic Timeouts: Balance between user experience and service capabilities
- Use Different Timeouts: Different operations may need different timeouts
- Handle TimeoutExceptions: Provide meaningful error messages
- Combine with Retry: Retry after timeout (if appropriate)
- Monitor Timeout Rates: High timeout rates indicate performance issues
Fallback Pattern¶
The Fallback pattern provides alternative responses or behaviors when primary operations fail. It ensures applications can degrade gracefully instead of crashing.
Characteristics¶
- Graceful Degradation: Provides default responses when services fail
- User Experience: Maintains functionality even when dependencies fail
- Caching: Can use cached data as fallback
- Default Values: Provides sensible defaults when data unavailable
Implementation Example (Polly)¶
```csharp
// Fallback policy
// 'defaultResponse' is a sensible default DTO defined by the calling code
var fallbackPolicy = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => !r.IsSuccessStatusCode)
    .FallbackAsync(async (cancellationToken) =>
    {
        // Return a cached or default response instead of failing
        return new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent(JsonSerializer.Serialize(defaultResponse))
        };
    });

// Usage
services.AddHttpClient<UserServiceClient>()
    .AddPolicyHandler(fallbackPolicy);
```
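Fallbacks frequently return the last known good value rather than a static default. A minimal sketch using IMemoryCache, assuming a cache entry is written on every successful call; the cache key and status handling are illustrative:

```csharp
// Fallback that serves the last successfully retrieved payload from IMemoryCache
var cachedFallbackPolicy = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => !r.IsSuccessStatusCode)
    .FallbackAsync(cancellationToken =>
    {
        HttpResponseMessage fallback;
        if (memoryCache.TryGetValue("user-profile:last-good", out string cachedJson))
        {
            // Serve stale-but-usable data from the cache
            fallback = new HttpResponseMessage(HttpStatusCode.OK)
            {
                Content = new StringContent(cachedJson)
            };
        }
        else
        {
            // Nothing cached yet: signal degraded mode explicitly
            fallback = new HttpResponseMessage(HttpStatusCode.ServiceUnavailable);
        }

        return Task.FromResult(fallback);
    });
```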
Best Practices¶
- Provide Meaningful Fallbacks: Return useful default data
- Use Cached Data: Fallback to cached data when available
- Log Fallback Usage: Monitor when fallbacks are triggered
- Test Fallback Paths: Ensure fallbacks work correctly
- Avoid Masking Issues: Don't hide systemic problems with fallbacks
Bulkhead Pattern¶
The Bulkhead pattern isolates resources (threads, connections, memory) to prevent failures in one part of the system from cascading to others.
Characteristics¶
- Resource Isolation: Limits concurrent operations per service
- Failure Containment: Prevents one failing service from consuming all resources
- Priority Protection: Ensures critical operations have dedicated resources
- Configurable Limits: Adjustable based on service capacity
Implementation Example (Polly)¶
```csharp
// Bulkhead policy (generic so it can be attached to an HttpClient pipeline)
var bulkheadPolicy = Policy
    .BulkheadAsync<HttpResponseMessage>(
        maxParallelization: 10,   // at most 10 concurrent calls
        maxQueuingActions: 5,     // at most 5 callers waiting for a slot
        onBulkheadRejectedAsync: context =>
        {
            // 'logger' is assumed to be available at composition time
            logger.LogWarning("Bulkhead rejected request");
            return Task.CompletedTask;
        });

// Usage
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(bulkheadPolicy);
```
Best Practices¶
- Isolate Critical Resources: Protect high-priority operations
- Set Appropriate Limits: Balance between isolation and resource utilization
- Monitor Bulkhead Rejections: Track when requests are rejected
- Use Separate Bulkheads: Different bulkheads for different services
- Combine with Circuit Breaker: Prevent resource exhaustion
Template-Specific Implementations¶
MassTransit Retry Policies¶
MassTransit provides built-in retry policies for message consumers:
```csharp
// MassTransitExtensions.cs
config.UseMessageRetry(retryConfig =>
{
    retryConfig.Interval(
        retryCount: 3,
        interval: TimeSpan.FromSeconds(5));
});
```
Configuration Options:
- Immediate Retry: Retry immediately without delay
- Interval Retry: Retry with fixed interval
- Exponential Retry: Retry with exponential backoff
- Exception Filters: Configure which exceptions trigger retries
Example:
```csharp
config.UseMessageRetry(retryConfig =>
{
    retryConfig.Exponential(
        retryLimit: 5,
        minInterval: TimeSpan.FromSeconds(1),
        maxInterval: TimeSpan.FromSeconds(30),
        intervalDelta: TimeSpan.FromSeconds(2));

    // Don't retry on validation exceptions
    retryConfig.Ignore<ValidationException>();
});
```
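Beyond in-memory retries, MassTransit can also reschedule a failing message for later redelivery before it ends up in the error/dead-letter queue. A hedged sketch combining second-level redelivery with the retry policy above; the queue name and intervals are illustrative, and delayed redelivery requires a message scheduler or a transport with native delay support:

```csharp
config.ReceiveEndpoint("order-events", endpoint =>
{
    // Second-level retry: requeue the message after longer delays
    endpoint.UseDelayedRedelivery(redelivery =>
        redelivery.Intervals(
            TimeSpan.FromMinutes(5),
            TimeSpan.FromMinutes(15),
            TimeSpan.FromMinutes(30)));

    // First-level retry: quick in-memory retries before redelivery kicks in
    endpoint.UseMessageRetry(retry =>
        retry.Interval(3, TimeSpan.FromSeconds(5)));
});
```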
NServiceBus Recoverability Policies¶
NServiceBus provides immediate and delayed retry policies:
```csharp
// NServiceBus configuration
var recoverability = endpointConfiguration.Recoverability();

// Immediate retries (fail fast)
recoverability.Immediate(immediate =>
{
    immediate.NumberOfRetries(3);
});

// Delayed retries (delay grows by TimeIncrease on each attempt)
recoverability.Delayed(delayed =>
{
    delayed.NumberOfRetries(5);
    delayed.TimeIncrease(TimeSpan.FromSeconds(10));
});

// Configure unrecoverable exceptions
recoverability.AddUnrecoverableException<ValidationException>();
```
Recoverability Flow:
- Immediate Retries: Fast retries for transient failures (e.g., database locks)
- Delayed Retries: Slower retries with a delay that increases on each attempt
- Error Queue: Messages that fail all retries are sent to the error queue (configured as shown below)
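The error queue itself is configured on the endpoint. A minimal sketch, using the conventional queue names from the NServiceBus documentation:

```csharp
// Messages that exhaust immediate and delayed retries are moved here
endpointConfiguration.SendFailedMessagesTo("error");

// Optionally audit successfully processed messages as well
endpointConfiguration.AuditProcessedMessagesTo("audit");
```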
Hangfire Automatic Retry¶
Hangfire provides automatic retry for background jobs:
```csharp
// HangFireExtensions.cs
GlobalJobFilters.Filters.Add(new AutomaticRetryAttribute
{
    Attempts = 3
});

// Job-level retry configuration
[AutomaticRetry(Attempts = 3, DelaysInSeconds = new[] { 10, 30, 60 })]
public async Task ProcessScheduledTask()
{
    // Job logic
}
```
Configuration Options:
- Attempts: Maximum number of retry attempts
- DelaysInSeconds: Array of delays between retries
- LogEvents: Whether to log retry events
- OnAttemptsExceeded: Action when all retries are exhausted
Example:
```csharp
[AutomaticRetry(
    Attempts = 5,
    DelaysInSeconds = new[] { 10, 30, 60, 120, 300 },
    LogEvents = true,
    OnAttemptsExceeded = AttemptsExceededAction.Delete)]
public async Task ProcessBatchJob()
{
    // Job logic
}
```
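The architecture overview also lists job failure handling. When every attempt is exhausted, Hangfire elects the Failed state for the job; a hedged sketch of a filter that logs that transition, following the state-filter pattern from Hangfire's documentation (the filter name and logging sink are illustrative):

```csharp
using Hangfire.Common;
using Hangfire.States;

// Logs every job that ends up in the Failed state so it can be alerted on
public class LogFailedJobAttribute : JobFilterAttribute, IElectStateFilter
{
    public void OnStateElection(ElectStateContext context)
    {
        if (context.CandidateState is FailedState failedState)
        {
            Console.WriteLine(
                $"Job {context.BackgroundJob.Id} failed: {failedState.Exception.Message}");
        }
    }
}

// Registered globally alongside the automatic retry filter
GlobalJobFilters.Filters.Add(new LogFailedJobAttribute());
```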
Database Connection Resilience¶
Database connections can be configured with retry and timeout policies:
```csharp
// Orleans connection configuration
connection.ConnectionRetryDelay = TimeSpan.FromSeconds(5);
connection.OpenConnectionTimeout = TimeSpan.FromSeconds(30);
```
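For services that access their store through Entity Framework Core (an assumption; the template may use a different data access layer), the provider's built-in execution strategy covers connection retry and command timeouts. A minimal sketch for SQL Server; the context type and connection string name are illustrative:

```csharp
services.AddDbContext<OrdersDbContext>(options =>
    options.UseSqlServer(
        configuration.GetConnectionString("OrdersDb"),
        sqlOptions =>
        {
            // Retry transient SQL errors with exponential backoff
            sqlOptions.EnableRetryOnFailure(
                maxRetryCount: 5,
                maxRetryDelay: TimeSpan.FromSeconds(30),
                errorNumbersToAdd: null);

            // Fail queries that run longer than 30 seconds
            sqlOptions.CommandTimeout(30);
        }));
```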
Best Practices:
- Connection Retry: Retry database connections with exponential backoff
- Query Timeouts: Set appropriate timeouts for database queries
- Connection Pooling: Use connection pooling to manage resources
- Health Checks: Monitor database connection health
Combining Resilience Patterns¶
Policy Composition¶
Resilience patterns are typically combined to provide comprehensive protection:
```csharp
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(GetRetryPolicy())           // Outer: retry first
    .AddPolicyHandler(GetCircuitBreakerPolicy())  // Middle: circuit breaker
    .AddPolicyHandler(timeoutPolicy)              // Inner: timeout
    .AddHeaderPropagation();
```
Execution Order (outermost to innermost):
- Retry: Retries failed operations
- Circuit Breaker: Stops requests if service is failing
- Timeout: Ensures operations complete within time limit
- Bulkhead: Limits concurrent operations (if configured)
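Outside of IHttpClientFactory (for example when wrapping a direct client call), the same ordering can be expressed explicitly with PolicyWrap, where the first policy passed is the outermost; a minimal sketch reusing the policies defined earlier:

```csharp
// Outermost → innermost: retry, then circuit breaker, then timeout
var resiliencePipeline = Policy.WrapAsync(
    GetRetryPolicy(),
    GetCircuitBreakerPolicy(),
    timeoutPolicy);

// Execute an arbitrary HTTP call through the combined pipeline
var response = await resiliencePipeline.ExecuteAsync(
    ct => httpClient.GetAsync("/api/payments/123", ct),
    CancellationToken.None);
```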
Policy Selection¶
Different services may need different resilience strategies:
```csharp
// Critical service: Aggressive retry + circuit breaker
services.AddHttpClient<PaymentServiceClient>()
    .AddPolicyHandler(GetAggressiveRetryPolicy())
    .AddPolicyHandler(GetCircuitBreakerPolicy());

// Non-critical service: Light retry + timeout
services.AddHttpClient<NotificationServiceClient>()
    .AddPolicyHandler(GetLightRetryPolicy())
    .AddPolicyHandler(timeoutPolicy);

// High-volume service: Bulkhead + circuit breaker
services.AddHttpClient<SearchServiceClient>()
    .AddPolicyHandler(bulkheadPolicy)
    .AddPolicyHandler(GetCircuitBreakerPolicy());
```
Configuration¶
Resilience Options¶
Resilience policies can be configured via appsettings.json:
```json
{
  "Resilience": {
    "HttpClient": {
      "Retry": {
        "MaxRetries": 3,
        "BaseDelaySeconds": 2,
        "MaxDelaySeconds": 30
      },
      "CircuitBreaker": {
        "FailureThreshold": 5,
        "DurationOfBreakSeconds": 30,
        "MinimumThroughput": 10
      },
      "Timeout": {
        "TimeoutSeconds": 10
      }
    },
    "MassTransit": {
      "Retry": {
        "RetryCount": 3,
        "IntervalSeconds": 5
      }
    },
    "Hangfire": {
      "AutomaticRetries": {
        "MaximumAttempts": 3
      }
    }
  }
}
```
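A hedged sketch of how such settings can be bound and fed into the policies above; the options class name is illustrative rather than the template's actual type, and `configuration` is the application's IConfiguration:

```csharp
// Illustrative options class mirroring the "Resilience:HttpClient:Retry" section
public sealed class HttpRetryOptions
{
    public int MaxRetries { get; set; } = 3;
    public int BaseDelaySeconds { get; set; } = 2;
    public int MaxDelaySeconds { get; set; } = 30;
}

// Bind the section and build the retry policy from it
var retryOptions = configuration
    .GetSection("Resilience:HttpClient:Retry")
    .Get<HttpRetryOptions>() ?? new HttpRetryOptions();

var configuredRetryPolicy = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => (int)r.StatusCode >= 500)
    .WaitAndRetryAsync(
        retryOptions.MaxRetries,
        attempt => TimeSpan.FromSeconds(Math.Min(
            Math.Pow(retryOptions.BaseDelaySeconds, attempt),
            retryOptions.MaxDelaySeconds)));
```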
Environment-Specific Configuration¶
Different environments may require different resilience strategies:
Development:
```json
{
  "Resilience": {
    "HttpClient": {
      "Retry": {
        "MaxRetries": 1
      },
      "CircuitBreaker": {
        "FailureThreshold": 3
      }
    }
  }
}
```
Production:
```json
{
  "Resilience": {
    "HttpClient": {
      "Retry": {
        "MaxRetries": 5
      },
      "CircuitBreaker": {
        "FailureThreshold": 10
      }
    }
  }
}
```
Best Practices¶
Do's¶
- Use Resilience Patterns for All External Dependencies
- Configure Appropriate Timeouts
- Use Exponential Backoff for Retries
- Monitor Resilience Metrics
- Ensure Idempotency

```csharp
// ✅ GOOD - Idempotent operation
public async Task<PaymentResult> ProcessPaymentAsync(PaymentRequest request)
{
    // Check if already processed (idempotency key)
    if (await IsAlreadyProcessed(request.IdempotencyKey))
    {
        return await GetExistingResult(request.IdempotencyKey);
    }

    // Process payment
    return await ProcessNewPayment(request);
}
```

- Combine Patterns Appropriately
Don'ts¶
- Don't Retry Non-Transient Errors
- Don't Use Infinite Retries
- Don't Ignore Circuit Breaker State
- Don't Use Hardcoded Timeouts
- Don't Skip Fallback Mechanisms
Observability and Monitoring¶
Metrics to Monitor¶
| Metric | Description | Alert Threshold |
|---|---|---|
| Retry Rate | Percentage of requests that required retries | > 10% |
| Circuit Breaker Open Events | Number of times circuit breaker opened | > 5/hour |
| Timeout Rate | Percentage of requests that timed out | > 5% |
| Fallback Usage | Number of times fallback was used | > 20/hour |
| Bulkhead Rejections | Number of requests rejected by bulkhead | > 50/hour |
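These counters can be emitted from the policy callbacks with System.Diagnostics.Metrics so they flow into the service's existing metrics pipeline; the meter and instrument names below are illustrative:

```csharp
using System.Diagnostics.Metrics;

// One meter for all resilience instruments in the service
private static readonly Meter ResilienceMeter = new("ConnectSoft.Resilience");

private static readonly Counter<long> RetryCounter =
    ResilienceMeter.CreateCounter<long>("resilience.retries");

private static readonly Counter<long> CircuitOpenCounter =
    ResilienceMeter.CreateCounter<long>("resilience.circuit_breaker.opened");

// Incremented from the policy callbacks, e.g.:
// onRetry: (outcome, delay, attempt, context) =>
//     RetryCounter.Add(1, new KeyValuePair<string, object?>("service", "payment"));
```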
Logging¶
Log resilience events for observability:
```csharp
onRetry: (outcome, timespan, retryCount, context) =>
{
    logger.LogWarning(
        "Retry {RetryCount} after {Delay}s for {Operation}",
        retryCount,
        timespan.TotalSeconds,
        context.OperationKey); // OperationKey identifies the call site via the Polly Context
},
onBreak: (result, duration) =>
{
    logger.LogError(
        "Circuit breaker opened for {Duration}s for {Service}",
        duration.TotalSeconds,
        serviceName); // 'serviceName' is assumed to be captured where the policy is built
},
onReset: () =>
{
    logger.LogInformation(
        "Circuit breaker reset for {Service}",
        serviceName);
}
```
Distributed Tracing¶
Resilience events should be included in distributed traces:
```csharp
// 'ActivitySource' is assumed to be a static field, e.g.
// private static readonly ActivitySource ActivitySource = new("ConnectSoft.HttpClient");
using var activity = ActivitySource.StartActivity("HttpClientRequest");
try
{
    var response = await policy.ExecuteAsync(async () =>
        await httpClient.GetAsync(endpoint));

    activity?.SetTag("http.status_code", (int)response.StatusCode);
    activity?.SetTag("resilience.retry_count", retryCount); // retryCount captured from the retry callback
}
catch (Exception ex)
{
    activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
    throw;
}
```
Troubleshooting¶
Issue: High Retry Rate¶
Symptoms: Large percentage of requests require retries.
Solutions:
1. Check downstream service health
2. Review timeout configurations
3. Investigate network issues
4. Consider increasing timeout values
5. Check if service is overloaded
Issue: Circuit Breaker Frequently Opening¶
Symptoms: Circuit breaker opens repeatedly.
Solutions:
1. Investigate root cause of failures
2. Review failure threshold settings
3. Check service health and capacity
4. Consider increasing cooldown period
5. Review retry policies
Issue: Timeouts Too Frequent¶
Symptoms: Many requests timing out.
Solutions:
1. Increase timeout values
2. Check service performance
3. Review network latency
4. Consider service scaling
5. Review query/operation complexity
Issue: Fallback Triggering Too Often¶
Symptoms: Fallback mechanisms frequently used.
Solutions:
1. Investigate why primary operations fail
2. Review service health
3. Check if fallback data is stale
4. Consider improving primary operation reliability
5. Review circuit breaker settings
Related Documentation¶
- HTTP Client: HTTP client usage with resilience patterns
- Exception Handling: Exception handling with retry policies
- MassTransit: Messaging resilience with MassTransit
- NServiceBus: Messaging resilience with NServiceBus
- Hangfire: Background job resilience with Hangfire
- Rate Limiting: Rate limiting as a resilience pattern
Summary¶
Resiliency in the ConnectSoft Microservice Template provides:
- ✅ Transient Fault Handling: Automatic retry with exponential backoff
- ✅ Circuit Breaking: Prevents cascading failures
- ✅ Timeout Management: Ensures operations complete within acceptable timeframes
- ✅ Fallback Mechanisms: Graceful degradation when services fail
- ✅ Resource Isolation: Prevents failures from cascading
- ✅ Messaging Resilience: Built-in retry policies for MassTransit and NServiceBus
- ✅ Background Job Resilience: Automatic retry for Hangfire jobs
- ✅ Observability: Comprehensive logging and monitoring
By following these patterns, teams can:
- Build Reliable Services: Services continue operating despite failures
- Improve User Experience: Graceful degradation maintains functionality
- Prevent Cascading Failures: Circuit breakers and bulkheads contain failures
- Optimize Resource Usage: Timeouts and bulkheads prevent resource exhaustion
- Enable Observability: Comprehensive logging and monitoring for resilience events
Resiliency is essential for building production-ready microservices that can handle the complexities and challenges of distributed systems while maintaining availability, reliability, and performance.