Resiliency and Chaos Engineering Guide: ConnectSoft API Library Template¶
The ConnectSoft API Library Template includes comprehensive resiliency patterns and chaos engineering capabilities to ensure reliable API communication under various failure conditions. This guide explains the resiliency mechanisms, their configuration, and how to use chaos injection for testing.
Resiliency Patterns Overview¶
The template provides two main resiliency patterns:
- Standard Resilience Handler: Retry-based resilience with circuit breaker, timeout, bulkhead, and rate limiter.
- Standard Hedging Handler: Parallel request execution for slow dependencies with endpoint-specific strategies.
Both patterns are built on Microsoft.Extensions.Http.Resilience and Polly.Core, providing industry-standard fault tolerance mechanisms.
Standard Resilience Handler¶
The standard resilience handler provides a layered approach to fault tolerance with five chained strategies (from outermost to innermost):
- Bulkhead: Limits concurrent requests
- Total Request Timeout: Overall timeout for the entire request
- Retry: Retries on transient failures
- Circuit Breaker: Blocks execution after too many failures
- Attempt Timeout: Timeout for each individual attempt
Strategy Flow¶
graph LR
A[Request] --> B[Bulkhead]
B --> C[Total Request Timeout]
C --> D[Retry]
D --> E[Circuit Breaker]
E --> F[Attempt Timeout]
F --> G[HTTP Request]
Configuration¶
{
"MyService": {
"EnableHttpStandardResilience": true,
"HttpStandardResilience": {
"TotalRequestTimeout": {
"Timeout": "00:00:30"
},
"Retry": {
"MaxRetryAttempts": 3,
"BackoffType": "Constant",
"UseJitter": false,
"Delay": "00:00:02",
"MaxDelay": null
},
"CircuitBreaker": {
"FailureRatio": 0.1,
"MinimumThroughput": 2,
"SamplingDuration": "00:00:30",
"BreakDuration": "00:00:05"
},
"AttemptTimeout": {
"Timeout": "00:00:10"
},
"RateLimiter": {
"DefaultRateLimiterOptions": {
"PermitLimit": 1000,
"QueueLimit": 0,
"QueueProcessingOrder": "OldestFirst"
}
}
}
}
}
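One way to wire these options up with Microsoft.Extensions.Http.Resilience is to bind the configuration section when registering the handler. A minimal sketch; the typed client name `MyServiceClient` and the section path are illustrative assumptions, not names defined by the template:

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;

// "MyServiceClient" and the section path are assumptions for illustration.
services.AddHttpClient<MyServiceClient>()
    .AddStandardResilienceHandler()
    .Configure(configuration.GetSection("MyService:HttpStandardResilience"));
```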
Strategy Details¶
Bulkhead¶
Limits the maximum number of concurrent requests to prevent resource exhaustion.
Configuration:
- Not directly configurable in the standard resilience handler; concurrency limiting is implemented by the rate limiter (see Rate Limiter below).
Use Case: Prevent overwhelming the API with too many concurrent requests.
Total Request Timeout¶
Applies an overall timeout to the entire request execution, including all retries.
Configuration:
"TotalRequestTimeout": {
"Timeout": "00:00:30" // Covers all attempts, including retry delays
}
Use Case: Ensure requests don't hang indefinitely, even with retries.
Retry¶
Retries the request in case of transient errors.
Configuration:
"Retry": {
"MaxRetryAttempts": 3,
"BackoffType": "Constant", // or "Exponential"
"UseJitter": false,
"Delay": "00:00:02",
"MaxDelay": null
}
Backoff Types:
- Constant: Fixed delay between retries
- Exponential: Exponential backoff with increasing delays
Jitter: Adds randomness to prevent thundering herd problems.
Use Case: Handle transient network errors, temporary API unavailability.
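For intuition, the two backoff types produce roughly the following delay sequences. This is a standalone illustration (not the template's code), ignoring jitter; without jitter, Polly's exponential backoff doubles the base delay on each successive attempt:

```csharp
using System;

class BackoffDemo
{
    static void Main()
    {
        TimeSpan baseDelay = TimeSpan.FromSeconds(2);
        for (int attempt = 0; attempt < 3; attempt++)
        {
            TimeSpan constant = baseDelay;                            // 2s, 2s, 2s
            TimeSpan exponential = baseDelay * Math.Pow(2, attempt);  // 2s, 4s, 8s
            Console.WriteLine(
                $"retry {attempt + 1}: constant={constant.TotalSeconds}s, " +
                $"exponential={exponential.TotalSeconds}s");
        }
    }
}
```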
Circuit Breaker¶
Blocks execution if too many failures are detected, preventing cascading failures.
Configuration:
"CircuitBreaker": {
"FailureRatio": 0.1, // 10% failure rate threshold
"MinimumThroughput": 2, // Minimum requests before opening
"SamplingDuration": "00:00:30", // Time window for sampling
"BreakDuration": "00:00:05" // Duration to keep circuit open
}
States:
- Closed: Normal operation
- Open: Circuit is open, requests are blocked
- Half-Open: Testing if service has recovered
Use Case: Prevent overwhelming a failing API, allow time for recovery.
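The same thresholds can be expressed directly against Polly.Core. A sketch only; the failure predicate here is an assumption for illustration, not the template's exact handling logic:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;

var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
    {
        FailureRatio = 0.1,                          // open at >= 10% failures...
        MinimumThroughput = 2,                       // ...once at least 2 calls were sampled
        SamplingDuration = TimeSpan.FromSeconds(30), // rolling sampling window
        BreakDuration = TimeSpan.FromSeconds(5),     // stay open 5s, then half-open
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .HandleResult(r => (int)r.StatusCode >= 500)
    })
    .Build();
```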
Attempt Timeout¶
Limits each individual request attempt duration.
Configuration:
"AttemptTimeout": {
"Timeout": "00:00:10" // Applies per attempt, not across retries
}
Use Case: Prevent individual attempts from hanging.
Rate Limiter¶
Limits the rate of requests to comply with API rate policies.
Configuration:
"RateLimiter": {
"DefaultRateLimiterOptions": {
"PermitLimit": 1000, // Maximum permits
"QueueLimit": 0, // Queue size (0 = no queue)
"QueueProcessingOrder": "OldestFirst"
}
}
Use Case: Comply with API rate limits, prevent throttling.
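The DefaultRateLimiterOptions block corresponds to ConcurrencyLimiterOptions from System.Threading.RateLimiting; a sketch of the equivalent settings in code:

```csharp
using System.Threading.RateLimiting;

// Mirrors the JSON above: up to 1000 concurrent permits, no queueing.
var limiterOptions = new ConcurrencyLimiterOptions
{
    PermitLimit = 1000,
    QueueLimit = 0,
    QueueProcessingOrder = QueueProcessingOrder.OldestFirst
};
```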
Standard Hedging Handler¶
The standard hedging handler mitigates slow dependencies by issuing additional requests in parallel when the primary request is slow, with per-endpoint resilience strategies.
Strategy Flow¶
graph LR
A[Request] --> B[Total Request Timeout]
B --> C[Hedging]
C --> D[Endpoint: Bulkhead]
D --> E[Endpoint: Circuit Breaker]
E --> F[Endpoint: Attempt Timeout]
F --> G[HTTP Request]
Configuration¶
{
"MyService": {
"EnableHttpStandardHedgingResilience": true,
"HttpStandardHedgingResilience": {
"TotalRequestTimeout": {
"Timeout": "00:00:30"
},
"Hedging": {
"Delay": "00:00:02",
"MaxHedgedAttempts": 1
},
"Endpoint": {
"RateLimiter": {
"DefaultRateLimiterOptions": {
"PermitLimit": 1000,
"QueueLimit": 0
}
},
"CircuitBreaker": {
"FailureRatio": 0.1,
"MinimumThroughput": 2,
"SamplingDuration": "00:00:30",
"BreakDuration": "00:00:05"
},
"Timeout": {
"Timeout": "00:00:10"
}
}
}
}
}
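A sketch of registering the hedging handler in code (the typed client name `MyServiceClient` is an illustrative assumption); for a given client, use the hedging handler instead of the standard resilience handler, not both:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;

services.AddHttpClient<MyServiceClient>()
    .AddStandardHedgingHandler()
    .Configure(options =>
    {
        options.Hedging.Delay = TimeSpan.FromSeconds(2);  // wait before hedging
        options.Hedging.MaxHedgedAttempts = 1;            // one extra parallel request
    });
```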
Strategy Details¶
Hedging¶
Executes multiple requests in parallel if the first request is slow.
Configuration:
"Hedging": {
"Delay": "00:00:02", // Wait 2 seconds before hedging
"MaxHedgedAttempts": 1 // Maximum parallel requests (1 = 2 total)
}
How It Works:
- Send the first request
- If no response is received within Delay, send the hedged request(s)
- Return the first successful response
- Cancel any remaining in-flight requests
Use Case: Handle slow dependencies by trying multiple endpoints or instances in parallel.
Endpoint Strategies¶
Each endpoint has its own:
- Rate Limiter: Per-endpoint rate limiting
- Circuit Breaker: Per-endpoint circuit breaker
- Timeout: Per-endpoint attempt timeout
Use Case: Different endpoints may have different characteristics and failure rates.
Polly Integration¶
The template uses Polly.Core for advanced resilience patterns. While the standard handlers provide most functionality, you can add custom Polly policies if needed.
Custom Policies¶
You can add custom Polly policies in the Resilience/ folder:
using System;
using System.Net;
using System.Net.Http;
using Polly;
using Polly.Retry;

public static class CustomPolicies
{
public static ResiliencePipeline<HttpResponseMessage> GetCustomPolicy()
{
return new ResiliencePipelineBuilder<HttpResponseMessage>()
.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
{
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<HttpRequestException>()
.HandleResult(r => r.StatusCode == HttpStatusCode.TooManyRequests),
MaxRetryAttempts = 3,
Delay = TimeSpan.FromSeconds(2)
})
.Build();
}
}
Integration¶
Custom policies can be added to the HTTP client builder:
httpClientBuilder.AddResilienceHandler("custom", builder =>
{
builder.AddPipeline(CustomPolicies.GetCustomPolicy());
});
Chaos Injection¶
Chaos injection allows you to test your library's resiliency by simulating failures and delays.
Overview¶
Chaos injection uses Polly's Simmy library to inject faults, latency, and outcomes into requests.
Configuration¶
{
"MyService": {
"EnableChaosInjection": true,
"ChaosInjection": {
"InjectionRate": 0.001, // 0.1% of requests
"Latency": "00:00:05" // 5 second delay
}
}
}
Chaos Strategies¶
Latency Injection¶
Adds delays to simulate slow network conditions.
Configuration:
"ChaosInjection": {
"InjectionRate": 0.1, // 10% of requests
"Latency": "00:00:05" // 5 second delay
}
Use Case: Test timeout handling, retry mechanisms.
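Latency injection corresponds to Simmy's AddChaosLatency strategy; a sketch of the equivalent call, mirroring the fault and outcome examples below:

```csharp
builder.AddChaosLatency(
    injectionRate: 0.1,
    latency: TimeSpan.FromSeconds(5));
```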
Fault Injection¶
Simulates exceptions and errors.
Implementation (automatic):
builder.AddChaosFault(
injectionRate: 0.001,
fault: () => new InvalidOperationException("Chaos strategy injection!"));
Use Case: Test exception handling, circuit breaker behavior.
Outcome Injection¶
Simulates HTTP error responses.
Implementation (automatic):
builder.AddChaosOutcome(
injectionRate: 0.001,
outcome: () => new HttpResponseMessage(HttpStatusCode.InternalServerError));
Use Case: Test error response handling, retry logic.
Chaos Testing Workflow¶
- Enable Chaos Injection: Set EnableChaosInjection: true in configuration.
- Configure Injection Rate: Set InjectionRate (0.0 to 1.0).
- Configure Latency: Set Latency for delay injection.
- Run Tests: Execute your tests and observe behavior.
- Validate Resiliency: Verify retry, circuit breaker, and timeout mechanisms work correctly.
Example Configuration¶
{
"MyService": {
"EnableChaosInjection": true,
"ChaosInjection": {
"InjectionRate": 0.20, // 20% of requests
"Latency": "00:00:05" // 5 second delay
},
"EnableHttpStandardResilience": true,
"HttpStandardResilience": {
"Retry": {
"MaxRetryAttempts": 3,
"Delay": "00:00:02"
},
"CircuitBreaker": {
"FailureRatio": 0.1,
"BreakDuration": "00:00:05"
}
}
}
}
Chaos Testing Best Practices¶
- Start Small: Begin with low injection rates (0.1% - 1%).
- Gradual Increase: Gradually increase injection rates to test limits.
- Monitor Metrics: Watch metrics for failure rates and recovery times.
- Test Different Scenarios: Test latency, faults, and outcomes separately.
- Validate Recovery: Ensure systems recover after chaos injection stops.
Resiliency Best Practices¶
Configuration Guidelines¶
- Timeout Hierarchy:
  - Attempt Timeout < Total Request Timeout
  - Account for retries in Total Request Timeout
- Retry Configuration:
  - Use exponential backoff for distributed systems
  - Add jitter to prevent thundering herd
  - Limit max retry attempts (typically 3-5)
- Circuit Breaker:
  - Set appropriate failure ratio (typically 0.1 - 0.5)
  - Configure minimum throughput to avoid false positives
  - Set break duration to allow recovery
- Rate Limiting:
  - Configure based on API rate limits
  - Use queue for non-critical requests
  - Monitor for rate limit violations
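As a worked example of the timeout hierarchy: with MaxRetryAttempts of 3 (4 attempts total), a constant 2-second delay, and a 10-second attempt timeout, the worst case is 4 × 10s + 3 × 2s = 46 seconds, so the total request timeout should be at least that:

```json
"HttpStandardResilience": {
  "TotalRequestTimeout": { "Timeout": "00:00:50" }, // >= 46s worst case
  "Retry": { "MaxRetryAttempts": 3, "BackoffType": "Constant", "Delay": "00:00:02" },
  "AttemptTimeout": { "Timeout": "00:00:10" }
}
```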
When to Use Standard Resilience vs Hedging¶
Use Standard Resilience When:
- Single endpoint or service
- Retry-based recovery is sufficient
- Predictable failure patterns
Use Hedging When:
- Multiple endpoints or instances available
- Slow responses are common
- Parallel execution improves reliability
Monitoring and Observability¶
- Metrics: Track retry counts, circuit breaker state, timeout occurrences
- Logging: Log resilience events (retries, circuit breaker opens/closes)
- Alerts: Alert on high failure rates, circuit breaker openings
Troubleshooting¶
Common Issues¶
Too Many Retries¶
Problem: Requests are retried too many times, causing delays.
Solutions:
- Reduce MaxRetryAttempts
- Increase Delay between retries
- Check if errors are truly transient
Circuit Breaker Opens Too Often¶
Problem: Circuit breaker opens unnecessarily.
Solutions:
- Increase FailureRatio threshold
- Increase MinimumThroughput
- Check if failures are transient or permanent
Timeouts Too Short¶
Problem: Requests timeout before completion.
Solutions:
- Increase AttemptTimeout
- Increase TotalRequestTimeout
- Check network latency and API response times
Rate Limiting Issues¶
Problem: Rate limiter blocks too many requests.
Solutions:
- Increase PermitLimit
- Add QueueLimit for queuing
- Check API rate limit policies
Configuration Examples¶
Production Configuration¶
{
"MyService": {
"EnableHttpStandardResilience": true,
"HttpStandardResilience": {
"TotalRequestTimeout": { "Timeout": "00:00:30" },
"Retry": {
"MaxRetryAttempts": 3,
"BackoffType": "Exponential",
"UseJitter": true,
"Delay": "00:00:01",
"MaxDelay": "00:00:10"
},
"CircuitBreaker": {
"FailureRatio": 0.1,
"MinimumThroughput": 10,
"SamplingDuration": "00:01:00",
"BreakDuration": "00:00:30"
},
"AttemptTimeout": { "Timeout": "00:00:10" },
"RateLimiter": {
"DefaultRateLimiterOptions": {
"PermitLimit": 100,
"QueueLimit": 10
}
}
},
"EnableChaosInjection": false
}
}
Testing Configuration (with Chaos)¶
{
"MyService": {
"EnableHttpStandardResilience": true,
"HttpStandardResilience": {
"TotalRequestTimeout": { "Timeout": "00:01:00" },
"Retry": {
"MaxRetryAttempts": 5,
"BackoffType": "Constant",
"Delay": "00:00:01"
},
"CircuitBreaker": {
"FailureRatio": 0.2,
"MinimumThroughput": 2,
"BreakDuration": "00:00:05"
}
},
"EnableChaosInjection": true,
"ChaosInjection": {
"InjectionRate": 0.1,
"Latency": "00:00:03"
}
}
}
Conclusion¶
The ConnectSoft API Library Template provides comprehensive resiliency patterns and chaos engineering capabilities to ensure reliable API communication. By configuring appropriate resilience strategies and using chaos injection for testing, you can build robust API client libraries that handle failures gracefully.
For more information, see:
- Configuration Guide - Detailed configuration options
- Features Guide - Resiliency features overview
- Testing Guide - Chaos testing strategies