
Resiliency and Chaos Engineering Guide: ConnectSoft API Library Template

The ConnectSoft API Library Template includes comprehensive resiliency patterns and chaos engineering capabilities to ensure reliable API communication under various failure conditions. This guide explains the resiliency mechanisms, their configuration, and how to use chaos injection for testing.

Resiliency Patterns Overview

The template provides two main resiliency patterns:

  1. Standard Resilience Handler: Retry-based resilience with circuit breaker, timeout, bulkhead, and rate limiter.
  2. Standard Hedging Handler: Parallel request execution for slow dependencies with endpoint-specific strategies.

Both patterns are built on Microsoft.Extensions.Http.Resilience and Polly.Core, providing industry-standard fault tolerance mechanisms.

Standard Resilience Handler

The standard resilience handler provides a layered approach to fault tolerance with five chained strategies (from outermost to innermost):

  1. Bulkhead: Limits concurrent requests
  2. Total Request Timeout: Overall timeout for the entire request
  3. Retry: Retries on transient failures
  4. Circuit Breaker: Blocks execution after too many failures
  5. Attempt Timeout: Timeout for each individual attempt
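
As a point of reference, the pipeline above can also be wired up directly with Microsoft.Extensions.Http.Resilience. The sketch below is illustrative only — the template applies these options from configuration — and the client name `MyService` is an assumption that mirrors the configuration samples in this guide:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;

var services = new ServiceCollection();

// Attach the standard resilience pipeline to a named HTTP client.
// Each property mirrors one of the chained strategies listed above.
services.AddHttpClient("MyService")
    .AddStandardResilienceHandler(options =>
    {
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(30);
        options.Retry.MaxRetryAttempts = 3;
        options.Retry.Delay = TimeSpan.FromSeconds(2);
        options.CircuitBreaker.FailureRatio = 0.1;
        options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(10);
    });
```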

Strategy Flow

graph LR
    A[Request] --> B[Bulkhead]
    B --> C[Total Request Timeout]
    C --> D[Retry]
    D --> E[Circuit Breaker]
    E --> F[Attempt Timeout]
    F --> G[HTTP Request]

Configuration

{
  "MyService": {
    "EnableHttpStandardResilience": true,
    "HttpStandardResilience": {
      "TotalRequestTimeout": {
        "Timeout": "00:00:30"
      },
      "Retry": {
        "MaxRetryAttempts": 3,
        "BackoffType": "Constant",
        "UseJitter": false,
        "Delay": "00:00:02",
        "MaxDelay": null
      },
      "CircuitBreaker": {
        "FailureRatio": 0.1,
        "MinimumThroughput": 2,
        "SamplingDuration": "00:00:30",
        "BreakDuration": "00:00:05"
      },
      "AttemptTimeout": {
        "Timeout": "00:00:10"
      },
      "RateLimiter": {
        "DefaultRateLimiterOptions": {
          "PermitLimit": 1000,
          "QueueLimit": 0,
          "QueueProcessingOrder": "OldestFirst"
        }
      }
    }
  }
}

Strategy Details

Bulkhead

Limits the maximum number of concurrent requests to prevent resource exhaustion.

Configuration:

  • Not exposed as a separate option in the standard resilience handler; concurrency limiting is provided by the rate limiter strategy, which caps the number of in-flight requests

Use Case: Prevent overwhelming the API with too many concurrent requests.

Total Request Timeout

Applies an overall timeout to the entire request execution, including all retries.

Configuration:

"TotalRequestTimeout": {
  "Timeout": "00:00:30"  // 30 seconds total
}

Use Case: Ensure requests don't hang indefinitely, even with retries.

Retry

Retries the request in case of transient errors.

Configuration:

"Retry": {
  "MaxRetryAttempts": 3,
  "BackoffType": "Constant",  // or "Exponential"
  "UseJitter": false,
  "Delay": "00:00:02",
  "MaxDelay": null
}

Backoff Types:

  • Constant: Fixed delay between retries
  • Exponential: Exponential backoff with increasing delays

Jitter: Adds randomness to prevent thundering herd problems.

Use Case: Handle transient network errors, temporary API unavailability.
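
To make the backoff types concrete, here is a small standalone sketch of the delay sequences these options produce. This is plain arithmetic, not Polly's exact jitter formula:

```csharp
using System;

var baseDelay = TimeSpan.FromSeconds(2);
const int maxRetryAttempts = 3;

for (int attempt = 0; attempt < maxRetryAttempts; attempt++)
{
    // Constant: every retry waits the same base delay.
    TimeSpan constant = baseDelay;

    // Exponential: baseDelay * 2^attempt (2s, 4s, 8s, ...).
    TimeSpan exponential = TimeSpan.FromTicks(baseDelay.Ticks << attempt);

    // Jitter adds randomness (here up to +25%) so many clients retrying
    // at once do not synchronize into a thundering herd.
    TimeSpan jittered = exponential + TimeSpan.FromMilliseconds(
        Random.Shared.NextDouble() * exponential.TotalMilliseconds * 0.25);

    Console.WriteLine($"attempt {attempt + 1}: constant={constant}, " +
                      $"exponential={exponential}, jittered={jittered}");
}
```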

Circuit Breaker

Blocks execution if too many failures are detected, preventing cascading failures.

Configuration:

"CircuitBreaker": {
  "FailureRatio": 0.1,           // 10% failure rate threshold
  "MinimumThroughput": 2,        // Minimum requests before opening
  "SamplingDuration": "00:00:30", // Time window for sampling
  "BreakDuration": "00:00:05"    // Duration to keep circuit open
}

States:

  • Closed: Normal operation
  • Open: Circuit is open, requests are blocked
  • Half-Open: Testing if service has recovered

Use Case: Prevent overwhelming a failing API, allow time for recovery.
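
The same thresholds can be expressed as a standalone Polly v8 circuit breaker, shown here as an illustrative sketch (the template configures this through the standard handler, not by hand):

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;

// After >= 10% of calls fail within the 30s sampling window (and at
// least 2 calls were observed), the circuit opens for 5 seconds.
var pipeline = new ResiliencePipelineBuilder()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.1,
        MinimumThroughput = 2,
        SamplingDuration = TimeSpan.FromSeconds(30),
        BreakDuration = TimeSpan.FromSeconds(5),
        OnOpened = args =>
        {
            Console.WriteLine($"Circuit opened for {args.BreakDuration}");
            return default;
        }
    })
    .Build();

// While the circuit is open, executions throw BrokenCircuitException
// instead of reaching the failing dependency.
```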

Attempt Timeout

Limits each individual request attempt duration.

Configuration:

"AttemptTimeout": {
  "Timeout": "00:00:10"  // 10 seconds per attempt
}

Use Case: Prevent individual attempts from hanging.

Rate Limiter

Limits the rate of requests to comply with API rate policies.

Configuration:

"RateLimiter": {
  "DefaultRateLimiterOptions": {
    "PermitLimit": 1000,              // Maximum permits
    "QueueLimit": 0,                  // Queue size (0 = no queue)
    "QueueProcessingOrder": "OldestFirst"
  }
}

Use Case: Comply with API rate limits, prevent throttling.
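
Under the hood these options map to System.Threading.RateLimiting. The standard handler builds its own limiter, so the standalone sketch below only illustrates the semantics of the same values — with QueueLimit = 0, requests beyond PermitLimit are rejected immediately rather than queued:

```csharp
using System.Threading.RateLimiting;

var limiter = new ConcurrencyLimiter(new ConcurrencyLimiterOptions
{
    PermitLimit = 1000,
    QueueLimit = 0,
    QueueProcessingOrder = QueueProcessingOrder.OldestFirst
});

using RateLimitLease lease = await limiter.AcquireAsync(permitCount: 1);
if (lease.IsAcquired)
{
    // Proceed with the request; the permit is released when the lease
    // is disposed.
}
else
{
    // Over the limit: fail fast instead of queuing.
}
```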

Standard Hedging Handler

The standard hedging handler mitigates slow dependencies by issuing additional requests in parallel, each of which can be routed to a different endpoint with its own resilience strategies.

Strategy Flow

graph LR
    A[Request] --> B[Total Request Timeout]
    B --> C[Hedging]
    C --> D[Endpoint: Bulkhead]
    D --> E[Endpoint: Circuit Breaker]
    E --> F[Endpoint: Attempt Timeout]
    F --> G[HTTP Request]

Configuration

{
  "MyService": {
    "EnableHttpStandardHedgingResilience": true,
    "HttpStandardHedgingResilience": {
      "TotalRequestTimeout": {
        "Timeout": "00:00:30"
      },
      "Hedging": {
        "Delay": "00:00:02",
        "MaxHedgedAttempts": 1
      },
      "Endpoint": {
        "RateLimiter": {
          "DefaultRateLimiterOptions": {
            "PermitLimit": 1000,
            "QueueLimit": 0
          }
        },
        "CircuitBreaker": {
          "FailureRatio": 0.1,
          "MinimumThroughput": 2,
          "SamplingDuration": "00:00:30",
          "BreakDuration": "00:00:05"
        },
        "Timeout": {
          "Timeout": "00:00:10"
        }
      }
    }
  }
}

Strategy Details

Hedging

Executes multiple requests in parallel if the first request is slow.

Configuration:

"Hedging": {
  "Delay": "00:00:02",        // Wait 2 seconds before hedging
  "MaxHedgedAttempts": 1      // Maximum parallel requests (1 = 2 total)
}

How It Works:

  1. Send first request
  2. If response not received within Delay, send hedged request(s)
  3. Return first successful response
  4. Cancel remaining requests

Use Case: Handle slow dependencies by trying multiple endpoints or instances in parallel.
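
The steps above can be sketched as a standalone Polly v8 hedging pipeline. The `HttpClient` instance and URL are illustrative assumptions; the template's handler configures hedging for you:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Hedging;

using var httpClient = new HttpClient();

// If the primary attempt has not completed after 2 seconds, one hedged
// attempt starts in parallel; the first successful outcome wins and the
// losing attempt is canceled.
var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddHedging(new HedgingStrategyOptions<HttpResponseMessage>
    {
        Delay = TimeSpan.FromSeconds(2),
        MaxHedgedAttempts = 1  // 1 hedged + 1 original = 2 total attempts
    })
    .Build();

var response = await pipeline.ExecuteAsync(
    async token => await httpClient.GetAsync("https://example.com/", token));
```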

Endpoint Strategies

Each endpoint has its own:

  • Rate Limiter: Per-endpoint rate limiting
  • Circuit Breaker: Per-endpoint circuit breaker
  • Timeout: Per-endpoint attempt timeout

Use Case: Different endpoints may have different characteristics and failure rates.

Polly Integration

The template uses Polly.Core for advanced resilience patterns. While the standard handlers provide most functionality, you can add custom Polly policies if needed.

Custom Policies

You can add custom Polly policies in the Resilience/ folder:

using System;
using System.Net;
using System.Net.Http;
using Polly;
using Polly.Retry;

public static class CustomPolicies
{
    public static ResiliencePipeline<HttpResponseMessage> GetCustomPolicy()
    {
        return new ResiliencePipelineBuilder<HttpResponseMessage>()
            .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
            {
                // Retry on network failures and HTTP 429 responses.
                ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                    .Handle<HttpRequestException>()
                    .HandleResult(r => r.StatusCode == HttpStatusCode.TooManyRequests),
                MaxRetryAttempts = 3,
                Delay = TimeSpan.FromSeconds(2)
            })
            .Build();
    }
}

Integration

Custom policies can be added to the HTTP client builder:

httpClientBuilder.AddResilienceHandler("custom", builder =>
{
    builder.AddPipeline(CustomPolicies.GetCustomPolicy());
});

Chaos Injection

Chaos injection allows you to test your library's resiliency by simulating failures and delays.

Overview

Chaos injection uses Polly's chaos engineering strategies (originally the Simmy project, now part of Polly.Core) to inject faults, latency, and outcomes into requests.

Configuration

{
  "MyService": {
    "EnableChaosInjection": true,
    "ChaosInjection": {
      "InjectionRate": 0.001,  // 0.1% of requests
      "Latency": "00:00:05"     // 5 second delay
    }
  }
}

Chaos Strategies

Latency Injection

Adds delays to simulate slow network conditions.

Configuration:

"ChaosInjection": {
  "InjectionRate": 0.1,        // 10% of requests
  "Latency": "00:00:05"        // 5 second delay
}

Use Case: Test timeout handling, retry mechanisms.
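
For reference, the equivalent standalone Polly chaos pipeline looks like this, assuming Polly v8's chaos APIs in the `Polly.Simmy` namespace:

```csharp
using System;
using Polly;
using Polly.Simmy;

// 10% of executions are delayed by 5 seconds before the real call
// runs, which exercises attempt timeouts and retry behavior.
var pipeline = new ResiliencePipelineBuilder()
    .AddChaosLatency(0.1, TimeSpan.FromSeconds(5))
    .Build();
```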

Fault Injection

Simulates exceptions and errors.

Implementation (automatic):

builder.AddChaosFault(
    0.001,  // injection rate: 0.1% of requests
    () => new InvalidOperationException("Chaos strategy injection!"));

Use Case: Test exception handling, circuit breaker behavior.

Outcome Injection

Simulates HTTP error responses.

Implementation (automatic):

builder.AddChaosOutcome(
    0.001,  // injection rate: 0.1% of requests
    () => new HttpResponseMessage(HttpStatusCode.InternalServerError));

Use Case: Test error response handling, retry logic.

Chaos Testing Workflow

  1. Enable Chaos Injection: Set EnableChaosInjection: true in configuration.
  2. Configure Injection Rate: Set InjectionRate (0.0 to 1.0).
  3. Configure Latency: Set Latency for delay injection.
  4. Run Tests: Execute your tests and observe behavior.
  5. Validate Resiliency: Verify retry, circuit breaker, and timeout mechanisms work correctly.
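
The workflow can be exercised end to end with a self-contained pipeline, assuming Polly v8: an outer retry wraps an inner chaos fault strategy, so injected exceptions should mostly be absorbed by retries:

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;
using Polly.Simmy;

// Strategies added first are outermost: retry wraps the chaos fault,
// so a 20% injection rate is usually hidden by up to 3 retries.
var pipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        ShouldHandle = new PredicateBuilder().Handle<InvalidOperationException>(),
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromMilliseconds(100)
    })
    .AddChaosFault(0.2, () => new InvalidOperationException("Chaos!"))
    .Build();

await pipeline.ExecuteAsync(async token =>
{
    // The real call under test would go here.
    await Task.CompletedTask;
});
```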

Example Configuration

{
  "MyService": {
    "EnableChaosInjection": true,
    "ChaosInjection": {
      "InjectionRate": 0.20,      // 20% of requests
      "Latency": "00:00:05"       // 5 second delay
    },
    "EnableHttpStandardResilience": true,
    "HttpStandardResilience": {
      "Retry": {
        "MaxRetryAttempts": 3,
        "Delay": "00:00:02"
      },
      "CircuitBreaker": {
        "FailureRatio": 0.1,
        "BreakDuration": "00:00:05"
      }
    }
  }
}

Chaos Testing Best Practices

  1. Start Small: Begin with low injection rates (0.1% - 1%).
  2. Gradual Increase: Gradually increase injection rates to test limits.
  3. Monitor Metrics: Watch metrics for failure rates and recovery times.
  4. Test Different Scenarios: Test latency, faults, and outcomes separately.
  5. Validate Recovery: Ensure systems recover after chaos injection stops.

Resiliency Best Practices

Configuration Guidelines

  1. Timeout Hierarchy:

    • Attempt Timeout < Total Request Timeout
    • Account for retries in Total Request Timeout
  2. Retry Configuration:

    • Use exponential backoff for distributed systems
    • Add jitter to prevent thundering herd
    • Limit max retry attempts (typically 3-5)
  3. Circuit Breaker:

    • Set appropriate failure ratio (typically 0.1 - 0.5)
    • Configure minimum throughput to avoid false positives
    • Set break duration to allow recovery
  4. Rate Limiting:

    • Configure based on API rate limits
    • Use queue for non-critical requests
    • Monitor for rate limit violations
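
Guideline 1 is easy to check with arithmetic. Using the sample values from this guide (constant backoff), the worst case is every attempt hitting its timeout plus every retry delay:

```csharp
using System;

var attemptTimeout = TimeSpan.FromSeconds(10);
var retryDelay = TimeSpan.FromSeconds(2);
const int maxRetryAttempts = 3;

TimeSpan worstCase =
    attemptTimeout * (maxRetryAttempts + 1)   // 4 attempts x 10s = 40s
    + retryDelay * maxRetryAttempts;          // 3 delays  x 2s  = 6s

// 46s exceeds the sample 30s TotalRequestTimeout, so later attempts
// would be cut off; size TotalRequestTimeout with this sum in mind.
Console.WriteLine(worstCase);  // 00:00:46
```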

When to Use Standard Resilience vs Hedging

Use Standard Resilience When:

  • Single endpoint or service
  • Retry-based recovery is sufficient
  • Predictable failure patterns

Use Hedging When:

  • Multiple endpoints or instances available
  • Slow responses are common
  • Parallel execution improves reliability

Monitoring and Observability

  1. Metrics: Track retry counts, circuit breaker state, timeout occurrences
  2. Logging: Log resilience events (retries, circuit breaker opens/closes)
  3. Alerts: Alert on high failure rates, circuit breaker openings

Troubleshooting

Common Issues

Too Many Retries

Problem: Requests are retried too many times, causing delays.

Solutions:

  • Reduce MaxRetryAttempts
  • Increase Delay between retries
  • Check if errors are truly transient

Circuit Breaker Opens Too Often

Problem: Circuit breaker opens unnecessarily.

Solutions:

  • Increase FailureRatio threshold
  • Increase MinimumThroughput
  • Check if failures are transient or permanent

Timeouts Too Short

Problem: Requests timeout before completion.

Solutions:

  • Increase AttemptTimeout
  • Increase TotalRequestTimeout
  • Check network latency and API response times

Rate Limiting Issues

Problem: Rate limiter blocks too many requests.

Solutions:

  • Increase PermitLimit
  • Add QueueLimit for queuing
  • Check API rate limit policies

Configuration Examples

Production Configuration

{
  "MyService": {
    "EnableHttpStandardResilience": true,
    "HttpStandardResilience": {
      "TotalRequestTimeout": { "Timeout": "00:00:30" },
      "Retry": {
        "MaxRetryAttempts": 3,
        "BackoffType": "Exponential",
        "UseJitter": true,
        "Delay": "00:00:01",
        "MaxDelay": "00:00:10"
      },
      "CircuitBreaker": {
        "FailureRatio": 0.1,
        "MinimumThroughput": 10,
        "SamplingDuration": "00:01:00",
        "BreakDuration": "00:00:30"
      },
      "AttemptTimeout": { "Timeout": "00:00:10" },
      "RateLimiter": {
        "DefaultRateLimiterOptions": {
          "PermitLimit": 100,
          "QueueLimit": 10
        }
      }
    },
    "EnableChaosInjection": false
  }
}

Testing Configuration (with Chaos)

{
  "MyService": {
    "EnableHttpStandardResilience": true,
    "HttpStandardResilience": {
      "TotalRequestTimeout": { "Timeout": "00:01:00" },
      "Retry": {
        "MaxRetryAttempts": 5,
        "BackoffType": "Constant",
        "Delay": "00:00:01"
      },
      "CircuitBreaker": {
        "FailureRatio": 0.2,
        "MinimumThroughput": 2,
        "BreakDuration": "00:00:05"
      }
    },
    "EnableChaosInjection": true,
    "ChaosInjection": {
      "InjectionRate": 0.1,
      "Latency": "00:00:03"
    }
  }
}

Conclusion

The ConnectSoft API Library Template provides comprehensive resiliency patterns and chaos engineering capabilities to ensure reliable API communication. By configuring appropriate resilience strategies and using chaos injection for testing, you can build robust API client libraries that handle failures gracefully.
