Observability Scaling

This document covers strategies for scaling observability infrastructure.

Scaling Dimensions

Horizontal Scaling

Add more instances to handle increased load.

Benefits:
- Linear scaling
- No downtime
- Better fault tolerance

Challenges:
- State management
- Load distribution
- Coordination

Vertical Scaling

Increase resources of existing instances.

Benefits:
- Simple
- No coordination needed
- Lower complexity

Challenges:
- Limited by hardware
- Downtime for scaling
- Cost at scale
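
In Kubernetes, vertical scaling usually means raising the container's resource requests and limits. A minimal sketch follows; the names and values are illustrative, and the resulting rollout restarts the pods, which is the downtime noted above:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  template:
    spec:
      containers:
        - name: otel-collector
          resources:
            requests:
              cpu: "1"       # raised from the previous request
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi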

Collector Scaling

Horizontal Scaling Pattern

Kubernetes Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3  # Scale based on load
  strategy:
    type: RollingUpdate

Auto-scaling:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Load Balancing

Service Configuration:

apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  type: ClusterIP
  ports:
    - port: 4317
      targetPort: 4317
  selector:
    app: otel-collector
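
OTLP/gRPC clients hold long-lived connections, so a ClusterIP Service balances connections rather than individual requests. One option, sketched below, is a headless Service so clients can resolve the individual pod IPs and spread load across them; other approaches, such as the Collector's loadbalancing exporter, also work:

apiVersion: v1
kind: Service
metadata:
  name: otel-collector-headless
spec:
  clusterIP: None  # headless: DNS returns the individual pod IPs
  ports:
    - port: 4317
      targetPort: 4317
  selector:
    app: otel-collector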

Performance Tuning

Batch Processing:

processors:
  batch:
    timeout: 5s  # Increase for better batching
    send_batch_size: 2048
    send_batch_max_size: 4096

Memory Management:

processors:
  memory_limiter:
    limit_mib: 400
    spike_limit_mib: 100
    check_interval: 5s
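
These processors only take effect when referenced in a pipeline. A common ordering, sketched here with placeholder receiver and exporter names, runs memory_limiter before batch so back-pressure is applied early:

service:
  pipelines:
    traces:
      receivers: [otlp]                    # placeholder receiver
      processors: [memory_limiter, batch]  # memory_limiter should run first
      exporters: [otlp]                    # placeholder exporter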

Backend Scaling

Prometheus Scaling

Federation Pattern:

# Global Prometheus
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="otel-collector"}'
    static_configs:
      - targets:
          - 'prometheus-region-1:9090'
          - 'prometheus-region-2:9090'

Thanos for Long-term Storage:
- Global query view
- Long-term retention
- Deduplication
- Compression
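
A minimal sketch of the Thanos sidecar pattern, assuming Prometheus runs in Kubernetes and an object-storage configuration file is mounted into the pod; the paths and version are illustrative:

containers:
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.34.0  # pin to the version in use
    args:
      - sidecar
      - --tsdb.path=/prometheus                           # Prometheus TSDB volume
      - --prometheus.url=http://localhost:9090
      - --objstore.config-file=/etc/thanos/objstore.yml   # bucket config (illustrative path)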

Elasticsearch Scaling

Cluster Configuration:

# Elasticsearch cluster with 3 nodes
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  replicas: 3
  serviceName: elasticsearch
  template:
    spec:
      containers:
        - name: elasticsearch
          env:
            - name: discovery.seed_hosts
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"

Sharding Strategy:
- Primary shards: 3-5 per index
- Replica shards: 1-2 per primary
- One index per day or week

Jaeger Scaling

Storage Backend Scaling:
- Memory: Not scalable (single instance)
- Elasticsearch: Scale the Elasticsearch cluster
- Cassandra: Scale the Cassandra cluster

Collector Scaling:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
spec:
  replicas: 3
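
The Jaeger collector reads its storage settings from environment variables. A sketch of the relevant container spec, assuming an Elasticsearch backend; the endpoint and image tag are illustrative:

      containers:
        - name: jaeger-collector
          image: jaegertracing/jaeger-collector:1.57.0  # pin to the version in use
          env:
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200  # illustrative Elasticsearch endpoint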

Capacity Planning

Metrics Collection

Estimate:
- Metrics per service: 100-1,000
- Services: 10-100
- Total metrics: 1,000-100,000
- Sample interval: 15s
- Storage: ~1KB per metric per sample

Calculation:

Samples per day = 86,400 s ÷ 15 s = 5,760
Storage per day = metrics × samples_per_day × size_per_sample
Storage per day = 10,000 × 5,760 × 1KB = ~55GB

Trace Collection

Estimate:
- Traces per request: 1
- Requests per second: 1,000
- Spans per trace: 10
- Storage: ~1KB per span

Calculation:

Spans per day = requests_per_second × spans_per_trace × seconds_per_day
Spans per day = 1,000 × 10 × 86,400 = 864,000,000
Storage per day = 864,000,000 × 1KB = ~824GB

Log Collection

Estimate:
- Logs per service: 100-1,000 per minute
- Services: 10-100
- Storage: ~500 bytes per log

Calculation:

Logs per day = services × logs_per_minute × minutes_per_day
Logs per day = 50 × 500 × 1,440 = 36,000,000
Storage per day = 36,000,000 × 500 bytes = ~17GB

Scaling Strategies

Proactive Scaling

Based on Predictable Patterns:
- Time-based scaling (business hours)
- Event-based scaling (releases)
- Calendar-based scaling (holidays)
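
One way to implement time-based scaling is a CronJob that patches the replica count ahead of business hours. This is a sketch only; it assumes a ServiceAccount (here called collector-scaler) with permission to scale the Deployment:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-collectors
spec:
  schedule: "0 7 * * 1-5"  # weekdays at 07:00, before business hours
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: collector-scaler  # assumed RBAC: allowed to scale deployments
          restartPolicy: OnFailure
          containers:
            - name: scale
              image: bitnami/kubectl:1.29  # illustrative kubectl image
              command: ["kubectl", "scale", "deployment/otel-collector", "--replicas=6"]

A matching CronJob can scale the Deployment back down after hours.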

Reactive Scaling

Based on Metrics:
- CPU utilization
- Memory usage
- Queue depth
- Error rates
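
Reactive scaling on CPU is what the HPA shown earlier already provides. A sketch extending it with a memory target and a scale-down stabilization window so replicas do not flap; the thresholds are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes of low usage before scaling down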

Predictive Scaling

Based on ML Models:
- Predict future load
- Scale before demand
- Optimize resource usage

Cost Optimization

Data Volume Reduction

Sampling:

processors:
  probabilistic_sampler:
    sampling_percentage: 10.0  # Sample 10%

Filtering:

processors:
  filter:
    traces:
      span:
        - 'attributes["http.status_code"] == 200'

Aggregation:
- Pre-aggregate metrics
- Reduce cardinality
- Use histograms
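
Pre-aggregation can be done with Prometheus recording rules. A sketch that rolls a per-instance counter up to a per-job rate; http_requests_total is only an example metric name:

groups:
  - name: pre-aggregation
    rules:
      - record: job:http_requests:rate5m  # aggregated series with lower cardinality
        expr: sum by (job) (rate(http_requests_total[5m]))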

Resource Optimization

Right-sizing:
- Match resources to workload
- Regular reviews
- Adjust based on metrics

Auto-scaling:
- Scale down during low usage
- Scale up during high usage
- Use spot instances for non-critical workloads

Monitoring Scaling

Key Metrics

Collector:
- otelcol_receiver_accepted_spans
- otelcol_exporter_send_failed_spans
- otelcol_processor_batch_batch_send_size
- CPU and memory usage
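
These otelcol_* metrics come from the Collector's internal telemetry endpoint, which serves Prometheus metrics on port 8888 by default. A sketch of a scrape job for it; the target address is illustrative, and the endpoint's bind address may need to be exposed beyond localhost:

scrape_configs:
  - job_name: 'otel-collector-internal'
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'otel-collector:8888'  # Collector internal telemetry endpoint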

Backends:
- Ingestion rate
- Storage usage
- Query latency
- Error rates

Alerts

Scale-up Alerts:
- High CPU utilization (>80%)
- High memory usage (>80%)
- Queue depth increasing
- Error rates increasing
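
A scale-up signal can be expressed as a Prometheus alerting rule. A sketch based on the exporter-failure metric listed above; the threshold, duration, and exact metric name (which may carry a _total suffix depending on the Collector version) are illustrative:

groups:
  - name: collector-scaling
    rules:
      - alert: CollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector is failing to export spans; check the backend or scale out"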

Scale-down Alerts:
- Low CPU utilization (<30%)
- Low memory usage (<30%)
- Low ingestion rate
- No errors

Further Reading