
Observability Scaling

This document covers strategies for scaling observability infrastructure: collectors, storage backends, capacity planning, and cost optimization.

Scaling Dimensions

Horizontal Scaling

Add more instances to handle increased load.

Benefits:

  • Near-linear capacity growth for stateless components
  • No downtime when adding or removing instances
  • Better fault tolerance

Challenges:

  • State management
  • Load distribution
  • Coordination

Vertical Scaling

Increase resources of existing instances.

Benefits:

  • Simple
  • No coordination needed
  • Lower complexity

Challenges:

  • Limited by hardware
  • Downtime while resizing (usually a restart)
  • Cost at scale

Collector Scaling

Horizontal Scaling Pattern

Kubernetes Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
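  replicas: 3  # Scale based on load
  strategy:
    type: RollingUpdate
  # selector and pod template omitted for brevity; pods must carry the
  # label the Service selects on (app: otel-collector)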

Auto-scaling:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
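
As a rough sketch of how the 70% target turns into a replica count: the Kubernetes HPA documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The Python below applies that rule with the bounds from the manifest above. It ignores the controller's tolerance band and stabilization windows, and the function name is illustrative only.

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 70.0,
                     min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Core HPA rule: ceil(current * observed / target), clamped to bounds."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(3, 95.0))   # 3 * 95 / 70 = 4.07 -> 5
print(desired_replicas(3, 20.0))   # 3 * 20 / 70 = 0.86 -> 1, raised to the minimum of 2
print(desired_replicas(8, 140.0))  # 8 * 140 / 70 = 16 -> capped at the maximum of 10
```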

Load Balancing

Service Configuration:

apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  type: ClusterIP
  ports:
    - port: 4317
      targetPort: 4317
  selector:
    app: otel-collector

Performance Tuning

Batch Processing:

processors:
  batch:
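    timeout: 5s  # Flush at least every 5s; higher values batch better but add latency
    send_batch_size: 2048  # Send as soon as this many items are queued
    send_batch_max_size: 4096  # Upper bound on one batch; larger batches are split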

Memory Management:

processors:
  memory_limiter:
    limit_mib: 400  # Hard limit
    spike_limit_mib: 100  # Soft limit = limit_mib - spike_limit_mib = 300
    check_interval: 5s  # How often memory usage is measured
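
To make the two thresholds concrete: per the memory_limiter documentation, the soft limit is limit_mib minus spike_limit_mib, so the values above give a soft limit of 300 MiB and a hard limit of 400 MiB. The following is a simplified model of that behavior, not the collector's implementation; the function name and return labels are hypothetical.

```python
def memory_limiter_state(used_mib: float, limit_mib: int = 400,
                         spike_limit_mib: int = 100) -> str:
    """Classify memory usage against the limiter's soft and hard limits."""
    soft_limit = limit_mib - spike_limit_mib  # 300 MiB with the values above
    if used_mib >= limit_mib:
        # Above the hard limit: data is refused and a GC is forced.
        return "hard"
    if used_mib >= soft_limit:
        # Above the soft limit: incoming data is refused until usage drops.
        return "soft"
    return "normal"

print(memory_limiter_state(250))  # normal
print(memory_limiter_state(320))  # soft
print(memory_limiter_state(410))  # hard
```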

Backend Scaling

Prometheus Scaling

Federation Pattern:

# Global Prometheus
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="otel-collector"}'
    static_configs:
      - targets:
          - 'prometheus-region-1:9090'
          - 'prometheus-region-2:9090'

Thanos for Long-term Storage:

  • Global query view
  • Long-term retention
  • Deduplication
  • Compaction and downsampling

Elasticsearch Scaling

Cluster Configuration:

# Elasticsearch cluster with 3 nodes (selector, image, and volumes omitted for brevity)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  replicas: 3
  serviceName: elasticsearch
  template:
    spec:
      containers:
        - name: elasticsearch
          env:
            - name: discovery.seed_hosts
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"

Sharding Strategy:

  • Primary shards: 3-5 per index
  • Replica shards: 1-2 per primary
  • One index per day or per week, depending on volume
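
Time-based indices multiply the shard count quickly, so it can help to estimate the total before choosing values. A minimal sketch, assuming a 30-day retention with daily indices (the retention figure is an illustration, not from this document):

```python
def total_shards(primaries: int, replicas: int, retained_indices: int) -> int:
    """Each primary has `replicas` copies, so one index holds primaries * (1 + replicas) shards."""
    return retained_indices * primaries * (1 + replicas)

# Low and high ends of the ranges above, with 30 daily indices retained.
print(total_shards(primaries=3, replicas=1, retained_indices=30))  # 180
print(total_shards(primaries=5, replicas=2, retained_indices=30))  # 450
```

Switching the same setup to weekly indices keeps about five indices instead of 30, which cuts the total shard count by roughly six times.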

Jaeger Scaling

Storage Backend Scaling:

  • Memory: Not scalable (single instance; data is lost on restart)
  • Elasticsearch: Scale Elasticsearch cluster
  • Cassandra: Scale Cassandra cluster

Collector Scaling:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
spec:
  replicas: 3

Capacity Planning

Metrics Collection

Estimate:

  • Metrics per service: 100-1000
  • Services: 10-100
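  • Total metrics: 1,000-100,000 (the calculation below uses 10,000)
  • Scrape interval: 15s (5,760 samples per day)
  • Storage: ~1KB per metric per sample (a conservative, uncompressed upper bound; Prometheus's TSDB typically stores 1-2 bytes per sample after compression)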

Calculation:

Storage per day = metrics × samples_per_day × size_per_sample
Storage per day = 10,000 × 5,760 × 1KB = 57,600,000KB ≈ 55GiB in binary units (57.6GB in decimal units)
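
The same arithmetic as a small Python sketch, which makes it easy to try other intervals or series counts. It reads 1KB as 1,024 bytes and reports GiB, which reproduces the figure of roughly 55 above; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400

def metrics_storage_per_day(series: int, scrape_interval_s: int,
                            bytes_per_sample: int) -> int:
    """Bytes written per day = series * samples_per_day * bytes_per_sample."""
    samples_per_day = SECONDS_PER_DAY // scrape_interval_s  # 5,760 at 15s
    return series * samples_per_day * bytes_per_sample

total_bytes = metrics_storage_per_day(10_000, 15, 1024)
print(f"{total_bytes / 1024**3:.1f} GiB/day")  # 54.9 GiB/day
```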

Trace Collection

Estimate:

  • Traces per request: 1
  • Requests per second: 1000
  • Spans per trace: 10
  • Storage: ~1KB per span

Calculation:

Spans per day = requests_per_second × spans_per_trace × seconds_per_day
Spans per day = 1,000 × 10 × 86,400 = 864,000,000
Storage per day = 864,000,000 × 1KB ≈ 824GiB in binary units (864GB in decimal units)

Log Collection

Estimate:

  • Logs per service: 100-1000 per minute (the calculation below uses 500)
  • Services: 10-100 (the calculation below uses 50)
  • Storage: ~500 bytes per log

Calculation:

Logs per day = services × logs_per_minute × minutes_per_day
Logs per day = 50 × 500 × 1,440 = 36,000,000
Storage per day = 36,000,000 × 500 bytes = 18,000,000,000 bytes ≈ 17GiB (18GB in decimal units)
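
Because the estimate spans a wide range, it can be useful to compute the low, middle, and high cases side by side. A minimal sketch using the ranges above (decimal GB; the function name is illustrative):

```python
MINUTES_PER_DAY = 1_440

def log_storage_per_day(services: int, logs_per_minute: int,
                        bytes_per_log: int = 500) -> int:
    """Bytes per day = services * logs_per_minute * minutes_per_day * bytes_per_log."""
    return services * logs_per_minute * MINUTES_PER_DAY * bytes_per_log

for label, services, rate in [("low", 10, 100), ("mid", 50, 500), ("high", 100, 1000)]:
    gb = log_storage_per_day(services, rate) / 1e9
    print(f"{label}: {gb:.2f} GB/day")
# low: 0.72 GB/day
# mid: 18.00 GB/day
# high: 72.00 GB/day
```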

Scaling Strategies

Proactive Scaling

Based on Predictable Patterns:

  • Time-based scaling (business hours)
  • Event-based scaling (releases)
  • Calendar-based scaling (holidays)
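
A time-based schedule can be as simple as a lookup on the current day and hour. The sketch below assumes business hours of 08:00-18:00 on weekdays and replica counts of 2 and 6; all of these values and the function name are hypothetical.

```python
from datetime import datetime

def scheduled_replicas(now: datetime, base: int = 2, peak: int = 6) -> int:
    """Return the replica count for a weekday business-hours schedule."""
    is_weekday = now.weekday() < 5          # Monday=0 ... Friday=4
    in_business_hours = 8 <= now.hour < 18  # assumed 08:00-18:00 local time
    return peak if is_weekday and in_business_hours else base

print(scheduled_replicas(datetime(2024, 1, 8, 10, 0)))  # Monday 10:00 -> 6
print(scheduled_replicas(datetime(2024, 1, 8, 22, 0)))  # Monday 22:00 -> 2
print(scheduled_replicas(datetime(2024, 1, 6, 10, 0)))  # Saturday 10:00 -> 2
```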

Reactive Scaling

Based on Metrics:

  • CPU utilization
  • Memory usage
  • Queue depth
  • Error rates

Predictive Scaling

Based on ML Models:

  • Predict future load
  • Scale before demand
  • Optimize resource usage
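
As the simplest possible stand-in for an ML model, the sketch below fits a least-squares straight line to recent load and sizes the deployment for the forecast value. A real setup would likely need seasonality-aware models; the per-replica capacity of 500 requests per second and the function names are assumptions made for illustration.

```python
import math
from typing import Sequence

def forecast_linear(history: Sequence[float], steps_ahead: int) -> float:
    """Extrapolate a least-squares line fitted to equally spaced observations."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)

def replicas_for(load: float, capacity_per_replica: float) -> int:
    """Replicas needed to serve `load`, never fewer than one."""
    return max(1, math.ceil(load / capacity_per_replica))

history = [1000, 1100, 1200, 1300, 1400]  # requests per second, one point per interval
predicted = forecast_linear(history, steps_ahead=3)
print(predicted)                     # 1700.0
print(replicas_for(predicted, 500))  # 4
```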

Cost Optimization

Data Volume Reduction

Sampling:

processors:
  probabilistic_sampler:
    sampling_percentage: 10.0  # Sample 10%
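
Probabilistic sampling is usually made deterministic by hashing the trace ID, so every span of a trace gets the same decision. The sketch below illustrates that idea with SHA-256; it is not the collector's actual hashing scheme, and the function name and trace ID format are hypothetical.

```python
import hashlib

def keep_trace(trace_id: str, sampling_percentage: float = 10.0) -> bool:
    """Deterministically keep roughly `sampling_percentage` percent of trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # uniform bucket in 0..9999
    return bucket < sampling_percentage * 100            # 10.0% -> buckets 0..999

trace_ids = [f"trace-{i}" for i in range(100_000)]
kept = sum(keep_trace(t) for t in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of traces")
```

At the 10% setting, the trace estimate under Capacity Planning drops from 864,000,000 to about 86,400,000 spans per day.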

Filtering:

processors:
  filter:
    traces:
      span:
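        # Spans matching any condition are DROPPED, so this discards
        # successful (HTTP 200) spans and keeps everything else.
        - 'attributes["http.status_code"] == 200'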

Aggregation:

  • Pre-aggregate metrics
  • Reduce cardinality
  • Use histograms
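
Reducing cardinality often means dropping a high-cardinality label and summing the series that then become identical. A minimal sketch with made-up series and label names (`user_id` stands in for any unbounded label):

```python
from collections import defaultdict

def drop_labels(series, labels_to_drop):
    """Sum counter values across series after removing the given labels."""
    aggregated = defaultdict(float)
    for labels, value in series:
        kept = tuple(sorted((k, v) for k, v in labels.items()
                            if k not in labels_to_drop))
        aggregated[kept] += value
    return dict(aggregated)

series = [
    ({"service": "api", "route": "/users", "user_id": "1"}, 5.0),
    ({"service": "api", "route": "/users", "user_id": "2"}, 3.0),
    ({"service": "api", "route": "/orders", "user_id": "1"}, 2.0),
]
reduced = drop_labels(series, {"user_id"})
print(len(series), "->", len(reduced))  # 3 -> 2
```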

Resource Optimization

Right-sizing:

  • Match resources to workload
  • Regular reviews
  • Adjust based on metrics

Auto-scaling:

  • Scale down during low usage
  • Scale up during high usage
  • Use spot instances for non-critical workloads

Monitoring Scaling

Key Metrics

Collector:

  • otelcol_receiver_accepted_spans
  • otelcol_exporter_send_failed_spans
  • otelcol_processor_batch_batch_send_size
  • CPU and memory usage

Backends:

  • Ingestion rate
  • Storage usage
  • Query latency
  • Error rates

Alerts

Scale-up Alerts:

  • High CPU utilization (>80%)
  • High memory usage (>80%)
  • Queue depth increasing
  • Error rates increasing

Scale-down Alerts:

  • Low CPU utilization (<30%)
  • Low memory usage (<30%)
  • Low ingestion rate
  • No errors
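
The CPU, memory, and error thresholds above can be combined into a single decision function. This is a sketch only: it leaves out queue depth and ingestion rate, treats any breach as a scale-up signal, and requires all conditions to hold before scaling down; the function name and labels are hypothetical.

```python
def scaling_signal(cpu_pct: float, memory_pct: float,
                   error_rate: float = 0.0) -> str:
    """Map utilization readings to the alert thresholds listed above."""
    if cpu_pct > 80 or memory_pct > 80:
        return "scale-up"
    if cpu_pct < 30 and memory_pct < 30 and error_rate == 0:
        return "scale-down"
    return "hold"

print(scaling_signal(85, 40))        # scale-up
print(scaling_signal(20, 25))        # scale-down
print(scaling_signal(20, 25, 0.02))  # hold (errors present)
print(scaling_signal(50, 50))        # hold
```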

Further Reading