Observability Scaling

This document covers strategies for scaling observability infrastructure.

Scaling Dimensions

Horizontal Scaling

Add more instances to handle increased load.

Benefits:
- Linear scaling
- No downtime
- Better fault tolerance

Challenges:
- State management
- Load distribution
- Coordination

Vertical Scaling

Increase resources of existing instances.

Benefits:
- Simple
- No coordination needed
- Lower complexity

Challenges:
- Limited by hardware
- Downtime for scaling
- Cost at scale
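
In Kubernetes, vertical scaling usually means raising the container's resource requests and limits. A minimal sketch follows; the names and values are illustrative, and the resulting rollout restarts the pods, which is the downtime noted above:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  template:
    spec:
      containers:
        - name: otel-collector
          resources:
            requests:
              cpu: "1"       # raised from the previous request
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi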

Collector Scaling

Horizontal Scaling Pattern

Kubernetes Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3  # Scale based on load
  strategy:
    type: RollingUpdate

Auto-scaling:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Load Balancing

Service Configuration:

apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  type: ClusterIP
  ports:
    - port: 4317
      targetPort: 4317
  selector:
    app: otel-collector
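
OTLP/gRPC clients hold long-lived connections, so a ClusterIP Service balances connections rather than individual requests. One option, sketched below, is a headless Service so clients can resolve the individual pod IPs and spread load across them; other approaches, such as the Collector's loadbalancing exporter, also work:

apiVersion: v1
kind: Service
metadata:
  name: otel-collector-headless
spec:
  clusterIP: None  # headless: DNS returns the individual pod IPs
  ports:
    - port: 4317
      targetPort: 4317
  selector:
    app: otel-collector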

Performance Tuning

Batch Processing:

processors:
  batch:
    timeout: 5s  # Increase for better batching
    send_batch_size: 2048
    send_batch_max_size: 4096

Memory Management:

processors:
  memory_limiter:
    limit_mib: 400
    spike_limit_mib: 100
    check_interval: 5s
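
These processors only take effect when referenced in a pipeline. A common ordering, sketched here with placeholder receiver and exporter names, runs memory_limiter before batch so back-pressure is applied early:

service:
  pipelines:
    traces:
      receivers: [otlp]                    # placeholder receiver
      processors: [memory_limiter, batch]  # memory_limiter should run first
      exporters: [otlp]                    # placeholder exporter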

Backend Scaling

Prometheus Scaling

Federation Pattern:

# Global Prometheus
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="otel-collector"}'
    static_configs:
      - targets:
          - 'prometheus-region-1:9090'
          - 'prometheus-region-2:9090'

Thanos for Long-term Storage:
- Global query view
- Long-term retention
- Deduplication
- Compression
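
A minimal sketch of the Thanos sidecar pattern, assuming Prometheus runs in Kubernetes and an object-storage configuration file is mounted into the pod; the paths and version are illustrative:

containers:
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.34.0  # pin to the version in use
    args:
      - sidecar
      - --tsdb.path=/prometheus                           # Prometheus TSDB volume
      - --prometheus.url=http://localhost:9090
      - --objstore.config-file=/etc/thanos/objstore.yml   # bucket config (illustrative path)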

Elasticsearch Scaling

Cluster Configuration:

# Elasticsearch cluster with 3 nodes
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  replicas: 3
  serviceName: elasticsearch
  template:
    spec:
      containers:
        - name: elasticsearch
          env:
            - name: discovery.seed_hosts
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"

Sharding Strategy:
- Primary shards: 3-5 per index
- Replica shards: 1-2 per primary
- One index per day or week

Jaeger Scaling

Storage Backend Scaling:
- Memory: Not scalable (single instance)
- Elasticsearch: Scale the Elasticsearch cluster
- Cassandra: Scale the Cassandra cluster

Collector Scaling:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
spec:
  replicas: 3
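
The Jaeger collector reads its storage settings from environment variables. A sketch of the relevant container spec, assuming an Elasticsearch backend; the endpoint and image tag are illustrative:

      containers:
        - name: jaeger-collector
          image: jaegertracing/jaeger-collector:1.57.0  # pin to the version in use
          env:
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200  # illustrative Elasticsearch endpoint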

Capacity Planning

Metrics Collection

Estimate:
- Metrics per service: 100-1,000
- Services: 10-100
- Total metrics: 1,000-100,000
- Sample interval: 15s
- Storage: ~1KB per metric per sample

Calculation:

Samples per day = 86,400 s ÷ 15 s = 5,760
Storage per day = metrics × samples_per_day × size_per_sample
Storage per day = 10,000 × 5,760 × 1KB = ~55GB

Trace Collection

Estimate:
- Traces per request: 1
- Requests per second: 1,000
- Spans per trace: 10
- Storage: ~1KB per span

Calculation:

Spans per day = requests_per_second × spans_per_trace × seconds_per_day
Spans per day = 1,000 × 10 × 86,400 = 864,000,000
Storage per day = 864,000,000 × 1KB = ~824GB

Log Collection

Estimate:
- Logs per service: 100-1,000 per minute
- Services: 10-100
- Storage: ~500 bytes per log

Calculation:

Logs per day = services × logs_per_minute × minutes_per_day
Logs per day = 50 × 500 × 1,440 = 36,000,000
Storage per day = 36,000,000 × 500 bytes = ~17GB

Scaling Strategies

Proactive Scaling

Based on Predictable Patterns:
- Time-based scaling (business hours)
- Event-based scaling (releases)
- Calendar-based scaling (holidays)
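
One way to implement time-based scaling is a CronJob that patches the replica count ahead of business hours. This is a sketch only; it assumes a ServiceAccount (here called collector-scaler) with permission to scale the Deployment:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-collectors
spec:
  schedule: "0 7 * * 1-5"  # weekdays at 07:00, before business hours
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: collector-scaler  # assumed RBAC: allowed to scale deployments
          restartPolicy: OnFailure
          containers:
            - name: scale
              image: bitnami/kubectl:1.29  # illustrative kubectl image
              command: ["kubectl", "scale", "deployment/otel-collector", "--replicas=6"]

A matching CronJob can scale the Deployment back down after hours.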

Reactive Scaling

Based on Metrics:
- CPU utilization
- Memory usage
- Queue depth
- Error rates
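
Reactive scaling on CPU is what the HPA shown earlier already provides. A sketch extending it with a memory target and a scale-down stabilization window so replicas do not flap; the thresholds are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes of low usage before scaling down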

Predictive Scaling

Based on ML Models:
- Predict future load
- Scale before demand
- Optimize resource usage

Cost Optimization

Data Volume Reduction

Sampling:

processors:
  probabilistic_sampler:
    sampling_percentage: 10.0  # Sample 10%

Filtering:

processors:
  filter:
    traces:
      span:
        - 'attributes["http.status_code"] == 200'

Aggregation:
- Pre-aggregate metrics
- Reduce cardinality
- Use histograms
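
Pre-aggregation can be done with Prometheus recording rules. A sketch that rolls a per-instance counter up to a per-job rate; http_requests_total is only an example metric name:

groups:
  - name: pre-aggregation
    rules:
      - record: job:http_requests:rate5m  # aggregated series with lower cardinality
        expr: sum by (job) (rate(http_requests_total[5m]))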

Resource Optimization

Right-sizing:
- Match resources to workload
- Regular reviews
- Adjust based on metrics

Auto-scaling:
- Scale down during low usage
- Scale up during high usage
- Use spot instances for non-critical workloads

Monitoring Scaling

Key Metrics

Collector:
- otelcol_receiver_accepted_spans
- otelcol_exporter_send_failed_spans
- otelcol_processor_batch_batch_send_size
- CPU and memory usage
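
These otelcol_* metrics come from the Collector's internal telemetry endpoint, which serves Prometheus metrics on port 8888 by default. A sketch of a scrape job for it; the target address is illustrative, and the endpoint's bind address may need to be exposed beyond localhost:

scrape_configs:
  - job_name: 'otel-collector-internal'
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'otel-collector:8888'  # Collector internal telemetry endpoint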

Backends:
- Ingestion rate
- Storage usage
- Query latency
- Error rates

Alerts

Scale-up Alerts:
- High CPU utilization (>80%)
- High memory usage (>80%)
- Queue depth increasing
- Error rates increasing
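
A scale-up signal can be expressed as a Prometheus alerting rule. A sketch based on the exporter-failure metric listed above; the threshold, duration, and exact metric name (which may carry a _total suffix depending on the Collector version) are illustrative:

groups:
  - name: collector-scaling
    rules:
      - alert: CollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector is failing to export spans; check the backend or scale out"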

Scale-down Alerts:
- Low CPU utilization (<30%)
- Low memory usage (<30%)
- Low ingestion rate
- No errors

Further Reading