Observability Scaling¶
This document covers strategies for scaling observability infrastructure.
Scaling Dimensions¶
Horizontal Scaling¶
Add more instances to handle increased load.
Benefits:
- Linear scaling
- No downtime during scale-out
- Better fault tolerance

Challenges:
- State management
- Load distribution
- Coordination
Vertical Scaling¶
Increase resources of existing instances.
Benefits:
- Simple
- No coordination needed
- Lower complexity

Challenges:
- Limited by hardware
- Downtime for scaling
- Cost at scale
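In Kubernetes, vertical scaling usually means raising a pod's resource requests and limits and letting the pod restart with the new sizing. A minimal sketch (the values are illustrative, not recommendations):

resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi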
Collector Scaling¶
Horizontal Scaling Pattern¶
Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3  # Scale based on load
  strategy:
    type: RollingUpdate
  # selector and pod template omitted for brevity
Auto-scaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Load Balancing¶
Service Configuration:
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  type: ClusterIP
  ports:
    - port: 4317        # OTLP gRPC
      targetPort: 4317
  selector:
    app: otel-collector
Performance Tuning¶
Batch Processing:
processors:
  batch:
    timeout: 5s               # Increase for better batching
    send_batch_size: 2048
    send_batch_max_size: 4096
Memory Management:
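A common approach is the collector's memory_limiter processor, which applies back-pressure before the process runs out of memory. A minimal sketch, assuming a container memory limit of roughly 2GiB (the numbers are illustrative):

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1600        # hard limit, kept below the container memory limit
    spike_limit_mib: 400   # headroom reserved for short bursts

The memory_limiter is typically placed first in each pipeline, ahead of the batch processor.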
Backend Scaling¶
Prometheus Scaling¶
Federation Pattern:
# Global Prometheus
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="otel-collector"}'
    static_configs:
      - targets:
          - 'prometheus-region-1:9090'
          - 'prometheus-region-2:9090'
Thanos for Long-term Storage:
- Global query view across Prometheus instances
- Long-term retention in object storage
- Deduplication of HA pairs
- Compaction and downsampling
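As an illustration of the pattern, each Prometheus instance can run a Thanos sidecar that uploads TSDB blocks to object storage. The container snippet below is a sketch; the image tag and the objstore.yaml path are assumptions, and the bucket configuration itself is not shown:

containers:
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.34.1   # pin to the version you have validated
    args:
      - sidecar
      - --tsdb.path=/prometheus
      - --prometheus.url=http://localhost:9090
      - --objstore.config-file=/etc/thanos/objstore.yaml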
Elasticsearch Scaling¶
Cluster Configuration:
# Elasticsearch cluster with 3 nodes
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  replicas: 3
  serviceName: elasticsearch
  template:
    spec:
      containers:
        - name: elasticsearch
          env:
            - name: discovery.seed_hosts
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
          # selector, image, resources, and volume claims omitted for brevity
Sharding Strategy:
- Primary shards: 3-5 per index
- Replica shards: 1-2 per primary
- One index per day or week
Jaeger Scaling¶
Storage Backend Scaling:
- Memory: not scalable (single instance)
- Elasticsearch: scale the Elasticsearch cluster
- Cassandra: scale the Cassandra cluster
Collector Scaling:
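The jaeger-collector component is stateless, so it can be scaled horizontally like any Deployment. A sketch assuming the Elasticsearch backend above (the image tag and URL are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
spec:
  replicas: 3                 # stateless; scale with ingest volume
  selector:
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector
    spec:
      containers:
        - name: jaeger-collector
          image: jaegertracing/jaeger-collector:1.57.0
          env:
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200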
Capacity Planning¶
Metrics Collection¶
Estimate:
- Metrics per service: 100-1,000
- Services: 10-100
- Total metrics: 1,000-100,000
- Sample rate: 15s
- Storage: ~1KB per metric per sample
Calculation (assuming 10,000 active metrics; at a 15s sample rate that is 86,400 / 15 = 5,760 samples per day):
Storage per day = metrics × samples_per_day × size_per_sample
Storage per day = 10,000 × 5,760 × 1KB ≈ 55GB
Trace Collection¶
Estimate:
- Traces per request: 1
- Requests per second: 1,000
- Spans per trace: 10
- Storage: ~1KB per span
Calculation:
Spans per day = requests_per_second × spans_per_trace × seconds_per_day
Spans per day = 1,000 × 10 × 86,400 = 864,000,000
Storage per day = 864,000,000 × 1KB = ~824GB
Log Collection¶
Estimate:
- Logs per service: 100-1,000 per minute
- Services: 10-100
- Storage: ~500 bytes per log
Calculation (assuming 50 services emitting 500 logs per minute each):
Logs per day = services × logs_per_minute × minutes_per_day
Logs per day = 50 × 500 × 1,440 = 36,000,000
Storage per day = 36,000,000 × 500 bytes = ~17GB
Scaling Strategies¶
Proactive Scaling¶
Based on Predictable Patterns:
- Time-based scaling (business hours); see the scheduled-scaling sketch below
- Event-based scaling (releases)
- Calendar-based scaling (holidays)
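A sketch of time-based scaling using KEDA's cron scaler. This assumes KEDA is installed in the cluster; the timezone, schedule, and replica counts are illustrative:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: otel-collector-business-hours
spec:
  scaleTargetRef:
    name: otel-collector
  minReplicaCount: 2
  triggers:
    - type: cron
      metadata:
        timezone: Europe/Berlin
        start: "0 8 * * 1-5"        # scale up at 08:00 on weekdays
        end: "0 18 * * 1-5"         # return to minReplicaCount at 18:00
        desiredReplicas: "6"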
Reactive Scaling¶
Based on Metrics:
- CPU utilization
- Memory usage
- Queue depth
- Error rates
Predictive Scaling¶
Based on ML Models:
- Predict future load
- Scale before demand
- Optimize resource usage
Cost Optimization¶
Data Volume Reduction¶
Sampling:
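For traces, head sampling in the collector with the probabilistic_sampler processor is a common starting point; the 10% rate below is illustrative:

processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep roughly 1 in 10 traces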
Filtering:
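The filter processor can drop low-value telemetry before it reaches the exporters; the health-check route below is an assumption about your services:

processors:
  filter/drop-noise:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'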
Aggregation:
- Pre-aggregate metrics
- Reduce cardinality
- Use histograms
Resource Optimization¶
Right-sizing:
- Match resources to workload
- Regular reviews
- Adjust based on metrics

Auto-scaling:
- Scale down during low usage (see the HPA behavior sketch below)
- Scale up during high usage
- Use spot instances for non-critical workloads
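To avoid flapping when load briefly drops, the HorizontalPodAutoscaler shown earlier can be given explicit scale-down behavior; the window and policy values below are illustrative:

spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # require 5 minutes of low load before removing pods
      policies:
        - type: Percent
          value: 50                     # remove at most half of the current pods per minute
          periodSeconds: 60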
Monitoring Scaling¶
Key Metrics¶
Collector:
- otelcol_receiver_accepted_spans
- otelcol_exporter_send_failed_spans
- otelcol_processor_batch_batch_send_size
- CPU and memory usage
Backends:
- Ingestion rate
- Storage usage
- Query latency
- Error rates
Alerts¶
Scale-up Alerts:
- High CPU utilization (>80%)
- High memory usage (>80%)
- Queue depth increasing
- Error rates increasing

Scale-down Alerts:
- Low CPU utilization (<30%)
- Low memory usage (<30%)
- Low ingestion rate
- No errors
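The scale-up conditions above can be encoded as Prometheus alerting rules. The expressions below are sketches that assume cAdvisor container metrics and the collector's own metrics are being scraped; the thresholds are illustrative:

groups:
  - name: otel-collector-scaling
    rules:
      - alert: CollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "otel-collector is failing to export spans"
      - alert: CollectorHighCpu
        # fires when a collector pod averages more than 0.8 CPU cores over 5 minutes
        expr: avg by (pod) (rate(container_cpu_usage_seconds_total{pod=~"otel-collector-.*"}[5m])) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "otel-collector pod {{ $labels.pod }} CPU usage is high"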
Further Reading¶
- Observability Production Patterns
- OpenTelemetry Collector
- Observability Stacks: Observability stack comparison and selection