Observability Scaling
This document covers strategies for scaling observability infrastructure.
Scaling Dimensions
Horizontal Scaling
Add more instances to handle increased load.
Benefits:
- Near-linear capacity growth for stateless components
- No downtime when adding or removing instances
- Better fault tolerance
Challenges:
- State management
- Load distribution
- Coordination
Vertical Scaling
Increase resources of existing instances.
Benefits:
- Simple
- No coordination needed
- Lower complexity
Challenges:
- Limited by the largest available machine size
- Resizing usually requires a restart, which means downtime
- Cost grows steeply at the high end
Collector Scaling
Horizontal Scaling Pattern
Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3  # Scale based on load
  strategy:
    type: RollingUpdate
Auto-scaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
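To get a feel for how the autoscaler above reacts to load, the following sketch applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current × currentMetric / targetMetric), clamped to the configured bounds. The function name is hypothetical, and the controller's tolerance band and stabilization windows are deliberately left out, so treat it as an approximation.

```python
import math


def desired_replicas(current_replicas, current_utilization,
                     target_utilization=70, min_replicas=2, max_replicas=10):
    """Approximate the HPA core formula (no tolerance, no stabilization)."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    # Clamp to the minReplicas/maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 3 replicas at 95% CPU against a 70% target: 3 * 95 / 70 = 4.07, rounded up.
    print(desired_replicas(3, 95))
    # Heavy overload is capped at maxReplicas.
    print(desired_replicas(8, 140))
```

With the values from the manifest, three replicas running at 95% CPU would be scaled to five, and the result never leaves the 2 to 10 range.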
Load Balancing
Service Configuration:
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  type: ClusterIP
  ports:
    - port: 4317
      targetPort: 4317
  selector:
    app: otel-collector
Performance Tuning
Batch Processing:
processors:
  batch:
    timeout: 5s  # Increase for better batching
    send_batch_size: 2048
    send_batch_max_size: 4096
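The three batch settings interact: a batch is sent once the buffer reaches send_batch_size or the timeout fires, and send_batch_max_size caps the size of any single batch. The sketch below is a simplified model of that behavior, not the collector's implementation; the function name and the return shape are assumptions made for illustration.

```python
def flush_sizes(pending, send_batch_size=2048, send_batch_max_size=4096,
                timed_out=False):
    """Return (sizes of batches sent, items still buffered) in a simplified model."""
    sizes = []
    # Size trigger: keep sending while the buffer holds at least send_batch_size items.
    while pending >= send_batch_size:
        size = min(pending, send_batch_max_size)  # never exceed the max batch size
        sizes.append(size)
        pending -= size
    # Timeout trigger: send whatever is left, however small.
    if timed_out and pending > 0:
        sizes.append(pending)
        pending = 0
    return sizes, pending


if __name__ == "__main__":
    print(flush_sizes(1000))                    # below the size trigger, keeps buffering
    print(flush_sizes(10000))                   # split into capped batches
    print(flush_sizes(10000, timed_out=True))   # timeout flushes the remainder
```

In this model a longer timeout gives small trickles of data more time to accumulate into a full batch, at the cost of added export latency.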
Memory Management:
Backend Scaling
Prometheus Scaling
Federation Pattern:
# Global Prometheus
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="otel-collector"}'
    static_configs:
      - targets:
          - 'prometheus-region-1:9090'
          - 'prometheus-region-2:9090'
Thanos for Long-term Storage:
- Global query view
- Long-term retention
- Deduplication
- Compression
Elasticsearch Scaling
Cluster Configuration:
# Elasticsearch cluster with 3 nodes
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  replicas: 3
  serviceName: elasticsearch
  template:
    spec:
      containers:
        - name: elasticsearch
          env:
            - name: discovery.seed_hosts
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
Sharding Strategy:
- Primary shards: 3-5 per index
- Replica shards: 1-2 per primary
- Index per day/week
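The sharding choices above multiply quickly once indices are retained for a while. As a rough sketch, the total shard count is primaries × (1 + replicas) × retained indices. The function names and the 30-day retention used in the example are assumptions for illustration, not values from this document.

```python
import math


def total_shards(primaries_per_index, replicas_per_primary, indices_retained):
    """Each primary has `replicas_per_primary` copies, so count 1 + replicas per primary."""
    return primaries_per_index * (1 + replicas_per_primary) * indices_retained


def shards_per_node(shard_total, nodes):
    """Average shards each node has to host, rounded up."""
    return math.ceil(shard_total / nodes)


if __name__ == "__main__":
    # Assumed: 3 primaries, 1 replica, daily indices kept for 30 days, 3 nodes.
    total = total_shards(3, 1, 30)
    print(total, shards_per_node(total, 3))
```

Switching from daily to weekly indices with the same retention cuts the index count, and therefore the shard count, by roughly a factor of seven.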
Jaeger Scaling
Storage Backend Scaling:
- Memory: Not scalable (single instance)
- Elasticsearch: Scale Elasticsearch cluster
- Cassandra: Scale Cassandra cluster
Collector Scaling:
Capacity Planning
Metrics Collection
Estimate:
- Metrics per service: 100-1000
- Services: 10-100
- Total metrics: 1,000-100,000
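- Scrape interval: 15s (5,760 samples per day)
- Storage: ~1KB per metric per sample (a conservative, uncompressed figure; a compressing TSDB such as Prometheus typically needs far less)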
Calculation:
Storage per day = metrics × samples_per_day × size_per_sample
Storage per day = 10,000 × 5,760 × 1KB = ~55GB (taking 10,000 metrics as a mid-range value)
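The calculation above can be reproduced for other inputs with a small helper. This is a minimal sketch: the function name is hypothetical, and it assumes 1KB means 1,024 bytes and reports the result in binary gigabytes, which is how the ~55GB figure comes out.

```python
SECONDS_PER_DAY = 86_400


def metrics_storage_gib(metric_count, scrape_interval_s, bytes_per_sample):
    """Daily storage for metrics in GiB (1 GiB = 1024**3 bytes)."""
    samples_per_day = SECONDS_PER_DAY // scrape_interval_s  # 5,760 at a 15s interval
    total_bytes = metric_count * samples_per_day * bytes_per_sample
    return total_bytes / 1024**3


if __name__ == "__main__":
    # 10,000 metrics, 15s interval, 1KB per sample
    print(round(metrics_storage_gib(10_000, 15, 1024), 1))
```

Running it for the low and high ends of the estimate (1,000 and 100,000 metrics) gives the range to plan for, roughly 5.5 to 550 GiB per day under the same assumptions.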
Trace Collection
Estimate:
- Traces per request: 1
- Requests per second: 1000
- Spans per trace: 10
- Storage: ~1KB per span
Calculation:
Spans per day = requests_per_second × spans_per_trace × seconds_per_day
Spans per day = 1,000 × 10 × 86,400 = 864,000,000
Storage per day = 864,000,000 × 1KB = ~824GB
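The same arithmetic as a helper, which also makes it easy to see what sampling does to the total. This is a sketch under stated assumptions: the function names and the sample_ratio parameter are hypothetical, and 1KB is taken as 1,024 bytes with the result in binary gigabytes.

```python
SECONDS_PER_DAY = 86_400


def spans_per_day(requests_per_second, spans_per_trace):
    """Spans generated per day when every request produces one trace."""
    return requests_per_second * spans_per_trace * SECONDS_PER_DAY


def trace_storage_gib(requests_per_second, spans_per_trace, bytes_per_span,
                      sample_ratio=1.0):
    """Daily trace storage in GiB; sample_ratio is the fraction of traces kept."""
    kept_spans = spans_per_day(requests_per_second, spans_per_trace) * sample_ratio
    return kept_spans * bytes_per_span / 1024**3


if __name__ == "__main__":
    print(spans_per_day(1000, 10))
    print(round(trace_storage_gib(1000, 10, 1024)))            # all traces kept
    print(round(trace_storage_gib(1000, 10, 1024, 0.1), 1))    # 10% of traces kept
```

Keeping 10% of traces reduces the daily volume in direct proportion, from about 824 to about 82 in the same units.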
Log Collection
Estimate:
- Logs per service: 100-1000 per minute
- Services: 10-100
- Storage: ~500 bytes per log
Calculation:
Logs per day = services × logs_per_minute × minutes_per_day
Logs per day = 50 × 500 × 1,440 = 36,000,000
Storage per day = 36,000,000 × 500 bytes = ~17GB
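Because both inputs in the estimate are ranges, it helps to compute the low and high ends as well as the mid-range case used above. The helper names below are hypothetical, and the result is reported in binary gigabytes to match the ~17GB figure.

```python
MINUTES_PER_DAY = 1_440


def logs_per_day(services, logs_per_minute):
    """Total log records per day across all services."""
    return services * logs_per_minute * MINUTES_PER_DAY


def log_storage_gib(services, logs_per_minute, bytes_per_log=500):
    """Daily log storage in GiB (1 GiB = 1024**3 bytes)."""
    return logs_per_day(services, logs_per_minute) * bytes_per_log / 1024**3


if __name__ == "__main__":
    for services, rate in [(10, 100), (50, 500), (100, 1000)]:
        print(services, rate, round(log_storage_gib(services, rate), 2))
```

Under these assumptions the daily volume spans roughly two orders of magnitude, from under 1 GiB at the low end to about 67 GiB at the high end.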
Scaling Strategies
Proactive Scaling
Based on Predictable Patterns:
- Time-based scaling (business hours)
- Event-based scaling (releases)
- Calendar-based scaling (holidays)
Reactive Scaling
Based on Metrics:
- CPU utilization
- Memory usage
- Queue depth
- Error rates
Predictive Scaling
Based on ML Models:
- Predict future load
- Scale before demand
- Optimize resource usage
Cost Optimization
Data Volume Reduction
Sampling:
Filtering:
Aggregation:
- Pre-aggregate metrics
- Reduce cardinality
- Use histograms
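To illustrate why reducing cardinality matters, the sketch below counts distinct time series (unique label sets) before and after dropping a high-cardinality label. The label names and sample data are made up for the example; in practice the dropping would happen in the collector or at scrape time rather than in application code.

```python
def series_count(samples, drop_labels=()):
    """Count distinct label sets after removing the labels in `drop_labels`."""
    seen = set()
    for labels in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        seen.add(kept)
    return len(seen)


# Hypothetical samples: one series per user because of the user_id label.
samples = [
    {"service": "checkout", "route": "/pay", "user_id": str(i)}
    for i in range(1000)
]

if __name__ == "__main__":
    print(series_count(samples))                           # one series per user
    print(series_count(samples, drop_labels=("user_id",))) # collapses to one series
```

Every distinct label set is stored as its own series, so a single unbounded label such as a user or request ID can dominate storage cost.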
Resource Optimization
Right-sizing:
- Match resources to workload
- Regular reviews
- Adjust based on metrics
Auto-scaling:
- Scale down during low usage
- Scale up during high usage
- Use spot instances for non-critical workloads
Monitoring Scaling
Key Metrics
Collector:
- otelcol_receiver_accepted_spans
- otelcol_exporter_send_failed_spans
- otelcol_processor_batch_batch_send_size
- CPU and memory usage
Backends:
- Ingestion rate
- Storage usage
- Query latency
- Error rates
Alerts
Scale-up Alerts:
- High CPU utilization (>80%)
- High memory usage (>80%)
- Queue depth increasing
- Error rates increasing
Scale-down Alerts:
- Low CPU utilization (<30%)
- Low memory usage (<30%)
- Low ingestion rate
- No errors
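The CPU and memory thresholds above can be combined into a single decision, sketched here. How the conditions are combined is an assumption: this version scales up if either resource is hot and scales down only if both are idle, and it ignores queue depth, ingestion rate, and error rates. The function name is hypothetical.

```python
def scaling_signal(cpu_pct, memory_pct, high=80, low=30):
    """Classify utilization as 'scale-up', 'scale-down', or 'hold'."""
    if cpu_pct > high or memory_pct > high:
        return "scale-up"      # either resource above the scale-up threshold
    if cpu_pct < low and memory_pct < low:
        return "scale-down"    # both resources below the scale-down threshold
    return "hold"


if __name__ == "__main__":
    print(scaling_signal(85, 40))
    print(scaling_signal(20, 25))
    print(scaling_signal(50, 50))
```

Requiring both resources to be idle before scaling down is the more cautious choice, since removing capacity while one resource is still busy can trigger an immediate scale-up again.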
Further Reading
- Observability Production Patterns
- OpenTelemetry Collector
- Observability Stacks: Observability stack comparison and selection