Observability Production Patterns
This document outlines production-ready patterns for deploying and operating observability stacks.
High Availability
Collector Deployment
Pattern: Deploy multiple collector instances behind a load balancer
    # Kubernetes example (excerpt; selector and pod template omitted)
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: otel-collector
    spec:
      replicas: 3  # Multiple instances
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 0  # Keep all replicas serving during a rollout
Benefits:
- No single point of failure
- Automatic failover
- Load distribution
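One way to put the replicas behind a single address is a Kubernetes Service. The sketch below assumes the collector pods carry an `app: otel-collector` label and listen for OTLP gRPC on port 4317; neither the label nor the Service name comes from the Deployment excerpt.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-collector        # assumed name
spec:
  selector:
    app: otel-collector       # assumed pod label
  ports:
    - name: otlp-grpc
      protocol: TCP
      port: 4317
      targetPort: 4317
```

A plain Service balances per connection, so long-lived gRPC connections can stay pinned to one replica. Client-side or L7 load balancing may be needed to spread traffic evenly.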
Backend Redundancy
Pattern: Deploy backends in HA configuration
- Prometheus: Use Prometheus Operator with Thanos for long-term storage
- Jaeger: Use Elasticsearch or Cassandra backend
- ELK: Deploy Elasticsearch cluster with multiple nodes
Scaling Strategies
Horizontal Scaling
Collector:
- Deploy multiple collector instances
- Use load balancer for distribution
- Scale based on CPU/memory metrics
Backends:
- Prometheus: Use federation or Thanos
- Elasticsearch: Add nodes to cluster
- Jaeger: Scale based on storage backend
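Scaling the collector on CPU and memory could be expressed with a HorizontalPodAutoscaler. This is a minimal sketch: the replica bounds and utilization targets are assumptions, and it only works if the collector pods declare resource requests and a metrics source such as metrics-server is installed.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 3              # assumed lower bound
  maxReplicas: 10             # assumed upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # assumed target
```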
Vertical Scaling
When to Use:
- Single-instance deployments
- Resource-constrained environments
- Cost optimization
Considerations:
- Monitor resource usage
- Set appropriate limits
- Plan for capacity
Resource Allocation
Collector Resources
Development:
Production:
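As a rough illustration of how the two profiles might differ, the fragments below show container `resources` blocks for each. Every number is an assumption chosen for illustration, not a measured recommendation; size them against observed collector usage.

```yaml
# Development (illustrative values)
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
---
# Production (illustrative values)
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 2Gi
```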
Backend Resources
Prometheus:
- Memory: 2-4GB (depends on retention)
- CPU: 1-2 cores
- Storage: 50-100GB (depends on retention)
Elasticsearch:
- Memory: 4-8GB per node
- CPU: 2-4 cores per node
- Storage: 100GB+ per node
Jaeger:
- Memory: 512MB-1GB
- CPU: 500m-1000m
- Storage: Depends on backend
Security Hardening
Network Security
Network Policies:
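    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: observability-network-policy
    spec:
      podSelector:
        matchLabels:
          component: observability
      policyTypes:
        - Ingress
        - Egress
      # No egress rules are listed, so all egress from the selected pods is
      # denied. Add egress rules for the backends and DNS, or drop the Egress
      # policy type.
      ingress:
        - from:
            - namespaceSelector:
                matchLabels:
                  name: default  # The namespace must carry this label
          ports:
            - protocol: TCP
              port: 4317  # OTLP gRPC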
TLS/SSL
Collector Configuration:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            tls:
              cert_file: /etc/certs/server.crt
              key_file: /etc/certs/server.key
    exporters:
      otlp/jaeger:
        endpoint: jaeger:4317
        tls:
          cert_file: /etc/certs/client.crt
          key_file: /etc/certs/client.key
Authentication
API Keys:
OAuth2:
- Use service mesh (Istio, Linkerd) for mTLS
- Implement OAuth2 for API access
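For OAuth2 on outbound traffic, one option is the `oauth2client` extension from the collector's contrib distribution, which fetches a token with the client credentials flow and attaches it to exporter requests. The fragment below is a sketch: the client ID, scope, token URL, exporter name, and endpoint are all placeholders, and the secret is assumed to arrive through an environment variable.

```yaml
extensions:
  oauth2client:
    client_id: observability-collector                 # hypothetical client ID
    client_secret: ${env:OAUTH_CLIENT_SECRET}          # assumed env var
    token_url: https://auth.example.com/oauth2/token   # placeholder URL
    scopes: ["telemetry.write"]                        # hypothetical scope

exporters:
  otlp/backend:                                        # hypothetical exporter name
    endpoint: backend.example.com:4317                 # placeholder endpoint
    auth:
      authenticator: oauth2client

service:
  extensions: [oauth2client]
```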
Secrets Management
Kubernetes Secrets:
    apiVersion: v1
    kind: Secret
    metadata:
      name: observability-secrets
    type: Opaque
    stringData:
      seq-api-key: <api-key>
      azure-connection-string: <connection-string>
Azure Key Vault / AWS Secrets Manager:
- Store sensitive configuration
- Rotate secrets regularly
- Audit secret access
Data Retention
Metrics Retention
Prometheus:
Considerations:
- Storage costs
- Query performance
- Compliance requirements
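Prometheus retention is set through server flags rather than the configuration file. The fragment below sketches the container arguments for a time limit and a size cap; the 15d and 50GB values and the file paths are assumptions to adjust against the considerations above.

```yaml
# Fragment of a Prometheus container spec
args:
  - --config.file=/etc/prometheus/prometheus.yml   # assumed path
  - --storage.tsdb.path=/prometheus                # assumed path
  - --storage.tsdb.retention.time=15d              # assumed retention window
  - --storage.tsdb.retention.size=50GB             # assumed size cap
```

When both limits are set, data is removed as soon as either one is exceeded.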
Trace Retention
Jaeger:
- In-memory storage: 1-2 days
- Elasticsearch: 7-30 days
- Long-term: Archive to object storage
Log Retention
Elasticsearch:
    # Index lifecycle policy
    {
      "policy": {
        "phases": {
          "hot": {
            "min_age": "0d",
            "actions": {
              "rollover": {
                "max_size": "50GB"
              }
            }
          },
          "delete": {
            "min_age": "30d",
            "actions": {
              "delete": {}
            }
          }
        }
      }
    }
Monitoring the Monitor
Collector Health
Health Checks:
    livenessProbe:
      httpGet:
        path: /
        port: 13133
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /
        port: 13133
      initialDelaySeconds: 5
      periodSeconds: 5
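The probes above only succeed if the collector actually serves a health endpoint on port 13133, which is the job of the `health_check` extension. A minimal sketch of enabling it, assuming the default path:

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
```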
Metrics Monitoring:
- Monitor collector's own metrics
- Alert on high error rates
- Track resource usage
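An alert on collector export errors might look like the Prometheus rule below. Treat it as a sketch: the metric name is the collector's failed-span counter as exposed by some versions (others append `_total`), and the alert name, window, and severity are assumptions.

```yaml
groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorExportFailures          # hypothetical alert name
        # Check the exact metric name against the collector's own metrics endpoint.
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m                                    # assumed duration
        labels:
          severity: warning
        annotations:
          summary: Collector exporter is failing to send spans
```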
Backend Health
Prometheus:
- Monitor Prometheus targets
- Alert on scrape failures
- Track ingestion rate
Elasticsearch:
- Monitor cluster health
- Alert on yellow/red status
- Track index sizes
Jaeger:
- Monitor storage backend
- Alert on ingestion failures
- Track trace volume
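Two of the backend checks above can be sketched as Prometheus alerting rules. The `up` series is built into Prometheus; the Elasticsearch metric assumes the community `elasticsearch_exporter` is being scraped, and the alert names and durations are illustrative.

```yaml
groups:
  - name: observability-backends
    rules:
      - alert: ScrapeTargetDown                     # hypothetical alert name
        expr: up == 0
        for: 5m                                     # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "Scrape target {{ $labels.instance }} is down"
      - alert: ElasticsearchClusterRed              # hypothetical alert name
        # Assumes metrics from the community elasticsearch_exporter.
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 5m                                     # assumed duration
        labels:
          severity: critical
        annotations:
          summary: Elasticsearch cluster status is red
```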
Disaster Recovery
Backup Strategies
Prometheus:
- Backup configuration files
- Export snapshots
- Use Thanos for long-term storage
Elasticsearch:
- Snapshot to object storage
- Cross-region replication
- Index lifecycle policies
Jaeger:
- Backup storage backend
- Export traces for archival
Recovery Procedures
Collector Failure:
- Automatic failover to backup instances
- Restore from configuration backup
- Verify data flow
Backend Failure:
- Failover to backup backend
- Restore from backups
- Replay data if needed
Data Loss:
- Restore from backups
- Replay from application logs
- Verify data integrity
Performance Optimization
Batch Processing
    processors:
      batch:
        timeout: 5s  # Increase for better batching
        send_batch_size: 2048  # Larger batches
        send_batch_max_size: 4096
Sampling
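One simple approach is head sampling with the `probabilistic_sampler` processor from the collector's contrib distribution. The sketch below keeps roughly 10% of traces; the percentage is an assumption, and the pipeline reuses the `otlp` receiver, `batch` processor, and `otlp/jaeger` exporter shown elsewhere in this document.

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # assumed rate; keeps about 10% of traces

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp/jaeger]
```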
Resource Optimization
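One common lever here is the `memory_limiter` processor, which starts refusing data before the collector runs out of memory. The limits below are assumptions and should sit below the container's memory limit; the pipeline reuses the `otlp` receiver, `batch` processor, and `otlp/jaeger` exporter shown elsewhere in this document.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500        # assumed; keep below the container memory limit
    spike_limit_mib: 300   # assumed headroom for bursts

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter runs first
      exporters: [otlp/jaeger]
```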
Cost Optimization
Data Volume Reduction
- Sampling: Reduce trace volume
- Filtering: Drop unnecessary data
- Aggregation: Pre-aggregate metrics
- Retention: Reduce retention periods
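Filtering can be done in the collector with the contrib `filter` processor, which drops telemetry matching an OTTL condition. The sketch below drops spans for a health-check route; the processor name suffix and the `/healthz` route are hypothetical.

```yaml
processors:
  filter/drop-health-checks:        # hypothetical instance name
    error_mode: ignore
    traces:
      span:
        # Spans matching this condition are dropped.
        - 'attributes["http.route"] == "/healthz"'   # hypothetical route
```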
Resource Optimization
- Right-sizing: Match resources to workload
- Auto-scaling: Scale based on demand
- Spot Instances: Use for non-critical workloads
- Reserved Instances: For predictable workloads
Compliance and Governance
Data Privacy
- PII Redaction: Remove sensitive data
- Data Masking: Mask sensitive fields
- Access Control: Restrict access to sensitive data
- Audit Logging: Log all access
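Redaction and masking can be applied before data leaves the collector, for example with the `attributes` processor. In this sketch the attribute keys are hypothetical; `delete` removes the attribute and `hash` replaces its value with a hash.

```yaml
processors:
  attributes/redact:           # hypothetical instance name
    actions:
      - key: user.email        # hypothetical attribute name
        action: delete
      - key: user.id           # hypothetical attribute name
        action: hash
```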
Retention Policies
- Define Policies: Based on compliance requirements
- Automate Deletion: Use lifecycle policies
- Archive: Move old data to cold storage
- Documentation: Document retention policies