Observability Production Patterns¶
This document outlines production-ready patterns for deploying and operating observability stacks.
High Availability¶
Collector Deployment¶
Pattern: Deploy multiple collector instances behind a load balancer
# Kubernetes example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3  # Multiple instances
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
Benefits:
- No single point of failure
- Automatic failover
- Load distribution
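The load balancer in this pattern could be a plain Kubernetes Service in front of the Deployment. The sketch below assumes the collector pods carry an `app: otel-collector` label (the manifest above does not show pod labels) and that OTLP/gRPC is served on port 4317, as in the receiver configuration later in this document.

```yaml
# Sketch only: the app label is an assumption, not taken from the Deployment above.
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  type: ClusterIP
  selector:
    app: otel-collector   # hypothetical pod label
  ports:
    - name: otlp-grpc
      protocol: TCP
      port: 4317
      targetPort: 4317
```

A ClusterIP Service balances per connection, so long-lived gRPC connections may stay pinned to one pod. A headless Service with client-side balancing is a common alternative.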
Backend Redundancy¶
Pattern: Deploy backends in HA configuration
- Prometheus: Use Prometheus Operator with Thanos for long-term storage
- Jaeger: Use Elasticsearch or Cassandra backend
- ELK: Deploy Elasticsearch cluster with multiple nodes
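For the Prometheus case, a minimal sketch of an HA pair managed by the Prometheus Operator might look like the following. The resource name, Secret name, and retention value are illustrative assumptions, and the Thanos object storage configuration itself is expected to live in the referenced Secret.

```yaml
# Sketch only: names and values are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: observability          # hypothetical name
spec:
  replicas: 2                  # two identical instances scrape the same targets
  retention: 15d               # local retention; Thanos keeps the long-term copy
  thanos:
    objectStorageConfig:       # Secret key holding the Thanos object storage config
      name: thanos-objstore    # hypothetical Secret name
      key: objstore.yml
```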
Scaling Strategies¶
Horizontal Scaling¶
Collector:
- Deploy multiple collector instances
- Use load balancer for distribution
- Scale based on CPU/memory metrics
Backends:
- Prometheus: Use federation or Thanos
- Elasticsearch: Add nodes to cluster
- Jaeger: Scale based on storage backend
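Scaling the collector on CPU and memory could be expressed with a HorizontalPodAutoscaler targeting the `otel-collector` Deployment shown earlier. The replica bounds and utilization thresholds below are assumptions chosen for illustration, not recommendations from this document.

```yaml
# Sketch only: replica bounds and thresholds are illustrative assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```

Utilization targets are computed against container resource requests, so the collector pods need requests set and the cluster needs a metrics source such as metrics-server.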
Vertical Scaling¶
When to Use:
- Single-instance deployments
- Resource-constrained environments
- Cost optimization
Considerations:
- Monitor resource usage
- Set appropriate limits
- Plan for capacity
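Setting limits on a single instance is done in the container spec. The fragment below is a sketch with placeholder values; appropriate numbers depend on the measured usage of the workload.

```yaml
# Fragment of a container spec; all values are illustrative assumptions.
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```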
Resource Allocation¶
Collector Resources¶
Development:
Production:
Backend Resources¶
Prometheus:
- Memory: 2-4GB (depends on retention)
- CPU: 1-2 cores
- Storage: 50-100GB (depends on retention)
Elasticsearch:
- Memory: 4-8GB per node
- CPU: 2-4 cores per node
- Storage: 100GB+ per node
Jaeger:
- Memory: 512MB-1GB
- CPU: 500m-1000m
- Storage: Depends on backend
Security Hardening¶
Network Security¶
Network Policies:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: observability-network-policy
spec:
  podSelector:
    matchLabels:
      component: observability
  policyTypes:
    - Ingress
    - Egress   # no egress rules are defined below, so all outbound traffic is denied
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: default   # requires the namespace to carry this label
      ports:
        - protocol: TCP
          port: 4317
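Because the policy lists Egress under policyTypes without any egress rules, the selected pods cannot open outbound connections, including DNS lookups and exports to backends. A sketch of rules that could be added under `spec` follows; the `app: jaeger` label is a hypothetical stand-in for whatever labels the backend pods actually carry.

```yaml
# Fragment to place under spec; the backend label is a hypothetical example.
egress:
  # Allow DNS resolution
  - to:
      - namespaceSelector: {}
    ports:
      - protocol: UDP
        port: 53
      - protocol: TCP
        port: 53
  # Allow OTLP export to the tracing backend in the same namespace
  - to:
      - podSelector:
          matchLabels:
            app: jaeger   # hypothetical backend pod label
    ports:
      - protocol: TCP
        port: 4317
```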
TLS/SSL¶
Collector Configuration:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/certs/server.crt
          key_file: /etc/certs/server.key
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      cert_file: /etc/certs/client.crt
      key_file: /etc/certs/client.key
Authentication¶
API Keys:
OAuth2:
- Use service mesh (Istio, Linkerd) for mTLS
- Implement OAuth2 for API access
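On the collector side, OAuth2 for outbound API access could be sketched with the `oauth2client` extension, assuming the Collector contrib distribution is in use. The token URL, backend endpoint, exporter name, and environment variable names are all hypothetical.

```yaml
# Sketch only: the token URL, endpoint, and env variable names are hypothetical.
extensions:
  oauth2client:
    client_id: ${env:OAUTH_CLIENT_ID}
    client_secret: ${env:OAUTH_CLIENT_SECRET}
    token_url: https://auth.example.com/oauth2/token

exporters:
  otlphttp/secured:
    endpoint: https://backend.example.com:4318
    auth:
      authenticator: oauth2client   # attach tokens obtained by the extension

service:
  extensions: [oauth2client]
```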
Secrets Management¶
Kubernetes Secrets:
apiVersion: v1
kind: Secret
metadata:
  name: observability-secrets
type: Opaque
stringData:
  seq-api-key: <api-key>
  azure-connection-string: <connection-string>
Azure Key Vault / AWS Secrets Manager:
- Store sensitive configuration
- Rotate secrets regularly
- Audit secret access
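The `observability-secrets` Secret shown above can be exposed to the collector as an environment variable instead of being written into its configuration file. In this sketch the variable name `SEQ_API_KEY` is an assumption; the Secret name and key come from the manifest above.

```yaml
# Fragment of a container spec; the env variable name is a hypothetical choice.
env:
  - name: SEQ_API_KEY
    valueFrom:
      secretKeyRef:
        name: observability-secrets
        key: seq-api-key
```

The collector configuration can then reference the value as `${env:SEQ_API_KEY}`.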
Data Retention¶
Metrics Retention¶
Prometheus:
Considerations:
- Storage costs
- Query performance
- Compliance requirements
Trace Retention¶
Jaeger:
- In-memory storage: 1-2 days
- Elasticsearch storage: 7-30 days
- Long-term: Archive to object storage
Log Retention¶
Elasticsearch:
# Index lifecycle policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0d",
        "actions": {
          "rollover": {
            "max_size": "50GB"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
Monitoring the Monitor¶
Collector Health¶
Health Checks:
livenessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 5
  periodSeconds: 5
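These probes only succeed if the collector actually serves a health endpoint on port 13133. A minimal sketch of the matching collector configuration, using the `health_check` extension, is:

```yaml
# Collector config fragment enabling the endpoint the probes call.
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
```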
Metrics Monitoring:
- Monitor collector's own metrics
- Alert on high error rates
- Track resource usage
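An alert on collector export errors could be sketched as a Prometheus rule like the one below. It assumes the collector's internal metrics are scraped by Prometheus and that the failed-span counter is exposed as `otelcol_exporter_send_failed_spans`; some collector versions append a `_total` suffix, so the name should be checked against the running instance. The alert name, threshold, and durations are illustrative.

```yaml
# Sketch only: metric name suffix, threshold, and durations are assumptions.
groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Collector exporter is failing to send spans
```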
Backend Health¶
Prometheus:
- Monitor Prometheus targets
- Alert on scrape failures
- Track ingestion rate
Elasticsearch:
- Monitor cluster health
- Alert on yellow/red status
- Track index sizes
Jaeger:
- Monitor storage backend
- Alert on ingestion failures
- Track trace volume
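Scrape failures can be caught with the built-in `up` metric, which Prometheus sets to 0 when a target cannot be scraped. The rule below is a sketch; the alert name, duration, and severity label are assumptions.

```yaml
# Sketch only: alert name, duration, and severity are illustrative.
groups:
  - name: backend-health
    rules:
      - alert: ScrapeTargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is down"
```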
Disaster Recovery¶
Backup Strategies¶
Prometheus:
- Backup configuration files
- Export snapshots
- Use Thanos for long-term storage
Elasticsearch:
- Snapshot to object storage
- Cross-region replication
- Index lifecycle policies
Jaeger:
- Backup storage backend
- Export traces for archival
Recovery Procedures¶
- Collector Failure:
  - Automatic failover to backup instances
  - Restore from configuration backup
  - Verify data flow
- Backend Failure:
  - Failover to backup backend
  - Restore from backups
  - Replay data if needed
- Data Loss:
  - Restore from backups
  - Replay from application logs
  - Verify data integrity
Performance Optimization¶
Batch Processing¶
processors:
  batch:
    timeout: 5s                # Flush a partial batch after 5s (the default is 200ms)
    send_batch_size: 2048      # Send as soon as this many items are queued
    send_batch_max_size: 4096  # Upper bound on the size of any single batch
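A processor only takes effect once it is listed in a pipeline. The sketch below wires `batch` into a traces pipeline, reusing the `otlp` receiver and `otlp/jaeger` exporter names from the TLS example earlier in this document; the pipeline layout itself is an assumption.

```yaml
# Sketch only: the pipeline layout is an assumption built from names used above.
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```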
Sampling¶
Resource Optimization¶
Cost Optimization¶
Data Volume Reduction¶
- Sampling: Reduce trace volume
- Filtering: Drop unnecessary data
- Aggregation: Pre-aggregate metrics
- Retention: Reduce retention periods
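Sampling and filtering can both be done in the collector. The sketch below assumes the contrib distribution, keeps roughly 10% of traces with `probabilistic_sampler`, and drops spans for a health-check route with the `filter` processor. The percentage, the processor name suffix, and the `http.route` value are illustrative assumptions.

```yaml
# Sketch only: percentage, processor name suffix, and route are assumptions.
processors:
  probabilistic_sampler:
    sampling_percentage: 10      # keep about 10% of traces
  filter/drop-health:
    error_mode: ignore
    traces:
      span:
        # Spans matching any condition are dropped
        - 'attributes["http.route"] == "/healthz"'
```

Both processors need to be listed in the relevant pipeline under `service.pipelines` before they apply.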
Resource Optimization¶
- Right-sizing: Match resources to workload
- Auto-scaling: Scale based on demand
- Spot Instances: Use for non-critical workloads
- Reserved Instances: For predictable workloads
Compliance and Governance¶
Data Privacy¶
- PII Redaction: Remove sensitive data
- Data Masking: Mask sensitive fields
- Access Control: Restrict access to sensitive data
- Audit Logging: Log all access
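Redaction and masking can be applied before data leaves the collector. The sketch below uses the `attributes` processor to delete one attribute and hash another; the attribute keys and the processor name suffix are hypothetical examples, not fields named in this document.

```yaml
# Sketch only: attribute keys and the processor name suffix are hypothetical.
processors:
  attributes/redact:
    actions:
      - key: user.email
        action: delete   # remove the attribute entirely
      - key: enduser.id
        action: hash     # replace the value with its hash
```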
Retention Policies¶
- Define Policies: Based on compliance requirements
- Automate Deletion: Use lifecycle policies
- Archive: Move old data to cold storage
- Documentation: Document retention policies