Observability Production Patterns

This document outlines production-ready patterns for deploying and operating observability stacks.

High Availability

Collector Deployment

Pattern: Deploy multiple collector instances behind a load balancer

# Kubernetes example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3  # Multiple instances
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

Benefits:

  • No single point of failure
  • Automatic failover
  • Load distribution
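
Inside a cluster, the load balancer in this pattern can be a plain Kubernetes Service. The following is a minimal sketch, assuming the collector pods carry an app: otel-collector label; the Deployment excerpt above does not show its pod labels, so that name is hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  type: ClusterIP
  selector:
    app: otel-collector   # assumed pod label; must match the Deployment's pod template
  ports:
    - name: otlp-grpc
      protocol: TCP
      port: 4317          # OTLP gRPC
      targetPort: 4317
```

gRPC clients hold long-lived connections, so a ClusterIP Service balances per connection rather than per request. A headless Service (clusterIP: None) combined with client-side balancing is a common alternative when load is uneven.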

Backend Redundancy

Pattern: Deploy backends in HA configuration

  • Prometheus: Use Prometheus Operator with Thanos for long-term storage
  • Jaeger: Use Elasticsearch or Cassandra backend
  • ELK: Deploy Elasticsearch cluster with multiple nodes
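
For the Prometheus case, a redundant setup with the Prometheus Operator might look roughly like the excerpt below. It is a sketch, not a complete resource: the resource name, Secret name, and Secret key are hypothetical, and selectors for scrape targets are omitted.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: observability               # hypothetical name
spec:
  replicas: 2                       # identical replicas scrape the same targets
  thanos:
    objectStorageConfig:            # Secret with the Thanos object storage settings
      name: thanos-objstore-config  # hypothetical Secret name
      key: thanos.yaml              # hypothetical key inside the Secret
```

With two replicas each sample is stored twice, so a query layer such as Thanos Query is typically used to deduplicate results.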

Scaling Strategies

Horizontal Scaling

Collector:

  • Deploy multiple collector instances
  • Use load balancer for distribution
  • Scale based on CPU/memory metrics
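
Scaling on CPU and memory can be expressed with a HorizontalPodAutoscaler. This sketch assumes the Deployment is named otel-collector as in the earlier example; the replica ceiling and utilization targets are illustrative values, not recommendations from this document.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 3
  maxReplicas: 10                 # assumed ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # percent of the CPU request
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80  # percent of the memory request
```

Utilization is measured against the container's resource requests, so the pods need requests set for this to work.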

Backends:

  • Prometheus: Use federation or Thanos
  • Elasticsearch: Add nodes to cluster
  • Jaeger: Scale based on storage backend

Vertical Scaling

When to Use:

  • Single-instance deployments
  • Resource-constrained environments
  • Cost optimization

Considerations:

  • Monitor resource usage
  • Set appropriate limits
  • Plan for capacity
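
One way to gather sizing data before changing limits is a VerticalPodAutoscaler in recommendation-only mode. This is a sketch that assumes the VPA components are installed in the cluster and that the target is the otel-collector Deployment; the maxAllowed values are illustrative.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: otel-collector
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  updatePolicy:
    updateMode: "Off"        # publish recommendations only, never evict pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        maxAllowed:          # assumed upper bounds for recommendations
          cpu: "2"
          memory: 2Gi
```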

Resource Allocation

Collector Resources

Development:

resources:
  requests:
    memory: "256Mi"
    cpu: "200m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Production:

resources:
  requests:
    memory: "1Gi"
    cpu: "1000m"
  limits:
    memory: "2Gi"
    cpu: "2000m"

Backend Resources

Prometheus:

  • Memory: 2-4GB (depends mainly on the number of active series)
  • CPU: 1-2 cores
  • Storage: 50-100GB (depends on retention)

Elasticsearch:

  • Memory: 4-8GB per node
  • CPU: 2-4 cores per node
  • Storage: 100GB+ per node

Jaeger:

  • Memory: 512MB-1GB
  • CPU: 500m-1000m
  • Storage: Depends on backend

Security Hardening

Network Security

Network Policies:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: observability-network-policy
spec:
  podSelector:
    matchLabels:
      component: observability
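  policyTypes:
    - Ingress
    # Add Egress only together with egress rules; listing it without any
    # rules blocks all outbound traffic, including exports to backends.
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Label set automatically on every namespace (Kubernetes 1.21+)
              kubernetes.io/metadata.name: default
      ports:
        - protocol: TCP
          port: 4317  # OTLP gRPC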
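
If outbound traffic is restricted as well, the policy needs explicit egress rules, otherwise the selected pods cannot resolve DNS or reach any backend. The sketch below is a separate policy with a hypothetical name, and the app: jaeger label on the backend pods is an assumption.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: observability-egress   # hypothetical name
spec:
  podSelector:
    matchLabels:
      component: observability
  policyTypes:
    - Egress
  egress:
    # DNS lookups to any destination, needed to resolve backend service names
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # OTLP export to backend pods in the same namespace
    - to:
        - podSelector:
            matchLabels:
              app: jaeger      # assumed backend pod label
      ports:
        - protocol: TCP
          port: 4317
```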

TLS/SSL

Collector Configuration:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/certs/server.crt
          key_file: /etc/certs/server.key

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      cert_file: /etc/certs/client.crt
      key_file: /etc/certs/client.key
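
To have both sides verify each other (mutual TLS), the receiver can require client certificates and the exporter can pin the CA that signed the server certificate. A sketch extending the configuration above; the /etc/certs/ca.crt path is an assumption.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/certs/server.crt
          key_file: /etc/certs/server.key
          client_ca_file: /etc/certs/ca.crt   # assumed path; clients must present a cert signed by this CA

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      ca_file: /etc/certs/ca.crt              # assumed path; CA used to verify the server
      cert_file: /etc/certs/client.crt
      key_file: /etc/certs/client.key
```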

Authentication

API Keys:

exporters:
  otlphttp/seq:  # OTLP over HTTP; the exporter appends /v1/logs or /v1/traces to the endpoint
    endpoint: http://seq:5342/ingest/otlp
    headers:
      X-Seq-ApiKey: ${env:SEQ_API_KEY}

OAuth2 and mTLS:

  • Use a service mesh (Istio, Linkerd) for mutual TLS between services
  • Implement OAuth2 for API access
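
On the collector side, OAuth2 client credentials can be attached to an exporter through the oauth2client extension, which ships in the contrib distribution. This is a sketch only: the environment variable names, token URL, exporter name, and backend endpoint are all placeholders, and pipelines are omitted.

```yaml
extensions:
  oauth2client:
    client_id: ${env:OAUTH_CLIENT_ID}                 # hypothetical variable name
    client_secret: ${env:OAUTH_CLIENT_SECRET}         # hypothetical variable name
    token_url: https://auth.example.com/oauth2/token  # placeholder URL

exporters:
  otlphttp/backend:                                   # hypothetical exporter
    endpoint: https://backend.example.com             # placeholder URL
    auth:
      authenticator: oauth2client

service:
  extensions: [oauth2client]
```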

Secrets Management

Kubernetes Secrets:

apiVersion: v1
kind: Secret
metadata:
  name: observability-secrets
type: Opaque
stringData:
  seq-api-key: <api-key>
  azure-connection-string: <connection-string>
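
The collector can then read the key from the Secret through an environment variable, which is where a ${SEQ_API_KEY} reference in the collector configuration would be resolved. A sketch of the relevant part of the pod spec; the container name and image are assumptions.

```yaml
# Excerpt from the collector pod spec
containers:
  - name: otel-collector                         # assumed container name
    image: otel/opentelemetry-collector-contrib  # assumed image; pin a specific tag in practice
    env:
      - name: SEQ_API_KEY
        valueFrom:
          secretKeyRef:
            name: observability-secrets
            key: seq-api-key
```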

Azure Key Vault / AWS Secrets Manager:

  • Store sensitive configuration
  • Rotate secrets regularly
  • Audit secret access

Data Retention

Metrics Retention

Prometheus:

# Prometheus command-line flag (retention is not a prometheus.yml setting)
--storage.tsdb.retention.time=30d  # Keep 30 days

Considerations:

  • Storage costs
  • Query performance
  • Compliance requirements
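
To keep storage costs bounded, a time limit can be paired with a size limit; Prometheus removes old data as soon as either limit is reached. Both are command-line flags. A sketch of container arguments, where the 80GB cap is an assumed value chosen to sit below the volume size:

```yaml
# Excerpt from the Prometheus container spec
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.retention.time=30d
  - --storage.tsdb.retention.size=80GB   # assumed cap
```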

Trace Retention

Jaeger:

  • In-memory storage: 1-2 days (data is lost on restart)
  • Elasticsearch: 7-30 days
  • Long-term: Archive to object storage
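
With an Elasticsearch backend, retention is usually enforced by deleting old daily indices. One option is to run the jaeger-es-index-cleaner image on a schedule. The sketch below assumes a 7-day window and an Elasticsearch service reachable at http://elasticsearch:9200; the schedule and resource name are illustrative too.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: jaeger-es-index-cleaner   # hypothetical name
spec:
  schedule: "55 23 * * *"         # once a day, shortly before midnight
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: index-cleaner
              image: jaegertracing/jaeger-es-index-cleaner  # pin a specific tag in practice
              args: ["7", "http://elasticsearch:9200"]      # days to keep, Elasticsearch URL (both assumed)
```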

Log Retention

Elasticsearch:

# Index lifecycle policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0d",
        "actions": {
          "rollover": {
            "max_size": "50GB"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Monitoring the Monitor

Collector Health

Health Checks:

livenessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 5
  periodSeconds: 5
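
These probes only succeed if the collector's health_check extension is enabled and listening on the probed port. A minimal sketch of the matching collector configuration (other extensions and pipelines omitted):

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # port targeted by the probes above

service:
  extensions: [health_check]
```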

Metrics Monitoring:

  • Monitor collector's own metrics
  • Alert on high error rates
  • Track resource usage
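
An error-rate alert on the collector's own metrics could be written as a Prometheus alerting rule along these lines. The metric name is an assumption about what the deployed collector version exposes (some versions add a _total suffix), and the threshold and durations are illustrative.

```yaml
groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorExportFailures
        # Assumed metric name; check the collector's metrics endpoint for the exact name
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector {{ $labels.instance }} is failing to export spans"
```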

Backend Health

Prometheus:

  • Monitor Prometheus targets
  • Alert on scrape failures
  • Track ingestion rate
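
Scrape failures show up in the built-in up metric, which is 0 for any target whose last scrape failed. A sketch of an alerting rule; the rule name, duration, and severity are illustrative choices.

```yaml
groups:
  - name: prometheus-targets
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.job }}/{{ $labels.instance }} is down"
```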

Elasticsearch:

  • Monitor cluster health
  • Alert on yellow/red status
  • Track index sizes

Jaeger:

  • Monitor storage backend
  • Alert on ingestion failures
  • Track trace volume

Disaster Recovery

Backup Strategies

Prometheus:

  • Backup configuration files
  • Export snapshots
  • Use Thanos for long-term storage

Elasticsearch:

  • Snapshot to object storage
  • Cross-region replication
  • Index lifecycle policies

Jaeger:

  • Backup storage backend
  • Export traces for archival

Recovery Procedures

  1. Collector Failure:
     • Automatic failover to backup instances
     • Restore from configuration backup
     • Verify data flow

  2. Backend Failure:
     • Failover to backup backend
     • Restore from backups
     • Replay data if needed

  3. Data Loss:
     • Restore from backups
     • Replay from application logs
     • Verify data integrity

Performance Optimization

Batch Processing

processors:
  batch:
    timeout: 5s  # Increase for better batching
    send_batch_size: 2048  # Larger batches
    send_batch_max_size: 4096

Sampling

processors:
  probabilistic_sampler:
    sampling_percentage: 10.0  # Sample 10% of traces

Resource Optimization

processors:
  memory_limiter:
    limit_mib: 400
    spike_limit_mib: 100
    check_interval: 5s
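
Processors only take effect once they are listed in a pipeline, and order matters: the memory limiter is normally placed first so it can refuse data before other work is done, and the batch processor last so that sampling happens before batching. A sketch, assuming the otlp receiver and otlp/jaeger exporter from the TLS example:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp/jaeger]
```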

Cost Optimization

Data Volume Reduction

  1. Sampling: Reduce trace volume
  2. Filtering: Drop unnecessary data
  3. Aggregation: Pre-aggregate metrics
  4. Retention: Reduce retention periods
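
As an example of filtering, the contrib filter processor can drop spans that match a condition, such as health-check requests. In this sketch the processor name, the http.route attribute, and the /healthz value are assumptions about how services are instrumented.

```yaml
processors:
  filter/drop-health-checks:   # hypothetical processor name
    error_mode: ignore         # skip conditions that fail to evaluate
    traces:
      span:
        # Spans matching any listed condition are dropped
        - 'attributes["http.route"] == "/healthz"'
```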

Resource Optimization

  1. Right-sizing: Match resources to workload
  2. Auto-scaling: Scale based on demand
  3. Spot Instances: Use for non-critical workloads
  4. Reserved Instances: For predictable workloads

Compliance and Governance

Data Privacy

  1. PII Redaction: Remove sensitive data
  2. Data Masking: Mask sensitive fields
  3. Access Control: Restrict access to sensitive data
  4. Audit Logging: Log all access
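
Redaction and masking can be applied in the collector before data reaches any backend. A sketch using the attributes processor, with hypothetical attribute names:

```yaml
processors:
  attributes/pii:          # hypothetical processor name
    actions:
      - key: user.email    # hypothetical attribute; removed entirely
        action: delete
      - key: user.id       # hypothetical attribute; replaced with a hash of its value
        action: hash
```

Hashing keeps values usable for grouping and correlation without exposing the original data. The processor has to be added to each pipeline it should apply to.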

Retention Policies

  1. Define Policies: Based on compliance requirements
  2. Automate Deletion: Use lifecycle policies
  3. Archive: Move old data to cold storage
  4. Documentation: Document retention policies

Further Reading