Observability Production Patterns

This document outlines production-ready patterns for deploying and operating observability stacks.

High Availability

Collector Deployment

Pattern: Deploy multiple collector instances behind a load balancer

# Kubernetes example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
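spec:
  replicas: 3  # Multiple instances
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Never drop below the desired replica count during a rollout
  selector:
    matchLabels:
      app: otel-collector  # Illustrative label
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib  # Pin a version tag in production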

Benefits:

  • No single point of failure
  • Automatic failover
  • Load distribution
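
Inside a cluster, the load balancer in this pattern can be a plain Kubernetes Service in front of the replicas. A minimal sketch, assuming the collector pods carry an `app: otel-collector` label (illustrative) and accept OTLP over gRPC on port 4317:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  type: ClusterIP
  selector:
    app: otel-collector  # assumed pod label; must match the Deployment's pod template
  ports:
    - name: otlp-grpc
      protocol: TCP
      port: 4317
      targetPort: 4317
```

A ClusterIP Service balances per connection, and gRPC clients hold long-lived connections, so load may spread unevenly across replicas. A headless Service with client-side balancing or an L7 proxy is a common alternative when that matters.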

Backend Redundancy

Pattern: Deploy backends in HA configuration

  • Prometheus: Use Prometheus Operator with Thanos for long-term storage
  • Jaeger: Use Elasticsearch or Cassandra backend
  • ELK: Deploy Elasticsearch cluster with multiple nodes

Scaling Strategies

Horizontal Scaling

Collector:

  • Deploy multiple collector instances
  • Use a load balancer for distribution
  • Scale based on CPU/memory metrics
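
Scaling on CPU and memory can be sketched with a HorizontalPodAutoscaler. The replica bounds and utilization thresholds below are assumptions chosen for illustration, and the target name assumes a Deployment called `otel-collector`:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 3
  maxReplicas: 10  # illustrative ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # assumed threshold
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80  # assumed threshold
```

Utilization is measured against the pods' resource requests, so the collector containers need requests set for this to work.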

Backends:

  • Prometheus: Use federation or Thanos
  • Elasticsearch: Add nodes to the cluster
  • Jaeger: Scale based on the storage backend
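
For the federation option, a global Prometheus scrapes selected series from the `/federate` endpoint of other Prometheus servers. A minimal sketch, in which the target hostnames and the `job` matcher are hypothetical:

```yaml
# prometheus.yml on the global (federating) Prometheus
scrape_configs:
  - job_name: federate
    scrape_interval: 15s
    honor_labels: true  # keep the labels exposed by the source servers
    metrics_path: /federate
    params:
      'match[]':
        - '{job="otel-collector"}'  # assumed job name; selects which series to pull
    static_configs:
      - targets:
          - prometheus-a:9090  # hypothetical source servers
          - prometheus-b:9090
```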

Vertical Scaling

When to Use:

  • Single-instance deployments
  • Resource-constrained environments
  • Cost optimization

Considerations:

  • Monitor resource usage
  • Set appropriate limits
  • Plan for capacity

Resource Allocation

Collector Resources

Development:

resources:
  requests:
    memory: "256Mi"
    cpu: "200m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Production:

resources:
  requests:
    memory: "1Gi"
    cpu: "1000m"
  limits:
    memory: "2Gi"
    cpu: "2000m"

Backend Resources

Prometheus:

  • Memory: 2-4 GB (depends on retention)
  • CPU: 1-2 cores
  • Storage: 50-100 GB (depends on retention)

Elasticsearch:

  • Memory: 4-8 GB per node
  • CPU: 2-4 cores per node
  • Storage: 100 GB+ per node

Jaeger:

  • Memory: 512 MB-1 GB
  • CPU: 500m-1000m
  • Storage: Depends on backend

Security Hardening

Network Security

Network Policies:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: observability-network-policy
spec:
  podSelector:
    matchLabels:
      component: observability
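  policyTypes:
    # Egress is left out on purpose: listing it without any egress rules
    # blocks all outbound traffic, including exports to backends and DNS.
    - Ingress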
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: default  # label Kubernetes sets on every namespace
      ports:
        - protocol: TCP
          port: 4317

TLS/SSL

Collector Configuration:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/certs/server.crt
          key_file: /etc/certs/server.key

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      cert_file: /etc/certs/client.crt
      key_file: /etc/certs/client.key

Authentication

API Keys:

exporters:
  otlphttp/seq:  # OTLP over HTTP; the exporter appends /v1/logs to the endpoint
    endpoint: http://seq:5342/ingest/otlp
    headers:
      X-Seq-ApiKey: ${env:SEQ_API_KEY}

mTLS and OAuth2:

  • Use a service mesh (Istio, Linkerd) for mTLS between services
  • Implement OAuth2 for API access

Secrets Management

Kubernetes Secrets:

apiVersion: v1
kind: Secret
metadata:
  name: observability-secrets
type: Opaque
stringData:
  seq-api-key: <api-key>
  azure-connection-string: <connection-string>
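
One way to consume this Secret is to inject a key as an environment variable in the collector's container spec, which is where the `SEQ_API_KEY` reference in the exporter configuration under Authentication gets its value. A sketch of the container fragment:

```yaml
# Fragment of the collector container spec
env:
  - name: SEQ_API_KEY
    valueFrom:
      secretKeyRef:
        name: observability-secrets
        key: seq-api-key
```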

Azure Key Vault / AWS Secrets Manager:

  • Store sensitive configuration
  • Rotate secrets regularly
  • Audit secret access

Data Retention

Metrics Retention

Prometheus:

# Retention is a command-line flag, not a prometheus.yml setting
prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=30d  # Keep 30 days

Considerations:

  • Storage costs
  • Query performance
  • Compliance requirements
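
When Prometheus is managed by the Prometheus Operator, as suggested under Backend Redundancy, retention is typically set on the `Prometheus` custom resource instead. A minimal sketch, in which the resource name, replica count, and size cap are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: observability  # hypothetical name
spec:
  replicas: 2          # assumed; two replicas for HA
  retention: 30d       # time-based retention
  retentionSize: 80GB  # optional size cap; assumed value
```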

Trace Retention

Jaeger:

  • In-memory storage: 1-2 days
  • Elasticsearch: 7-30 days
  • Long-term: Archive to object storage

Log Retention

Elasticsearch:

# Index lifecycle policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0d",
        "actions": {
          "rollover": {
            "max_size": "50GB"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Monitoring the Monitor

Collector Health

Health Checks:

livenessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 5
  periodSeconds: 5
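
These probes assume the collector's `health_check` extension is enabled and listening on port 13133. A minimal sketch of the matching collector configuration:

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133  # must match the probe port

service:
  extensions: [health_check]
```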

Metrics Monitoring:

  • Monitor the collector's own metrics
  • Alert on high error rates
  • Track resource usage
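
Alerting on error rates could look like the Prometheus rule below. It is a sketch that assumes Prometheus already scrapes the collector's internal metrics, and that the failed-span counter is named `otelcol_exporter_send_failed_spans` (the exact name can vary between collector versions, for example with a `_total` suffix):

```yaml
groups:
  - name: otel-collector
    rules:
      - alert: CollectorExporterFailures
        # Assumed metric name; check the collector's metrics endpoint for the exact spelling
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector exporter {{ $labels.exporter }} is failing to send spans"
```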

Backend Health

Prometheus:

  • Monitor Prometheus targets
  • Alert on scrape failures
  • Track ingestion rate
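
Scrape failures show up in the built-in `up` metric, which is 0 whenever a target could not be scraped. A minimal alerting rule sketch, with the duration and severity chosen for illustration:

```yaml
groups:
  - name: prometheus-targets
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m  # assumed grace period before firing
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is down"
```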

Elasticsearch:

  • Monitor cluster health
  • Alert on yellow/red status
  • Track index sizes

Jaeger:

  • Monitor the storage backend
  • Alert on ingestion failures
  • Track trace volume

Disaster Recovery

Backup Strategies

Prometheus:

  • Back up configuration files
  • Export snapshots
  • Use Thanos for long-term storage

Elasticsearch:

  • Snapshot to object storage
  • Cross-region replication
  • Index lifecycle policies

Jaeger:

  • Back up the storage backend
  • Export traces for archival

Recovery Procedures

  1. Collector Failure:
       • Automatic failover to backup instances
       • Restore from configuration backup
       • Verify data flow

  2. Backend Failure:
       • Failover to backup backend
       • Restore from backups
       • Replay data if needed

  3. Data Loss:
       • Restore from backups
       • Replay from application logs
       • Verify data integrity

Performance Optimization

Batch Processing

processors:
  batch:
    timeout: 5s  # Increase for better batching
    send_batch_size: 2048  # Larger batches
    send_batch_max_size: 4096

Sampling

processors:
  probabilistic_sampler:
    sampling_percentage: 10.0  # Sample 10% of traces

Resource Optimization

processors:
  memory_limiter:
    limit_mib: 400
    spike_limit_mib: 100
    check_interval: 5s
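
Processors only take effect once they are listed in a pipeline, and their order matters. A sketch of a traces pipeline wiring the three processors from this section together, assuming the `otlp` receiver and `otlp/jaeger` exporter from the TLS example:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first so it can refuse data before anything is buffered;
      # batch last so only sampled data is batched and exported
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp/jaeger]
```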

Cost Optimization

Data Volume Reduction

  1. Sampling: Reduce trace volume
  2. Filtering: Drop unnecessary data
  3. Aggregation: Pre-aggregate metrics
  4. Retention: Reduce retention periods
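
Filtering can be done in the collector itself. The sketch below uses the `filter` processor from the collector's contrib distribution to drop health-check spans; the processor name suffix and the `/healthz` route are assumptions for illustration:

```yaml
processors:
  filter/drop-health-checks:
    error_mode: ignore  # keep processing if a condition fails to evaluate
    traces:
      span:
        # a span is dropped when any listed condition evaluates to true
        - 'attributes["http.route"] == "/healthz"'  # assumed route
```

As with any processor, it has to be added to the relevant pipeline under `service.pipelines` to take effect.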

Resource Optimization

  1. Right-sizing: Match resources to workload
  2. Auto-scaling: Scale based on demand
  3. Spot Instances: Use for non-critical workloads
  4. Reserved Instances: For predictable workloads

Compliance and Governance

Data Privacy

  1. PII Redaction: Remove sensitive data
  2. Data Masking: Mask sensitive fields
  3. Access Control: Restrict access to sensitive data
  4. Audit Logging: Log all access
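
Redaction and masking can be applied before data leaves the collector. A minimal sketch using the `attributes` processor, where the attribute keys are hypothetical examples of sensitive fields:

```yaml
processors:
  attributes/redact:
    actions:
      - key: user.email   # hypothetical attribute; removed entirely
        action: delete
      - key: enduser.id   # hypothetical attribute; value replaced with its hash
        action: hash
```

Deleting is the safer default for PII. Hashing keeps values correlatable across spans without exposing the raw value, which may or may not satisfy a given compliance requirement.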

Retention Policies

  1. Define Policies: Based on compliance requirements
  2. Automate Deletion: Use lifecycle policies
  3. Archive: Move old data to cold storage
  4. Documentation: Document retention policies

Further Reading