# OpenTelemetry Collector

## Overview
The OpenTelemetry Collector is a vendor-agnostic service that receives, processes, and exports telemetry data. It acts as a central hub for observability, decoupling applications from specific observability backends.
## Architecture

### Components
- Receivers: Receive telemetry data from various sources
    - OTLP (gRPC/HTTP)
    - Prometheus
    - Jaeger
    - Zipkin
- Processors: Process and transform telemetry data
    - Batch: Batch data for efficiency
    - Memory Limiter: Prevent memory exhaustion
    - Resource: Add/modify resource attributes
    - Attributes: Modify span/metric/log attributes
    - Sampling: Reduce data volume
- Exporters: Export data to backends
    - Grafana LGTM Stack (Loki, Tempo, Mimir)
    - Prometheus
    - Jaeger
    - Elasticsearch
    - Azure Monitor
    - And many more...
- Extensions: Provide additional functionality
    - Health Check: Health monitoring endpoint
    - zPages: Debugging interface
    - pprof: Performance profiling
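Extensions are configured at the top level and enabled under `service.extensions`. A minimal sketch enabling the three extensions listed above, using their documented default ports:

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # HTTP health/liveness endpoint
  zpages:
    endpoint: 0.0.0.0:55679   # in-process debugging pages
  pprof:
    endpoint: 0.0.0.0:1777    # Go pprof profiling endpoint

service:
  extensions: [health_check, zpages, pprof]
```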
### Data Flow
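Telemetry flows through the Collector in pipelines: one or more receivers accept the data, the configured processors run in order, and the result is fanned out to one or more exporters, which deliver it to the backends. Pipelines are defined per signal (traces, metrics, logs) in the `service` section of the configuration shown below.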
## Benefits
- Decoupling: Applications don't need to know about specific backends
- Centralized Processing: Process data once, export to multiple backends
- Flexibility: Easy to add/remove backends without code changes
- Consistency: Standardized telemetry format across services
- Performance: Batch processing and sampling reduce overhead
## Configuration

### Basic Configuration
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  debug:
    verbosity: detailed
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]          # the prometheus exporter handles metrics only
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```
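With this file saved as `config.yaml`, the Collector can typically be started with `otelcol --config=config.yaml` (the binary name and default config path vary by distribution, e.g. `otelcol-contrib` for the contrib build).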
## Deployment Patterns

### Sidecar Pattern

Deploy the collector as a sidecar container alongside each application instance.
Pros:
- Isolation
- Per-service configuration
- No network latency
Cons:
- Resource overhead
- More complex deployment
### Gateway Pattern

Deploy the collector as a centralized gateway service.
Pros:
- Resource efficient
- Centralized configuration
- Easier management
Cons:
- Single point of failure (mitigate with HA)
- Network latency
### Agent + Gateway Pattern

Deploy lightweight collector agents on each host and a gateway service in the cluster; the sketch after the pros and cons below illustrates the agent side.
Pros:
- Best of both worlds
- Scalable
- Flexible
Cons:
- More complex architecture
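As an illustration of the agent side of this pattern (a minimal sketch; the gateway hostname `otel-gateway` is a placeholder):

```yaml
# Agent-side configuration: receive telemetry locally and forward it to the gateway over OTLP.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  otlp:
    endpoint: otel-gateway:4317   # placeholder gateway address
    tls:
      insecure: true              # use proper TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

The gateway instances then run the heavier processing (sampling, attribute manipulation) and the exporters to the actual backends.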
## Best Practices

### Performance
- Batch Processing: Always use batch processor
- Memory Limits: Configure the memory limiter processor (see the sketch after this list)
- Sampling: Use sampling for high-volume traces
- Resource Allocation: Allocate sufficient CPU/memory
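A sketch of the memory limiter and sampling settings mentioned above, with illustrative values rather than recommendations:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024           # hard memory limit (illustrative)
    spike_limit_mib: 256
  probabilistic_sampler:
    sampling_percentage: 10   # keep roughly 10% of traces
  batch:
    timeout: 1s
    send_batch_size: 1024
```

The memory limiter is conventionally placed first in the processor chain so it can apply back-pressure before other processors run.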
### Reliability
- High Availability: Deploy multiple collector instances
- Queuing: Enable exporter queues for resilience (see the sketch after this list)
- Retries: Configure retry policies
- Health Checks: Monitor collector health
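Retries and queuing are configured per exporter through the standard exporter settings; a sketch for an OTLP exporter, with illustrative values and a placeholder endpoint:

```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend address
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
```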
### Security
- TLS: Use TLS for all connections (see the sketch after this list)
- Authentication: Implement authentication for backends
- Network Policies: Restrict network access
- Secrets Management: Use secrets for sensitive data
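A sketch of enabling TLS on the OTLP gRPC receiver; the certificate paths are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otelcol/certs/server.crt   # placeholder path
          key_file: /etc/otelcol/certs/server.key    # placeholder path
```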
## Monitoring

### Health Endpoint
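With the `health_check` extension enabled (see the extensions sketch in the Components section), the Collector serves an HTTP health endpoint, by default at http://localhost:13133, which can back Kubernetes liveness and readiness probes.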
### Metrics Endpoint
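The Collector also exposes its own internal metrics in Prometheus format, by default on port 8888. In the classic configuration style this is set under `service.telemetry.metrics` (a sketch; newer releases may configure internal telemetry differently):

```yaml
service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
```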
### zPages
Access the debugging interface at http://localhost:55679 (for example, the trace pages under /debug/tracez).
## Troubleshooting

### Common Issues
- High Memory Usage: Reduce batch sizes, enable sampling
- Data Loss: Enable exporter queues, check backend connectivity
- Slow Performance: Optimize processors, increase resources
- Configuration Errors: Validate config with `--dry-run`
### Debugging
- Enable debug logging: `LOG_LEVEL=debug`
- Use the debug exporter for local development
- Check zPages for internal state
- Monitor metrics endpoint
## Examples
- Grafana LGTM Stack Example: Complete setup with Loki, Tempo, and Mimir
- Custom Processors Example: Custom processor configuration
- Multiple Backends Example: Exporting to multiple backends