AI Observability in ConnectSoft Microservice Template¶
Purpose & Overview¶
AI Observability in the ConnectSoft Microservice Template provides comprehensive monitoring, logging, and tracing capabilities for AI services using Microsoft.Extensions.AI. This enables teams to gain deep insights into AI operations, including chat completions, embedding generation, tool invocations, and vector store operations, while maintaining full visibility into performance, costs, and errors.
AI Observability integrates seamlessly with the template's observability stack, leveraging OpenTelemetry, structured logging, and distributed tracing to provide end-to-end visibility into AI-powered operations.
Why AI Observability Matters¶
AI observability provides critical capabilities:
- Performance Monitoring: Track latency, throughput, and response times for AI operations
- Cost Tracking: Monitor token usage and API costs across all AI providers
- Error Detection: Identify and diagnose AI service failures and errors
- Usage Analytics: Understand AI service utilization patterns and trends
- Quality Assessment: Evaluate AI response quality and consistency
- Debugging: Trace AI operations through distributed systems
- Optimization: Identify opportunities for caching, batching, and cost reduction
AI Observability Philosophy
AI observability treats AI operations as first-class citizens in the observability stack. By instrumenting AI services with OpenTelemetry, structured logging, and distributed tracing, teams can understand, debug, and optimize AI-powered applications with the same rigor applied to traditional application components.
Architecture Overview¶
AI Observability Stack¶
AI Operations (Chat, Embeddings, Tools)
↓
Microsoft.Extensions.AI Middleware
├── OpenTelemetry Instrumentation
│ ├── Traces (Distributed Tracing)
│ ├── Metrics (Performance & Usage)
│ └── Logs (Structured Events)
├── Structured Logging
│ ├── Request/Response Logging
│ ├── Token Usage Tracking
│ └── Error Logging
└── Distributed Caching
├── Cache Hit/Miss Metrics
└── Cost Reduction Tracking
↓
OpenTelemetry SDK
├── ActivitySource: "Experimental.Microsoft.Extensions.AI*"
├── Meter: "Experimental.Microsoft.Extensions.AI*"
└── Log Categories: "Microsoft.Extensions.AI.*"
↓
Observability Backends
├── Jaeger, Zipkin (traces)
├── Prometheus, Grafana (metrics)
├── Seq, Application Insights (unified)
└── Other OTLP-compatible systems
OpenTelemetry Integration¶
Automatic Instrumentation¶
Microsoft.Extensions.AI automatically instruments all AI operations when OpenTelemetry middleware is enabled:
// MicrosoftExtensionsAIExtensions.cs
#if OpenTelemetry
chatClientBuilder.UseOpenTelemetry();
embeddingGeneratorClientBuilder.UseOpenTelemetry();
#endif
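For context, the following is a minimal, hand-written sketch of how these middleware calls compose on a `ChatClientBuilder`. It is not the template's exact code: `innerClient` is a placeholder for whichever provider client the template selects, and the ordering and options in the generated code may differ.

```csharp
// Hedged sketch of the Microsoft.Extensions.AI middleware chain (not the template's exact code).
using Microsoft.Extensions.AI;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Placeholder: the template resolves a provider-specific client here
// (OpenAI, Azure OpenAI, Ollama, or Azure AI Inference).
IChatClient innerClient = /* provider-specific IChatClient */ null!;

services.AddChatClient(innerClient)
    .UseDistributedCache()   // cache hit/miss observability and cost reduction
    .UseLogging()            // structured request/response/token logging
    .UseOpenTelemetry(configure: otel =>
        otel.EnableSensitiveData = false);   // keep prompts/completions out of telemetry by default
```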
ActivitySource Registration¶
AI operations are traced using the Experimental.Microsoft.Extensions.AI* ActivitySource:
// OpenTelemetryExtensions.cs
#if UseMicrosoftExtensionsAI
tracingBuilder
.AddSource("Experimental.Microsoft.Extensions.AI*");
#endif
What's Traced:
- Chat completion requests and responses
- Embedding generation requests
- Tool invocations (function calling)
- Vector store operations
- Token usage and costs
- Latency and errors
- Provider-specific operations (OpenAI, Azure OpenAI, Ollama, Azure AI Inference)
Meter Registration¶
AI metrics are collected using the Experimental.Microsoft.Extensions.AI* Meter:
// OpenTelemetryExtensions.cs
#if UseMicrosoftExtensionsAI
metricsBuilder
.AddMeter("Experimental.Microsoft.Extensions.AI*");
#endif
Metrics Collected:
- Request counts (by provider, model, operation type)
- Token usage (input tokens, output tokens, total tokens)
- Latency (request duration, p50, p95, p99)
- Error rates (by provider, error type)
- Cache hit/miss rates
- Cost estimates (when available)
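Taken together, a host-level wiring that subscribes to both the AI ActivitySource and Meter and exports over OTLP might look like the sketch below. The service name and exporter defaults are illustrative, not the template's actual configuration.

```csharp
// Illustrative host wiring (not the template's exact code): register the experimental
// AI ActivitySource and Meter and export both signals over OTLP.
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("OrderService"))   // example service name
    .WithTracing(tracing => tracing
        .AddSource("Experimental.Microsoft.Extensions.AI*")               // AI spans
        .AddOtlpExporter())
    .WithMetrics(metrics => metrics
        .AddMeter("Experimental.Microsoft.Extensions.AI*")                // AI metrics
        .AddOtlpExporter());

var app = builder.Build();
app.Run();
```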
Trace Attributes¶
AI traces include rich contextual information:
Common Attributes:
- ai.provider: Provider name (e.g., "openAI", "azureOpenAI", "ollama")
- ai.model: Model identifier (e.g., "gpt-4o", "text-embedding-3-small")
- ai.operation.type: Operation type (e.g., "chat.completion", "embedding.generation")
- ai.request.id: Unique request identifier
- ai.response.id: Response identifier (when available)
Chat Completion Attributes:
- ai.chat.message.count: Number of messages in the conversation
- ai.chat.tokens.input: Input token count
- ai.chat.tokens.output: Output token count
- ai.chat.tokens.total: Total token count
- ai.chat.finish_reason: Completion finish reason
- ai.chat.function_calls: Number of function calls (if any)
Embedding Attributes:
- ai.embedding.input.length: Input text length
- ai.embedding.dimensions: Embedding vector dimensions
- ai.embedding.tokens: Token count for embedding generation
Tool Invocation Attributes:
- ai.tool.name: Tool/function name
- ai.tool.arguments: Tool arguments (sanitized)
- ai.tool.result: Tool execution result (sanitized)
Error Attributes:
- error.type: Error type (e.g., "RateLimitError", "AuthenticationError")
- error.message: Error message (sanitized)
- error.stack_trace: Stack trace (in development only)
Example Trace¶
{
"traceId": "00-ab12cd34ef567890abcdef1234567890",
"spanId": "1a2b3c4d5e6f7890",
"name": "ai.chat.completion",
"kind": "CLIENT",
"startTime": "2025-01-15T17:22:58.123Z",
"endTime": "2025-01-15T17:22:59.456Z",
"duration": "1333ms",
"status": "OK",
"attributes": {
"ai.provider": "openAI",
"ai.model": "gpt-4o",
"ai.operation.type": "chat.completion",
"ai.request.id": "req-abc123",
"ai.chat.message.count": 3,
"ai.chat.tokens.input": 150,
"ai.chat.tokens.output": 75,
"ai.chat.tokens.total": 225,
"ai.chat.finish_reason": "stop",
"http.method": "POST",
"http.url": "https://api.openai.com/v1/chat/completions",
"http.status_code": 200
}
}
Structured Logging¶
Log Categories¶
AI operations are logged using structured logging with specific log categories:
- Microsoft.Extensions.AI.Chat: Chat completion operations
- Microsoft.Extensions.AI.Embeddings: Embedding generation operations
- Microsoft.Extensions.AI.Tools: Tool invocation operations
- Microsoft.Extensions.AI.VectorData: Vector store operations
Logging Middleware¶
Logging is automatically enabled for all AI operations:
// MicrosoftExtensionsAIExtensions.cs
chatClientBuilder.UseLogging();
embeddingGeneratorClientBuilder.UseLogging();
Logged Information¶
Request Logging:
- Provider and model information
- Request parameters (messages, options, etc.)
- Request timestamp
- Correlation IDs (trace ID, span ID)
Response Logging:
- Response content (truncated in production)
- Token usage (input, output, total)
- Response latency
- Finish reason
- Function calls (if any)
Error Logging:
- Error type and message
- Stack traces (in development)
- Retry attempts (if applicable)
- Provider-specific error codes
Example Log Entry¶
{
"timestamp": "2025-01-15T17:22:58.123Z",
"level": "Information",
"category": "Microsoft.Extensions.AI.Chat",
"message": "Chat completion request completed",
"ai.provider": "openAI",
"ai.model": "gpt-4o",
"ai.request.id": "req-abc123",
"ai.chat.tokens.input": 150,
"ai.chat.tokens.output": 75,
"ai.chat.tokens.total": 225,
"duration_ms": 1333,
"traceId": "00-ab12cd34ef567890abcdef1234567890",
"spanId": "1a2b3c4d5e6f7890",
"service": "OrderService",
"environment": "Production"
}
Log Levels¶
- Trace: Detailed diagnostic information (request/response payloads)
- Debug: Diagnostic information for debugging (token usage, timing)
- Information: General information about AI operations (request completion)
- Warning: Warning conditions (rate limits, retries)
- Error: Error conditions (API failures, authentication errors)
- Critical: Critical failures (service unavailability)
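In practice, per-category filters keep payload-heavy AI logging at lower levels in production while allowing more detail during debugging. A hedged example follows; the filter values are illustrative, not template defaults.

```csharp
// Illustrative per-category log filtering for AI operations (values are examples).
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);

// Baseline for all Microsoft.Extensions.AI.* categories (prefix match).
builder.Logging.AddFilter("Microsoft.Extensions.AI", LogLevel.Information);

// Allow more detail for chat operations while debugging.
builder.Logging.AddFilter("Microsoft.Extensions.AI.Chat", LogLevel.Debug);
```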
Metrics¶
Available Metrics¶
Microsoft.Extensions.AI exposes the following metrics:
Request Metrics:
- ai.requests.total: Total number of AI requests (counter)
- ai.requests.duration: Request duration in seconds (histogram)
- ai.requests.errors: Number of failed requests (counter)
Token Metrics:
- ai.tokens.input: Input token count (counter)
- ai.tokens.output: Output token count (counter)
- ai.tokens.total: Total token count (counter)
Cache Metrics:
- ai.cache.hits: Number of cache hits (counter)
- ai.cache.misses: Number of cache misses (counter)
- ai.cache.hit_rate: Cache hit rate (gauge)
Provider-Specific Metrics:
- Metrics tagged with ai.provider (openAI, azureOpenAI, ollama, azureAIInference)
- Metrics tagged with ai.model (model identifier)
- Metrics tagged with ai.operation.type (chat.completion, embedding.generation)
Metric Labels¶
All metrics include labels for filtering and aggregation:
- ai.provider: AI provider name
- ai.model: Model identifier
- ai.operation.type: Operation type
- service.name: Service name
- service.version: Service version
- deployment.environment: Environment name
Example Metrics Query (Prometheus)¶
# Total AI requests per provider
sum(rate(ai_requests_total[5m])) by (ai_provider)
# Average request duration by model (histogram sum/count)
sum(rate(ai_requests_duration_seconds_sum[5m])) by (ai_model) / sum(rate(ai_requests_duration_seconds_count[5m])) by (ai_model)
# Token usage rate
sum(rate(ai_tokens_total[5m])) by (ai_provider)
# Cache hit rate
sum(rate(ai_cache_hits_total[5m])) / (sum(rate(ai_cache_hits_total[5m])) + sum(rate(ai_cache_misses_total[5m])))
# Error rate by provider
sum(rate(ai_requests_errors_total[5m])) by (ai_provider) / sum(rate(ai_requests_total[5m])) by (ai_provider)
Distributed Caching¶
Cache Integration¶
Distributed caching is automatically enabled when configured:
// MicrosoftExtensionsAIExtensions.cs
#if (DistributedCacheInMemory || DistributedCacheRedis)
chatClientBuilder.UseDistributedCache();
embeddingGeneratorClientBuilder.UseDistributedCache();
#endif
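UseDistributedCache resolves an IDistributedCache from the service collection, so one of the cache implementations referenced by the template options must be registered. A minimal sketch follows; the two registrations are alternatives, and the Redis connection string is an example value.

```csharp
// Illustrative IDistributedCache registrations backing UseDistributedCache().
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// DistributedCacheInMemory: in-process cache, useful for development.
services.AddDistributedMemoryCache();

// DistributedCacheRedis: shared cache across instances
// (requires Microsoft.Extensions.Caching.StackExchangeRedis; connection string is an example).
services.AddStackExchangeRedisCache(options =>
    options.Configuration = "localhost:6379");
```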
Cache Observability¶
Caching operations are instrumented with observability:
Cache Metrics:
- Cache hit/miss rates
- Cache operation latency
- Cache size and eviction rates
Cache Logging:
- Cache hits (debug level)
- Cache misses (debug level)
- Cache errors (warning level)
Cache Traces:
- Cache lookup operations
- Cache write operations
- Cache eviction events
Cache Keys¶
Cache keys are generated based on:
- Chat Completions: message content, model, temperature, and other options
- Embeddings: input text, model, and options
Cache keys are hashed to ensure consistent and efficient caching.
Provider-Specific Observability¶
OpenAI¶
Traces:
- OpenAI API calls are traced with full request/response details
- Token usage is automatically captured
- Rate limit information is included in traces
Metrics:
- Request counts and latency
- Token usage (input, output, total)
- Error rates (including rate limit errors)
Logs:
- API request/response logging
- Token usage logging
- Rate limit warnings
Azure OpenAI¶
Traces:
- Azure OpenAI API calls are traced with deployment information
- Token usage and cost information
- Azure-specific metadata (deployment name, endpoint)
Metrics:
- Request counts by deployment
- Token usage and costs
- Error rates
Logs:
- Deployment-specific logging
- Azure authentication logging
- Cost tracking logs
Ollama¶
Traces:
- Local Ollama API calls are traced
- Model information and response times
- Local execution metrics
Metrics:
- Request counts and latency
- Model-specific metrics
- Local execution performance
Logs:
- Local API request/response logging
- Model loading and execution logs
Azure AI Inference¶
Traces:
- Azure AI Inference API calls are traced
- Model catalog information
- Inference-specific metadata
Metrics:
- Request counts by model
- Token usage and costs
- Error rates
Logs:
- Model catalog logging
- Inference request/response logging
- Cost tracking logs
Vector Store Observability¶
Vector Store Operations¶
Vector store operations are instrumented when UseVectorStore is enabled:
Traced Operations:
- Vector upsert operations
- Vector search operations
- Collection management operations
Metrics:
- Vector operation counts
- Search latency
- Vector store size
Logs:
- Vector operation logging
- Search query logging
- Collection management logging
Embedding Generator Integration¶
Vector store operations automatically include embedding generator observability:
- Embedding generation is traced as part of vector operations
- Token usage for embeddings is tracked
- Embedding generation latency is measured
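To make that nesting visible in practice, an application can open its own span around a search so the embedding-generation span produced by the middleware becomes its child. The sketch below is illustrative only: the ActivitySource name and the searchAsync delegate are placeholders, not template APIs, and the custom source must be registered with AddSource like any other.

```csharp
// Hedged sketch: wrap a vector search in an application span so the instrumented
// embedding-generation call nests underneath it in the trace.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

public sealed class VectorSearchExample
{
    private static readonly ActivitySource Source = new("OrderService.VectorSearch"); // example name

    // "searchAsync" stands in for the vector store's search call, which varies by store.
    public static async Task<IReadOnlyList<string>> SearchAsync(
        IEmbeddingGenerator<string, Embedding<float>> generator,
        Func<ReadOnlyMemory<float>, Task<IReadOnlyList<string>>> searchAsync,
        string query)
    {
        using var activity = Source.StartActivity("vector.search");
        activity?.SetTag("vector.query.length", query.Length);

        // The embedding call below is instrumented by the AI middleware,
        // so its span (and token usage) appears as a child of "vector.search".
        var embeddings = await generator.GenerateAsync(new[] { query });
        return await searchAsync(embeddings[0].Vector);
    }
}
```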
Best Practices¶
Do's¶
- Enable OpenTelemetry for AI Services
- Use Structured Logging
  - Logs automatically include trace context
  - Use log levels appropriately
  - Include relevant context in log messages
- Monitor Token Usage
  - Track token usage to control costs
  - Set up alerts for unexpected token usage
  - Optimize prompts to reduce token consumption
- Enable Caching
  - Use distributed caching to reduce costs
  - Monitor cache hit rates
  - Optimize cache keys for better hit rates
- Set Up Alerts
  - Alert on high error rates
  - Alert on rate limit errors
  - Alert on unexpected token usage
- Correlate AI Operations with Business Logic (see the sketch after this list)
  - Include business context in traces
  - Link AI operations to user actions
  - Track AI operations in business workflows
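A hedged example of the business-context practice above: enrich the ambient Activity with identifiers so AI spans in the same trace can be joined back to the user action. The tag names are illustrative, not template conventions.

```csharp
// Illustrative business-context tagging; tag names are examples, not template conventions.
using System.Diagnostics;

static void TagAiWorkflow(string orderId, string tenantId)
{
    // Tags land on the current span; AI child spans share the same trace id,
    // so observability backends can correlate them with this business context.
    Activity.Current?.SetTag("order.id", orderId);
    Activity.Current?.SetTag("tenant.id", tenantId);
    Activity.Current?.SetTag("ai.use_case", "order-summary");
}
```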
Don'ts¶
- Don't Log Sensitive Data
- Don't Over-Instrument
  - Avoid creating too many spans for simple operations
  - Use appropriate log levels
  - Don't log every token usage in production
- Don't Ignore Errors
  - Always log AI errors
  - Include error context in traces
  - Set up error alerts
- Don't Skip Cost Tracking
  - Monitor token usage
  - Track costs per provider
  - Set up cost alerts
Grafana Dashboards¶
AI Observability Dashboard¶
Create Grafana dashboards to visualize AI operations:
Key Panels:
- Request rate by provider
- Average latency by model
- Token usage over time
- Error rate by provider
- Cache hit rate
- Cost estimates (when available)
Example Dashboard Queries:
# Request rate by provider
sum(rate(ai_requests_total[5m])) by (ai_provider)
# Average latency by model (histogram sum/count)
sum(rate(ai_requests_duration_seconds_sum[5m])) by (ai_model) / sum(rate(ai_requests_duration_seconds_count[5m])) by (ai_model)
# Token usage rate
sum(rate(ai_tokens_total[5m])) by (ai_provider)
# Error rate
sum(rate(ai_requests_errors_total[5m])) by (ai_provider) / sum(rate(ai_requests_total[5m])) by (ai_provider)
Alerting¶
Recommended Alerts¶
- High Error Rate
  - Condition: Error rate > 5% for 5 minutes
  - Severity: Warning
  - Action: Investigate provider issues
- Rate Limit Errors
  - Condition: Rate limit errors > 0
  - Severity: Warning
  - Action: Review rate limits and implement backoff
- High Token Usage
  - Condition: Token usage > threshold
  - Severity: Info
  - Action: Review prompt optimization
- Slow Response Times
  - Condition: P95 latency > 5 seconds
  - Severity: Warning
  - Action: Investigate performance issues
- Cache Miss Rate
  - Condition: Cache hit rate < 50%
  - Severity: Info
  - Action: Review cache configuration
Troubleshooting¶
Issue: No AI Telemetry Appearing¶
Symptoms: AI traces, metrics, or logs not appearing in observability backend.
Solutions:
1. Verify OpenTelemetry Configuration: Ensure UseOpenTelemetry() is called on AI client builders
2. Check ActivitySource Registration: Verify Experimental.Microsoft.Extensions.AI* is registered
3. Check Meter Registration: Verify Experimental.Microsoft.Extensions.AI* meter is registered
4. Review Log Categories: Ensure log categories are not filtered
5. Check Exporter Configuration: Verify telemetry is being exported correctly
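As a quick local check (a suggestion of this guide, not part of the template), an ActivityListener can confirm that AI activities are being produced at all before digging into exporter configuration:

```csharp
// Quick diagnostic listener (development only): prints every AI activity as it completes.
using System.Diagnostics;

ActivitySource.AddActivityListener(new ActivityListener
{
    ShouldListenTo = source => source.Name.StartsWith("Experimental.Microsoft.Extensions.AI"),
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
    ActivityStopped = activity =>
        Console.WriteLine($"{activity.DisplayName}: {activity.Duration.TotalMilliseconds:F0} ms")
});
```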
Issue: Missing Token Usage Information¶
Symptoms: Token usage not appearing in traces or metrics.
Solutions:
1. Verify Provider Support: Some providers may not expose token usage
2. Check Response Parsing: Ensure responses are being parsed correctly
3. Review Logging Configuration: Token usage may be logged but not traced
Issue: High Observability Overhead¶
Symptoms: Performance degradation with AI observability enabled.
Solutions:
1. Enable Sampling: Use trace sampling to reduce overhead (see the sketch below)
2. Review Log Levels: Use appropriate log levels in production
3. Optimize Exports: Prefer efficient exporter transports (e.g., OTLP over gRPC rather than HTTP)
4. Batch Exports: Use batching when available
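For the sampling recommendation, a hedged sketch using OpenTelemetry's parent-based ratio sampler follows; the 10% ratio is an example value, not a template default.

```csharp
// Illustrative head-based sampling to cut tracing overhead; the ratio is an example value.
using OpenTelemetry.Trace;

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10)))
        .AddSource("Experimental.Microsoft.Extensions.AI*"));
```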
Related Documentation¶
- AI Extensions: Comprehensive guide to Microsoft.Extensions.AI integration
- OpenTelemetry: Detailed OpenTelemetry configuration and usage
- Logging: Structured logging with trace correlation
- Metrics: Metrics collection and instrumentation
- Distributed Tracing: Distributed tracing patterns
Summary¶
AI Observability in the ConnectSoft Microservice Template provides:
- ✅ OpenTelemetry Integration: Automatic instrumentation for all AI operations
- ✅ Structured Logging: Comprehensive logging with trace correlation
- ✅ Metrics Collection: Performance, usage, and cost metrics
- ✅ Distributed Tracing: End-to-end visibility into AI operations
- ✅ Provider Support: Observability for all AI providers (OpenAI, Azure OpenAI, Ollama, Azure AI Inference)
- ✅ Cache Observability: Monitoring of distributed caching operations
- ✅ Vector Store Observability: Instrumentation for vector store operations
- ✅ Cost Tracking: Token usage and cost monitoring
- ✅ Error Detection: Comprehensive error logging and tracing
- ✅ Performance Monitoring: Latency and throughput tracking
By leveraging AI observability, teams can:
- Monitor Performance: Track AI operation latency and throughput
- Control Costs: Monitor token usage and optimize spending
- Debug Issues: Trace AI operations through distributed systems
- Optimize Usage: Identify opportunities for caching and optimization
- Ensure Quality: Monitor AI response quality and consistency
- Maintain Reliability: Detect and respond to AI service issues quickly
AI observability is essential for building reliable, performant, and cost-effective AI-powered applications, providing the insights needed to understand, debug, and optimize AI operations at scale.