AI Observability in ConnectSoft Microservice Template¶
Purpose & Overview¶
AI Observability in the ConnectSoft Microservice Template provides comprehensive monitoring, logging, and tracing capabilities for AI services using Microsoft.Extensions.AI. This enables teams to gain deep insights into AI operations, including chat completions, embedding generation, tool invocations, and vector store operations, while maintaining full visibility into performance, costs, and errors.
AI Observability integrates seamlessly with the template's observability stack, leveraging OpenTelemetry, structured logging, and distributed tracing to provide end-to-end visibility into AI-powered operations.
Why AI Observability Matters¶
AI observability provides critical capabilities:
- Performance Monitoring: Track latency, throughput, and response times for AI operations
- Cost Tracking: Monitor token usage and API costs across all AI providers
- Error Detection: Identify and diagnose AI service failures and errors
- Usage Analytics: Understand AI service utilization patterns and trends
- Quality Assessment: Evaluate AI response quality and consistency
- Debugging: Trace AI operations through distributed systems
- Optimization: Identify opportunities for caching, batching, and cost reduction
AI Observability Philosophy
AI observability treats AI operations as first-class citizens in the observability stack. By instrumenting AI services with OpenTelemetry, structured logging, and distributed tracing, teams can understand, debug, and optimize AI-powered applications with the same rigor applied to traditional application components.
Architecture Overview¶
AI Observability Stack¶
AI Operations (Chat, Embeddings, Tools)
↓
Microsoft.Extensions.AI Middleware
├── OpenTelemetry Instrumentation
│ ├── Traces (Distributed Tracing)
│ ├── Metrics (Performance & Usage)
│ └── Logs (Structured Events)
├── Structured Logging
│ ├── Request/Response Logging
│ ├── Token Usage Tracking
│ └── Error Logging
└── Distributed Caching
├── Cache Hit/Miss Metrics
└── Cost Reduction Tracking
↓
OpenTelemetry SDK
├── ActivitySource: "Experimental.Microsoft.Extensions.AI*"
├── Meter: "Experimental.Microsoft.Extensions.AI*"
└── Log Categories: "Microsoft.Extensions.AI.*"
↓
Observability Backends
├── Jaeger, Zipkin (traces)
├── Prometheus, Grafana (metrics)
├── Seq, Application Insights (unified)
└── Other OTLP-compatible systems
OpenTelemetry Integration¶
Automatic Instrumentation¶
Microsoft.Extensions.AI automatically instruments all AI operations when OpenTelemetry middleware is enabled:
// MicrosoftExtensionsAIExtensions.cs
#if OpenTelemetry
chatClientBuilder.UseOpenTelemetry();
embeddingGeneratorClientBuilder.UseOpenTelemetry();
#endif
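For context, the following is a minimal, hand-written sketch of how these middleware calls compose on a `ChatClientBuilder`. It is not the template's exact code: `innerClient` is a placeholder for whichever provider client the template selects, and the ordering and options in the generated code may differ.

```csharp
// Hedged sketch of the Microsoft.Extensions.AI middleware chain (not the template's exact code).
using Microsoft.Extensions.AI;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Placeholder: the template resolves a provider-specific client here
// (OpenAI, Azure OpenAI, Ollama, or Azure AI Inference).
IChatClient innerClient = /* provider-specific IChatClient */ null!;

services.AddChatClient(innerClient)
    .UseDistributedCache()   // cache hit/miss observability and cost reduction
    .UseLogging()            // structured request/response/token logging
    .UseOpenTelemetry(configure: otel =>
        otel.EnableSensitiveData = false);   // keep prompts/completions out of telemetry by default
```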
ActivitySource Registration¶
AI operations are traced using the Experimental.Microsoft.Extensions.AI* ActivitySource:
// OpenTelemetryExtensions.cs
#if UseMicrosoftExtensionsAI
tracingBuilder
.AddSource("Experimental.Microsoft.Extensions.AI*");
#endif
What's Traced:
- Chat completion requests and responses
- Embedding generation requests
- Tool invocations (function calling)
- Vector store operations
- Token usage and costs
- Latency and errors
- Provider-specific operations (OpenAI, Azure OpenAI, Ollama, Azure AI Inference)
Meter Registration¶
AI metrics are collected using the Experimental.Microsoft.Extensions.AI* Meter:
// OpenTelemetryExtensions.cs
#if UseMicrosoftExtensionsAI
metricsBuilder
.AddMeter("Experimental.Microsoft.Extensions.AI*");
#endif
Metrics Collected:
- Request counts (by provider, model, operation type)
- Token usage (input tokens, output tokens, total tokens)
- Latency (request duration, p50, p95, p99)
- Error rates (by provider, error type)
- Cache hit/miss rates
- Cost estimates (when available)
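Taken together, a host-level wiring that subscribes to both the AI ActivitySource and Meter and exports over OTLP might look like the sketch below. The service name and exporter defaults are illustrative, not the template's actual configuration.

```csharp
// Illustrative host wiring (not the template's exact code): register the experimental
// AI ActivitySource and Meter and export both signals over OTLP.
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("OrderService"))   // example service name
    .WithTracing(tracing => tracing
        .AddSource("Experimental.Microsoft.Extensions.AI*")               // AI spans
        .AddOtlpExporter())
    .WithMetrics(metrics => metrics
        .AddMeter("Experimental.Microsoft.Extensions.AI*")                // AI metrics
        .AddOtlpExporter());

var app = builder.Build();
app.Run();
```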
Trace Attributes¶
AI traces include rich contextual information:
Common Attributes:
- ai.provider: Provider name (e.g., "openAI", "azureOpenAI", "ollama")
- ai.model: Model identifier (e.g., "gpt-4o", "text-embedding-3-small")
- ai.operation.type: Operation type (e.g., "chat.completion", "embedding.generation")
- ai.request.id: Unique request identifier
- ai.response.id: Response identifier (when available)
Chat Completion Attributes:
- ai.chat.message.count: Number of messages in the conversation
- ai.chat.tokens.input: Input token count
- ai.chat.tokens.output: Output token count
- ai.chat.tokens.total: Total token count
- ai.chat.finish_reason: Completion finish reason
- ai.chat.function_calls: Number of function calls (if any)
Embedding Attributes:
- ai.embedding.input.length: Input text length
- ai.embedding.dimensions: Embedding vector dimensions
- ai.embedding.tokens: Token count for embedding generation
Tool Invocation Attributes:
- ai.tool.name: Tool/function name
- ai.tool.arguments: Tool arguments (sanitized)
- ai.tool.result: Tool execution result (sanitized)
Error Attributes:
- error.type: Error type (e.g., "RateLimitError", "AuthenticationError")
- error.message: Error message (sanitized)
- error.stack_trace: Stack trace (in development only)
Example Trace¶
{
"traceId": "00-ab12cd34ef567890abcdef1234567890",
"spanId": "1a2b3c4d5e6f7890",
"name": "ai.chat.completion",
"kind": "CLIENT",
"startTime": "2025-01-15T17:22:58.123Z",
"endTime": "2025-01-15T17:22:59.456Z",
"duration": "1333ms",
"status": "OK",
"attributes": {
"ai.provider": "openAI",
"ai.model": "gpt-4o",
"ai.operation.type": "chat.completion",
"ai.request.id": "req-abc123",
"ai.chat.message.count": 3,
"ai.chat.tokens.input": 150,
"ai.chat.tokens.output": 75,
"ai.chat.tokens.total": 225,
"ai.chat.finish_reason": "stop",
"http.method": "POST",
"http.url": "https://api.openai.com/v1/chat/completions",
"http.status_code": 200
}
}
Structured Logging¶
Log Categories¶
AI operations are logged using structured logging with specific log categories:
- Microsoft.Extensions.AI.Chat: Chat completion operations
- Microsoft.Extensions.AI.Embeddings: Embedding generation operations
- Microsoft.Extensions.AI.Tools: Tool invocation operations
- Microsoft.Extensions.AI.VectorData: Vector store operations
Logging Middleware¶
Logging is automatically enabled for all AI operations:
// MicrosoftExtensionsAIExtensions.cs
chatClientBuilder.UseLogging();
embeddingGeneratorClientBuilder.UseLogging();
Logged Information¶
Request Logging:
- Provider and model information
- Request parameters (messages, options, etc.)
- Request timestamp
- Correlation IDs (trace ID, span ID)
Response Logging:
- Response content (truncated in production)
- Token usage (input, output, total)
- Response latency
- Finish reason
- Function calls (if any)
Error Logging:
- Error type and message
- Stack traces (in development)
- Retry attempts (if applicable)
- Provider-specific error codes
Example Log Entry¶
{
"timestamp": "2025-01-15T17:22:58.123Z",
"level": "Information",
"category": "Microsoft.Extensions.AI.Chat",
"message": "Chat completion request completed",
"ai.provider": "openAI",
"ai.model": "gpt-4o",
"ai.request.id": "req-abc123",
"ai.chat.tokens.input": 150,
"ai.chat.tokens.output": 75,
"ai.chat.tokens.total": 225,
"duration_ms": 1333,
"traceId": "00-ab12cd34ef567890abcdef1234567890",
"spanId": "1a2b3c4d5e6f7890",
"service": "OrderService",
"environment": "Production"
}
Log Levels¶
- Trace: Detailed diagnostic information (request/response payloads)
- Debug: Diagnostic information for debugging (token usage, timing)
- Information: General information about AI operations (request completion)
- Warning: Warning conditions (rate limits, retries)
- Error: Error conditions (API failures, authentication errors)
- Critical: Critical failures (service unavailability)
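In practice, per-category filters keep payload-heavy AI logging at lower levels in production while allowing more detail during debugging. A hedged example follows; the filter values are illustrative, not template defaults.

```csharp
// Illustrative per-category log filtering for AI operations (values are examples).
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);

// Baseline for all Microsoft.Extensions.AI.* categories (prefix match).
builder.Logging.AddFilter("Microsoft.Extensions.AI", LogLevel.Information);

// Allow more detail for chat operations while debugging.
builder.Logging.AddFilter("Microsoft.Extensions.AI.Chat", LogLevel.Debug);
```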
Metrics¶
Available Metrics¶
Microsoft.Extensions.AI exposes the following metrics:
Request Metrics:
- ai.requests.total: Total number of AI requests (counter)
- ai.requests.duration: Request duration in seconds (histogram)
- ai.requests.errors: Number of failed requests (counter)
Token Metrics:
- ai.tokens.input: Input token count (counter)
- ai.tokens.output: Output token count (counter)
- ai.tokens.total: Total token count (counter)
Cache Metrics:
- ai.cache.hits: Number of cache hits (counter)
- ai.cache.misses: Number of cache misses (counter)
- ai.cache.hit_rate: Cache hit rate (gauge)
Provider-Specific Metrics:
- Metrics tagged with ai.provider (openAI, azureOpenAI, ollama, azureAIInference)
- Metrics tagged with ai.model (model identifier)
- Metrics tagged with ai.operation.type (chat.completion, embedding.generation)
Metric Labels¶
All metrics include labels for filtering and aggregation:
- ai.provider: AI provider name
- ai.model: Model identifier
- ai.operation.type: Operation type
- service.name: Service name
- service.version: Service version
- deployment.environment: Environment name
Example Metrics Query (Prometheus)¶
# Total AI requests per provider
sum(rate(ai_requests_total[5m])) by (ai_provider)
# Average request duration by model (histogram sum/count)
sum(rate(ai_requests_duration_seconds_sum[5m])) by (ai_model) / sum(rate(ai_requests_duration_seconds_count[5m])) by (ai_model)
# Token usage rate
sum(rate(ai_tokens_total[5m])) by (ai_provider)
# Cache hit rate
sum(rate(ai_cache_hits_total[5m])) / (sum(rate(ai_cache_hits_total[5m])) + sum(rate(ai_cache_misses_total[5m])))
# Error rate by provider
sum(rate(ai_requests_errors_total[5m])) by (ai_provider) / sum(rate(ai_requests_total[5m])) by (ai_provider)
Distributed Caching¶
Cache Integration¶
Distributed caching is automatically enabled when configured:
// MicrosoftExtensionsAIExtensions.cs
#if (DistributedCacheInMemory || DistributedCacheRedis)
chatClientBuilder.UseDistributedCache();
embeddingGeneratorClientBuilder.UseDistributedCache();
#endif
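UseDistributedCache resolves an IDistributedCache from the service collection, so one of the cache implementations referenced by the template options must be registered. A minimal sketch follows; the two registrations are alternatives, and the Redis connection string is an example value.

```csharp
// Illustrative IDistributedCache registrations backing UseDistributedCache().
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// DistributedCacheInMemory: in-process cache, useful for development.
services.AddDistributedMemoryCache();

// DistributedCacheRedis: shared cache across instances
// (requires Microsoft.Extensions.Caching.StackExchangeRedis; connection string is an example).
services.AddStackExchangeRedisCache(options =>
    options.Configuration = "localhost:6379");
```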
Cache Observability¶
Caching operations are instrumented with observability:
Cache Metrics:
- Cache hit/miss rates
- Cache operation latency
- Cache size and eviction rates
Cache Logging:
- Cache hits (debug level)
- Cache misses (debug level)
- Cache errors (warning level)
Cache Traces:
- Cache lookup operations
- Cache write operations
- Cache eviction events
Cache Keys¶
Cache keys are generated based on:
- Chat Completions: message content, model, temperature, and other options
- Embeddings: input text, model, and options
Cache keys are hashed to ensure consistent and efficient caching.
Provider-Specific Observability¶
OpenAI¶
Traces:
- OpenAI API calls are traced with full request/response details
- Token usage is automatically captured
- Rate limit information is included in traces
Metrics:
- Request counts and latency
- Token usage (input, output, total)
- Error rates (including rate limit errors)
Logs:
- API request/response logging
- Token usage logging
- Rate limit warnings
Azure OpenAI¶
Traces:
- Azure OpenAI API calls are traced with deployment information
- Token usage and cost information
- Azure-specific metadata (deployment name, endpoint)
Metrics:
- Request counts by deployment
- Token usage and costs
- Error rates
Logs:
- Deployment-specific logging
- Azure authentication logging
- Cost tracking logs
Ollama¶
Traces:
- Local Ollama API calls are traced
- Model information and response times
- Local execution metrics
Metrics:
- Request counts and latency
- Model-specific metrics
- Local execution performance
Logs:
- Local API request/response logging
- Model loading and execution logs
Azure AI Inference¶
Traces:
- Azure AI Inference API calls are traced
- Model catalog information
- Inference-specific metadata
Metrics:
- Request counts by model
- Token usage and costs
- Error rates
Logs:
- Model catalog logging
- Inference request/response logging
- Cost tracking logs
Vector Store Observability¶
Vector Store Operations¶
Vector store operations are instrumented when UseVectorStore is enabled:
Traced Operations:
- Vector upsert operations
- Vector search operations
- Collection management operations
Metrics:
- Vector operation counts
- Search latency
- Vector store size
Logs:
- Vector operation logging
- Search query logging
- Collection management logging
Embedding Generator Integration¶
Vector store operations automatically include embedding generator observability:
- Embedding generation is traced as part of vector operations
- Token usage for embeddings is tracked
- Embedding generation latency is measured
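To make that nesting visible in practice, an application can open its own span around a search so the embedding-generation span produced by the middleware becomes its child. The sketch below is illustrative only: the ActivitySource name and the searchAsync delegate are placeholders, not template APIs, and the custom source must be registered with AddSource like any other.

```csharp
// Hedged sketch: wrap a vector search in an application span so the instrumented
// embedding-generation call nests underneath it in the trace.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

public sealed class VectorSearchExample
{
    private static readonly ActivitySource Source = new("OrderService.VectorSearch"); // example name

    // "searchAsync" stands in for the vector store's search call, which varies by store.
    public static async Task<IReadOnlyList<string>> SearchAsync(
        IEmbeddingGenerator<string, Embedding<float>> generator,
        Func<ReadOnlyMemory<float>, Task<IReadOnlyList<string>>> searchAsync,
        string query)
    {
        using var activity = Source.StartActivity("vector.search");
        activity?.SetTag("vector.query.length", query.Length);

        // The embedding call below is instrumented by the AI middleware,
        // so its span (and token usage) appears as a child of "vector.search".
        var embeddings = await generator.GenerateAsync(new[] { query });
        return await searchAsync(embeddings[0].Vector);
    }
}
```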
Best Practices¶
Do's¶
- Enable OpenTelemetry for AI Services
- Use Structured Logging
  - Logs automatically include trace context
  - Use log levels appropriately
  - Include relevant context in log messages
- Monitor Token Usage
  - Track token usage to control costs
  - Set up alerts for unexpected token usage
  - Optimize prompts to reduce token consumption
- Enable Caching
  - Use distributed caching to reduce costs
  - Monitor cache hit rates
  - Optimize cache keys for better hit rates
- Set Up Alerts
  - Alert on high error rates
  - Alert on rate limit errors
  - Alert on unexpected token usage
- Correlate AI Operations with Business Logic (see the sketch after this list)
  - Include business context in traces
  - Link AI operations to user actions
  - Track AI operations in business workflows
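A hedged example of the business-context practice above: enrich the ambient Activity with identifiers so AI spans in the same trace can be joined back to the user action. The tag names are illustrative, not template conventions.

```csharp
// Illustrative business-context tagging; tag names are examples, not template conventions.
using System.Diagnostics;

static void TagAiWorkflow(string orderId, string tenantId)
{
    // Tags land on the current span; AI child spans share the same trace id,
    // so observability backends can correlate them with this business context.
    Activity.Current?.SetTag("order.id", orderId);
    Activity.Current?.SetTag("tenant.id", tenantId);
    Activity.Current?.SetTag("ai.use_case", "order-summary");
}
```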
Don'ts¶
- Don't Log Sensitive Data
- Don't Over-Instrument
  - Avoid creating too many spans for simple operations
  - Use appropriate log levels
  - Don't log every token usage in production
- Don't Ignore Errors
  - Always log AI errors
  - Include error context in traces
  - Set up error alerts
- Don't Skip Cost Tracking
  - Monitor token usage
  - Track costs per provider
  - Set up cost alerts
Grafana Dashboards¶
AI Observability Dashboard¶
Create Grafana dashboards to visualize AI operations:
Key Panels:
- Request rate by provider
- Average latency by model
- Token usage over time
- Error rate by provider
- Cache hit rate
- Cost estimates (when available)
Example Dashboard Queries:
# Request rate by provider
sum(rate(ai_requests_total[5m])) by (ai_provider)
# Average latency by model (histogram sum/count)
sum(rate(ai_requests_duration_seconds_sum[5m])) by (ai_model) / sum(rate(ai_requests_duration_seconds_count[5m])) by (ai_model)
# Token usage rate
sum(rate(ai_tokens_total[5m])) by (ai_provider)
# Error rate
sum(rate(ai_requests_errors_total[5m])) by (ai_provider) / sum(rate(ai_requests_total[5m])) by (ai_provider)
Alerting¶
Recommended Alerts¶
- High Error Rate
  - Condition: Error rate > 5% for 5 minutes
  - Severity: Warning
  - Action: Investigate provider issues
- Rate Limit Errors
  - Condition: Rate limit errors > 0
  - Severity: Warning
  - Action: Review rate limits and implement backoff
- High Token Usage
  - Condition: Token usage > threshold
  - Severity: Info
  - Action: Review prompt optimization
- Slow Response Times
  - Condition: P95 latency > 5 seconds
  - Severity: Warning
  - Action: Investigate performance issues
- Cache Miss Rate
  - Condition: Cache hit rate < 50%
  - Severity: Info
  - Action: Review cache configuration
Troubleshooting¶
Issue: No AI Telemetry Appearing¶
Symptoms: AI traces, metrics, or logs not appearing in observability backend.
Solutions:
1. Verify OpenTelemetry Configuration: Ensure UseOpenTelemetry() is called on AI client builders
2. Check ActivitySource Registration: Verify Experimental.Microsoft.Extensions.AI* is registered
3. Check Meter Registration: Verify Experimental.Microsoft.Extensions.AI* meter is registered
4. Review Log Categories: Ensure log categories are not filtered
5. Check Exporter Configuration: Verify telemetry is being exported correctly
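As a quick local check (a suggestion of this guide, not part of the template), an ActivityListener can confirm that AI activities are being produced at all before digging into exporter configuration:

```csharp
// Quick diagnostic listener (development only): prints every AI activity as it completes.
using System.Diagnostics;

ActivitySource.AddActivityListener(new ActivityListener
{
    ShouldListenTo = source => source.Name.StartsWith("Experimental.Microsoft.Extensions.AI"),
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
    ActivityStopped = activity =>
        Console.WriteLine($"{activity.DisplayName}: {activity.Duration.TotalMilliseconds:F0} ms")
});
```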
Issue: Missing Token Usage Information¶
Symptoms: Token usage not appearing in traces or metrics.
Solutions:
1. Verify Provider Support: Some providers may not expose token usage
2. Check Response Parsing: Ensure responses are being parsed correctly
3. Review Logging Configuration: Token usage may be logged but not traced
Issue: High Observability Overhead¶
Symptoms: Performance degradation with AI observability enabled.
Solutions:
1. Enable Sampling: Use trace sampling to reduce overhead (see the sketch below)
2. Review Log Levels: Use appropriate log levels in production
3. Optimize Exports: Prefer efficient exporter transports (e.g., OTLP over gRPC rather than HTTP)
4. Batch Exports: Use batching when available
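For the sampling recommendation, a hedged sketch using OpenTelemetry's parent-based ratio sampler follows; the 10% ratio is an example value, not a template default.

```csharp
// Illustrative head-based sampling to cut tracing overhead; the ratio is an example value.
using OpenTelemetry.Trace;

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10)))
        .AddSource("Experimental.Microsoft.Extensions.AI*"));
```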
Related Documentation¶
- AI Extensions: Comprehensive guide to Microsoft.Extensions.AI integration
- OpenTelemetry: Detailed OpenTelemetry configuration and usage
- Logging: Structured logging with trace correlation
- Metrics: Metrics collection and instrumentation
- Distributed Tracing: Distributed tracing patterns
Summary¶
AI Observability in the ConnectSoft Microservice Template provides:
- ✅ OpenTelemetry Integration: Automatic instrumentation for all AI operations
- ✅ Structured Logging: Comprehensive logging with trace correlation
- ✅ Metrics Collection: Performance, usage, and cost metrics
- ✅ Distributed Tracing: End-to-end visibility into AI operations
- ✅ Provider Support: Observability for all AI providers (OpenAI, Azure OpenAI, Ollama, Azure AI Inference)
- ✅ Cache Observability: Monitoring of distributed caching operations
- ✅ Vector Store Observability: Instrumentation for vector store operations
- ✅ Cost Tracking: Token usage and cost monitoring
- ✅ Error Detection: Comprehensive error logging and tracing
- ✅ Performance Monitoring: Latency and throughput tracking
By leveraging AI observability, teams can:
- Monitor Performance: Track AI operation latency and throughput
- Control Costs: Monitor token usage and optimize spending
- Debug Issues: Trace AI operations through distributed systems
- Optimize Usage: Identify opportunities for caching and optimization
- Ensure Quality: Monitor AI response quality and consistency
- Maintain Reliability: Detect and respond to AI service issues quickly
AI observability is essential for building reliable, performant, and cost-effective AI-powered applications, providing the insights needed to understand, debug, and optimize AI operations at scale.