AI Observability in ConnectSoft Microservice Template

Purpose & Overview

AI Observability in the ConnectSoft Microservice Template provides comprehensive monitoring, logging, and tracing capabilities for AI services using Microsoft.Extensions.AI. This enables teams to gain deep insights into AI operations, including chat completions, embedding generation, tool invocations, and vector store operations, while maintaining full visibility into performance, costs, and errors.

AI Observability integrates seamlessly with the template's observability stack, leveraging OpenTelemetry, structured logging, and distributed tracing to provide end-to-end visibility into AI-powered operations.

Why AI Observability Matters

AI observability provides critical capabilities:

  • Performance Monitoring: Track latency, throughput, and response times for AI operations
  • Cost Tracking: Monitor token usage and API costs across all AI providers
  • Error Detection: Identify and diagnose AI service failures and errors
  • Usage Analytics: Understand AI service utilization patterns and trends
  • Quality Assessment: Evaluate AI response quality and consistency
  • Debugging: Trace AI operations through distributed systems
  • Optimization: Identify opportunities for caching, batching, and cost reduction

AI Observability Philosophy

AI observability treats AI operations as first-class citizens in the observability stack. By instrumenting AI services with OpenTelemetry, structured logging, and distributed tracing, teams can understand, debug, and optimize AI-powered applications with the same rigor applied to traditional application components.

Architecture Overview

AI Observability Stack

AI Operations (Chat, Embeddings, Tools)
Microsoft.Extensions.AI Middleware
    ├── OpenTelemetry Instrumentation
    │   ├── Traces (Distributed Tracing)
    │   ├── Metrics (Performance & Usage)
    │   └── Logs (Structured Events)
    ├── Structured Logging
    │   ├── Request/Response Logging
    │   ├── Token Usage Tracking
    │   └── Error Logging
    └── Distributed Caching
        ├── Cache Hit/Miss Metrics
        └── Cost Reduction Tracking
OpenTelemetry SDK
    ├── ActivitySource: "Experimental.Microsoft.Extensions.AI*"
    ├── Meter: "Experimental.Microsoft.Extensions.AI*"
    └── Log Categories: "Microsoft.Extensions.AI.*"
Observability Backends
    ├── Jaeger, Zipkin (traces)
    ├── Prometheus, Grafana (metrics)
    ├── Seq, Application Insights (unified)
    └── Other OTLP-compatible systems

OpenTelemetry Integration

Automatic Instrumentation

Microsoft.Extensions.AI automatically instruments all AI operations when OpenTelemetry middleware is enabled:

// MicrosoftExtensionsAIExtensions.cs
#if OpenTelemetry
    chatClientBuilder.UseOpenTelemetry();
    embeddingGeneratorClientBuilder.UseOpenTelemetry();
#endif
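
For reference, a complete chat client registration with the full middleware pipeline might look like the sketch below. This is a minimal, illustrative sketch: CreateInnerChatClient is a hypothetical factory standing in for whichever provider adapter (OpenAI, Azure OpenAI, Ollama, or Azure AI Inference) the template selects, and the middleware set mirrors the template's conditional blocks rather than reproducing its exact code.

// Program.cs (illustrative sketch, not the template's exact code)
using Microsoft.Extensions.AI;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddChatClient(services => CreateInnerChatClient(services)) // hypothetical provider factory
    .UseDistributedCache()   // cache responses when a distributed cache is registered
    .UseFunctionInvocation() // enable tool/function calling
    .UseOpenTelemetry()      // emit traces and metrics for every AI request
    .UseLogging();           // emit structured log events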

ActivitySource Registration

AI operations are traced using the Experimental.Microsoft.Extensions.AI* ActivitySource:

// OpenTelemetryExtensions.cs
#if UseMicrosoftExtensionsAI
tracingBuilder
    .AddSource("Experimental.Microsoft.Extensions.AI*");
#endif

What's Traced:

  • Chat completion requests and responses
  • Embedding generation requests
  • Tool invocations (function calling)
  • Vector store operations
  • Token usage and costs
  • Latency and errors
  • Provider-specific operations (OpenAI, Azure OpenAI, Ollama, Azure AI Inference)

Meter Registration

AI metrics are collected using the Experimental.Microsoft.Extensions.AI* Meter:

// OpenTelemetryExtensions.cs
#if UseMicrosoftExtensionsAI
metricsBuilder
    .AddMeter("Experimental.Microsoft.Extensions.AI*");
#endif
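
Taken together, a hedged sketch of the registration that feeds AI spans and metrics into the OpenTelemetry pipeline (using the OpenTelemetry.Extensions.Hosting APIs) might look like the following; the OTLP exporter is an assumption and should match whatever backend the service actually exports to.

// Program.cs (sketch) - route the AI ActivitySource and Meter through OpenTelemetry.
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddSource("Experimental.Microsoft.Extensions.AI*") // AI spans (chat, embeddings, tools)
        .AddOtlpExporter())                                  // assumed OTLP-compatible backend
    .WithMetrics(metrics => metrics
        .AddMeter("Experimental.Microsoft.Extensions.AI*")  // AI metrics (tokens, latency, errors)
        .AddOtlpExporter());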

Metrics Collected:

  • Request counts (by provider, model, operation type)
  • Token usage (input tokens, output tokens, total tokens)
  • Latency (request duration, p50, p95, p99)
  • Error rates (by provider, error type)
  • Cache hit/miss rates
  • Cost estimates (when available)

Trace Attributes

AI traces include rich contextual information:

Common Attributes:

  • ai.provider: Provider name (e.g., "openAI", "azureOpenAI", "ollama")
  • ai.model: Model identifier (e.g., "gpt-4o", "text-embedding-3-small")
  • ai.operation.type: Operation type (e.g., "chat.completion", "embedding.generation")
  • ai.request.id: Unique request identifier
  • ai.response.id: Response identifier (when available)

Chat Completion Attributes:

  • ai.chat.message.count: Number of messages in the conversation
  • ai.chat.tokens.input: Input token count
  • ai.chat.tokens.output: Output token count
  • ai.chat.tokens.total: Total token count
  • ai.chat.finish_reason: Completion finish reason
  • ai.chat.function_calls: Number of function calls (if any)

Embedding Attributes:

  • ai.embedding.input.length: Input text length
  • ai.embedding.dimensions: Embedding vector dimensions
  • ai.embedding.tokens: Token count for embedding generation

Tool Invocation Attributes:

  • ai.tool.name: Tool/function name
  • ai.tool.arguments: Tool arguments (sanitized)
  • ai.tool.result: Tool execution result (sanitized)

Error Attributes:

  • error.type: Error type (e.g., "RateLimitError", "AuthenticationError")
  • error.message: Error message (sanitized)
  • error.stack_trace: Stack trace (in development only)

Example Trace

{
  "traceId": "00-ab12cd34ef567890abcdef1234567890",
  "spanId": "1a2b3c4d5e6f7890",
  "name": "ai.chat.completion",
  "kind": "CLIENT",
  "startTime": "2025-01-15T17:22:58.123Z",
  "endTime": "2025-01-15T17:22:59.456Z",
  "duration": "1333ms",
  "status": "OK",
  "attributes": {
    "ai.provider": "openAI",
    "ai.model": "gpt-4o",
    "ai.operation.type": "chat.completion",
    "ai.request.id": "req-abc123",
    "ai.chat.message.count": 3,
    "ai.chat.tokens.input": 150,
    "ai.chat.tokens.output": 75,
    "ai.chat.tokens.total": 225,
    "ai.chat.finish_reason": "stop",
    "http.method": "POST",
    "http.url": "https://api.openai.com/v1/chat/completions",
    "http.status_code": 200
  }
}

Structured Logging

Log Categories

AI operations are logged using structured logging with specific log categories:

  • Microsoft.Extensions.AI.Chat: Chat completion operations
  • Microsoft.Extensions.AI.Embeddings: Embedding generation operations
  • Microsoft.Extensions.AI.Tools: Tool invocation operations
  • Microsoft.Extensions.AI.VectorData: Vector store operations

Logging Middleware

Logging is automatically enabled for all AI operations:

// MicrosoftExtensionsAIExtensions.cs
chatClientBuilder.UseLogging();
embeddingGeneratorClientBuilder.UseLogging();

Logged Information

Request Logging:

  • Provider and model information
  • Request parameters (messages, options, etc.)
  • Request timestamp
  • Correlation IDs (trace ID, span ID)

Response Logging:

  • Response content (truncated in production)
  • Token usage (input, output, total)
  • Response latency
  • Finish reason
  • Function calls (if any)

Error Logging:

  • Error type and message
  • Stack traces (in development)
  • Retry attempts (if applicable)
  • Provider-specific error codes

Example Log Entry

{
  "timestamp": "2025-01-15T17:22:58.123Z",
  "level": "Information",
  "category": "Microsoft.Extensions.AI.Chat",
  "message": "Chat completion request completed",
  "ai.provider": "openAI",
  "ai.model": "gpt-4o",
  "ai.request.id": "req-abc123",
  "ai.chat.tokens.input": 150,
  "ai.chat.tokens.output": 75,
  "ai.chat.tokens.total": 225,
  "duration_ms": 1333,
  "traceId": "00-ab12cd34ef567890abcdef1234567890",
  "spanId": "1a2b3c4d5e6f7890",
  "service": "OrderService",
  "environment": "Production"
}

Log Levels

  • Trace: Detailed diagnostic information (request/response payloads)
  • Debug: Diagnostic information for debugging (token usage, timing)
  • Information: General information about AI operations (request completion)
  • Warning: Warning conditions (rate limits, retries)
  • Error: Error conditions (API failures, authentication errors)
  • Critical: Critical failures (service unavailability)
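
Verbosity for these categories can be tuned with standard .NET logging filters. A minimal sketch, assuming the Microsoft.Extensions.AI category prefix shown above (exact category names may vary between library versions):

// Program.cs (sketch) - more verbose AI logging in development, less in production.
builder.Logging.AddFilter(
    "Microsoft.Extensions.AI",
    builder.Environment.IsDevelopment() ? LogLevel.Debug : LogLevel.Information);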

Metrics

Available Metrics

Microsoft.Extensions.AI exposes the following metrics:

Request Metrics:

  • ai.requests.total: Total number of AI requests (counter)
  • ai.requests.duration: Request duration (histogram)
  • ai.requests.errors: Number of failed requests (counter)

Token Metrics:

  • ai.tokens.input: Input token count (counter)
  • ai.tokens.output: Output token count (counter)
  • ai.tokens.total: Total token count (counter)

Cache Metrics:

  • ai.cache.hits: Number of cache hits (counter)
  • ai.cache.misses: Number of cache misses (counter)
  • ai.cache.hit_rate: Cache hit rate (gauge)

Provider-Specific Metrics:

  • Metrics tagged with ai.provider (openAI, azureOpenAI, ollama, azureAIInference)
  • Metrics tagged with ai.model (model identifier)
  • Metrics tagged with ai.operation.type (chat.completion, embedding.generation)

Metric Labels

All metrics include labels for filtering and aggregation:

  • ai.provider: AI provider name
  • ai.model: Model identifier
  • ai.operation.type: Operation type
  • service.name: Service name
  • service.version: Service version
  • deployment.environment: Environment name
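
The service.name, service.version, and deployment.environment labels come from the OpenTelemetry resource rather than from the AI middleware. A hedged sketch of setting them (the service name and version shown are placeholders):

// Program.cs (sketch) - resource attributes that appear as labels on every AI metric and span.
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService(serviceName: "OrderService", serviceVersion: "1.0.0") // placeholders
        .AddAttributes(new[]
        {
            new KeyValuePair<string, object>("deployment.environment", builder.Environment.EnvironmentName),
        }));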

Example Metrics Query (Prometheus)

# Total AI requests per provider
sum(rate(ai_requests_total[5m])) by (ai_provider)

# Average request duration by model
avg(ai_requests_duration_seconds) by (ai_model)

# Token usage rate
sum(rate(ai_tokens_total[5m])) by (ai_provider)

# Cache hit rate
sum(rate(ai_cache_hits_total[5m])) / (sum(rate(ai_cache_hits_total[5m])) + sum(rate(ai_cache_misses_total[5m])))

# Error rate by provider
sum(rate(ai_requests_errors_total[5m])) by (ai_provider) / sum(rate(ai_requests_total[5m])) by (ai_provider)

Distributed Caching

Cache Integration

Distributed caching is automatically enabled when configured:

// MicrosoftExtensionsAIExtensions.cs
#if (DistributedCacheInMemory || DistributedCacheRedis)
    chatClientBuilder.UseDistributedCache();
    embeddingGeneratorClientBuilder.UseDistributedCache();
#endif
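
UseDistributedCache() relies on an IDistributedCache registered in dependency injection, so one of the template's cache options must also be wired up. A minimal sketch of the two variants (the "redis" connection string name is a placeholder):

// Program.cs (sketch) - register the IDistributedCache that UseDistributedCache() resolves.

// In-memory variant (single instance, handy for development):
builder.Services.AddDistributedMemoryCache();

// Redis variant (shared across instances; "redis" is a placeholder connection string name):
builder.Services.AddStackExchangeRedisCache(options =>
    options.Configuration = builder.Configuration.GetConnectionString("redis"));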

Cache Observability

Caching operations are instrumented with observability:

Cache Metrics:

  • Cache hit/miss rates
  • Cache operation latency
  • Cache size and eviction rates

Cache Logging:

  • Cache hits (debug level)
  • Cache misses (debug level)
  • Cache errors (warning level)

Cache Traces:

  • Cache lookup operations
  • Cache write operations
  • Cache eviction events

Cache Keys

Cache keys are generated based on:

  • Chat Completions: Messages content, model, temperature, and other options
  • Embeddings: Input text, model, and options

Cache keys are hashed to ensure consistent and efficient caching.

Provider-Specific Observability

OpenAI

Traces:

  • OpenAI API calls are traced with full request/response details
  • Token usage is automatically captured
  • Rate limit information is included in traces

Metrics:

  • Request counts and latency
  • Token usage (input, output, total)
  • Error rates (including rate limit errors)

Logs:

  • API request/response logging
  • Token usage logging
  • Rate limit warnings

Azure OpenAI

Traces:

  • Azure OpenAI API calls are traced with deployment information
  • Token usage and cost information
  • Azure-specific metadata (deployment name, endpoint)

Metrics:

  • Request counts by deployment
  • Token usage and costs
  • Error rates

Logs:

  • Deployment-specific logging
  • Azure authentication logging
  • Cost tracking logs

Ollama

Traces:

  • Local Ollama API calls are traced
  • Model information and response times
  • Local execution metrics

Metrics:

  • Request counts and latency
  • Model-specific metrics
  • Local execution performance

Logs:

  • Local API request/response logging
  • Model loading and execution logs

Azure AI Inference

Traces:

  • Azure AI Inference API calls are traced
  • Model catalog information
  • Inference-specific metadata

Metrics:

  • Request counts by model
  • Token usage and costs
  • Error rates

Logs:

  • Model catalog logging
  • Inference request/response logging
  • Cost tracking logs

Vector Store Observability

Vector Store Operations

Vector store operations are instrumented when UseVectorStore is enabled:

Traced Operations:

  • Vector upsert operations
  • Vector search operations
  • Collection management operations

Metrics:

  • Vector operation counts
  • Search latency
  • Vector store size

Logs:

  • Vector operation logging
  • Search query logging
  • Collection management logging

Embedding Generator Integration

Vector store operations automatically include embedding generator observability:

  • Embedding generation is traced as part of vector operations
  • Token usage for embeddings is tracked
  • Embedding generation latency is measured
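
When additional context is needed around vector operations, a custom span can wrap the call so that the embedding and vector spans produced by the built-in instrumentation nest underneath it. A minimal sketch, assuming a custom ActivitySource named OrderService.VectorSearch that is also registered via AddSource(...); Product and SearchProductsAsync are hypothetical stand-ins for the actual record type and vector store search call:

// Requires: using System.Diagnostics;
public sealed class ProductSearchService
{
    // Custom parent span for vector searches; AI/embedding spans emitted by the
    // built-in instrumentation become children of this activity.
    private static readonly ActivitySource VectorSearchSource = new("OrderService.VectorSearch");

    public async Task<IReadOnlyList<Product>> FindSimilarProductsAsync(string query)
    {
        using var activity = VectorSearchSource.StartActivity("vector.search");
        activity?.SetTag("search.query.length", query.Length); // avoid recording raw query text

        var results = await SearchProductsAsync(query);        // hypothetical vector store call
        activity?.SetTag("search.result.count", results.Count);
        return results;
    }
}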

Best Practices

Do's

  1. Enable OpenTelemetry for AI Services

    #if OpenTelemetry
        chatClientBuilder.UseOpenTelemetry();
    #endif

  2. Use Structured Logging
     • Logs automatically include trace context
     • Use log levels appropriately
     • Include relevant context in log messages

  3. Monitor Token Usage
     • Track token usage to control costs
     • Set up alerts for unexpected token usage
     • Optimize prompts to reduce token consumption

  4. Enable Caching
     • Use distributed caching to reduce costs
     • Monitor cache hit rates
     • Optimize cache keys for better hit rates

  5. Set Up Alerts
     • Alert on high error rates
     • Alert on rate limit errors
     • Alert on unexpected token usage

  6. Correlate AI Operations with Business Logic (see the sketch after this list)
     • Include business context in traces
     • Link AI operations to user actions
     • Track AI operations in business workflows
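
As referenced in the last item above, business context can be attached to the current trace around an AI call so that the AI spans are easy to correlate with the workflow that triggered them. A minimal sketch, assuming an injected IChatClient and a recent Microsoft.Extensions.AI surface where GetResponseAsync and ChatResponse.Text are available; the Order type and tag names are illustrative, not a fixed convention:

// Requires: using System.Diagnostics; using Microsoft.Extensions.AI;
// Tag the current activity with business context before calling the AI client;
// the AI spans emitted by UseOpenTelemetry() share the same trace and parent.
public async Task<string> SummarizeOrderAsync(Order order, IChatClient chatClient)
{
    Activity.Current?.SetTag("order.id", order.Id);             // illustrative business tags
    Activity.Current?.SetTag("workflow.step", "order.summary");

    var response = await chatClient.GetResponseAsync($"Summarize order {order.Id}");
    return response.Text;
}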

Don'ts

  1. Don't Log Sensitive Data

    // ❌ BAD - Logging sensitive data
    _logger.LogInformation("User prompt: {Prompt}", userPrompt);

    // ✅ GOOD - Sanitize or truncate sensitive data
    _logger.LogInformation("User prompt length: {Length}", userPrompt.Length);

  2. Don't Over-Instrument
     • Avoid creating too many spans for simple operations
     • Use appropriate log levels
     • Don't log every token usage in production

  3. Don't Ignore Errors
     • Always log AI errors
     • Include error context in traces
     • Set up error alerts

  4. Don't Skip Cost Tracking
     • Monitor token usage
     • Track costs per provider
     • Set up cost alerts

Grafana Dashboards

AI Observability Dashboard

Create Grafana dashboards to visualize AI operations:

Key Panels:

  • Request rate by provider
  • Average latency by model
  • Token usage over time
  • Error rate by provider
  • Cache hit rate
  • Cost estimates (when available)

Example Dashboard Queries:

# Request rate by provider
sum(rate(ai_requests_total[5m])) by (ai_provider)

# Average latency by model
avg(ai_requests_duration_seconds) by (ai_model)

# Token usage rate
sum(rate(ai_tokens_total[5m])) by (ai_provider)

# Error rate
sum(rate(ai_requests_errors_total[5m])) by (ai_provider) / sum(rate(ai_requests_total[5m])) by (ai_provider)

Alerting

  1. High Error Rate
     • Condition: Error rate > 5% for 5 minutes
     • Severity: Warning
     • Action: Investigate provider issues

  2. Rate Limit Errors
     • Condition: Rate limit errors > 0
     • Severity: Warning
     • Action: Review rate limits and implement backoff

  3. High Token Usage
     • Condition: Token usage > threshold
     • Severity: Info
     • Action: Review prompt optimization

  4. Slow Response Times
     • Condition: P95 latency > 5 seconds
     • Severity: Warning
     • Action: Investigate performance issues

  5. Low Cache Hit Rate
     • Condition: Cache hit rate < 50%
     • Severity: Info
     • Action: Review cache configuration

Troubleshooting

Issue: No AI Telemetry Appearing

Symptoms: AI traces, metrics, or logs not appearing in observability backend.

Solutions:

  1. Verify OpenTelemetry Configuration: Ensure UseOpenTelemetry() is called on AI client builders
  2. Check ActivitySource Registration: Verify Experimental.Microsoft.Extensions.AI* is registered
  3. Check Meter Registration: Verify the Experimental.Microsoft.Extensions.AI* meter is registered
  4. Review Log Categories: Ensure log categories are not filtered
  5. Check Exporter Configuration: Verify telemetry is being exported correctly

Issue: Missing Token Usage Information

Symptoms: Token usage not appearing in traces or metrics.

Solutions:

  1. Verify Provider Support: Some providers may not expose token usage
  2. Check Response Parsing: Ensure responses are being parsed correctly
  3. Review Logging Configuration: Token usage may be logged but not traced

Issue: High Observability Overhead

Symptoms: Performance degradation with AI observability enabled.

Solutions:

  1. Enable Sampling: Use trace sampling to reduce overhead (see the sampling sketch below)
  2. Review Log Levels: Use appropriate log levels in production
  3. Optimize Exports: Use efficient exporters (gRPC vs. HTTP)
  4. Batch Exports: Use batching when available
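
For the sampling option above, a minimal sketch using the OpenTelemetry SDK's built-in samplers (the 10% ratio is an arbitrary example):

// Program.cs (sketch) - sample 10% of new traces, but honor the parent's sampling decision
// so AI spans stay attached to traces already sampled upstream.
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10))));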

Summary

AI Observability in the ConnectSoft Microservice Template provides:

  • OpenTelemetry Integration: Automatic instrumentation for all AI operations
  • Structured Logging: Comprehensive logging with trace correlation
  • Metrics Collection: Performance, usage, and cost metrics
  • Distributed Tracing: End-to-end visibility into AI operations
  • Provider Support: Observability for all AI providers (OpenAI, Azure OpenAI, Ollama, Azure AI Inference)
  • Cache Observability: Monitoring of distributed caching operations
  • Vector Store Observability: Instrumentation for vector store operations
  • Cost Tracking: Token usage and cost monitoring
  • Error Detection: Comprehensive error logging and tracing
  • Performance Monitoring: Latency and throughput tracking

By leveraging AI observability, teams can:

  • Monitor Performance: Track AI operation latency and throughput
  • Control Costs: Monitor token usage and optimize spending
  • Debug Issues: Trace AI operations through distributed systems
  • Optimize Usage: Identify opportunities for caching and optimization
  • Ensure Quality: Monitor AI response quality and consistency
  • Maintain Reliability: Detect and respond to AI service issues quickly

AI observability is essential for building reliable, performant, and cost-effective AI-powered applications, providing the insights needed to understand, debug, and optimize AI operations at scale.