Vector Ingestion

Provider-agnostic foundation documentation for vector data ingestion patterns and best practices.

Concepts

Chunking

Chunking is the process of breaking down large content into smaller, manageable pieces suitable for embedding and vector search.

Why Chunk?

  • Embedding models have token limits
  • Smaller chunks improve search precision
  • Enables granular retrieval of relevant content

Chunking Strategies:

  • Token-based: Split by estimated token count
  • Structure-aware: Split by document structure (headers, paragraphs)
  • Semantic: Split by semantic boundaries (future enhancement)
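
For illustration, a minimal token-based chunker might look like the sketch below. Token counts are estimated with a rough words-per-token heuristic; a production chunker would use the embedding model's actual tokenizer. All names here are hypothetical.

// Minimal sketch of a token-based chunker with overlap (hypothetical, not
// part of any library). Token count is estimated at ~0.75 words per token.
using System;
using System.Collections.Generic;

public static class TokenChunker
{
    public static IEnumerable<string> Chunk(string text, int maxTokens = 512, int overlapTokens = 50)
    {
        var words = text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries);
        int maxWords = (int)(maxTokens * 0.75);          // heuristic: tokens -> words
        int overlapWords = (int)(overlapTokens * 0.75);  // ~10% overlap by default
        int step = Math.Max(1, maxWords - overlapWords);

        for (int start = 0; start < words.Length; start += step)
        {
            int end = Math.Min(start + maxWords, words.Length);
            yield return string.Join(' ', words[start..end]);
        }
    }
}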

Embedding Generation

Embeddings are numeric vector representations of text that capture semantic meaning.

Key Considerations:

  • Embedding model selection affects quality
  • Vector dimensions vary by model
  • Batch processing improves throughput
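
A sketch of batched embedding generation against the Microsoft.Extensions.AI abstraction is shown below. It assumes the GenerateAsync overload that accepts a sequence of inputs; verify the exact signature against the package version you use.

// Sketch: embed a whole batch of chunk texts in one call to amortize
// per-request overhead. Assumes IEmbeddingGenerator.GenerateAsync accepts
// an IEnumerable<string> (Microsoft.Extensions.AI).
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

public static class EmbeddingStage
{
    public static async Task<List<Embedding<float>>> EmbedBatchAsync(
        IEmbeddingGenerator<string, Embedding<float>> generator,
        IReadOnlyList<string> texts,
        CancellationToken ct = default)
    {
        var embeddings = await generator.GenerateAsync(texts, cancellationToken: ct);
        return embeddings.ToList();
    }
}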

Indexing

Indexing stores embeddings in vector stores optimized for similarity search.

Indexing Operations:

  • Upsert: Insert or update records
  • Batch Upsert: Process multiple records efficiently
  • Idempotent Upsert: Safe to retry without duplicates

Architecture Patterns

Provider-Agnostic Design

All ingestion logic uses abstractions from Microsoft.Extensions.AI and Microsoft.Extensions.VectorData:

  • Embeddings: IEmbeddingGenerator<string, Embedding<float>>
  • Vector Stores: VectorStore and VectorStoreCollection<TKey, TRecord>

This ensures the same code works across all providers without modification.
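
As a sketch, a chunk record for these abstractions might look like the following. Attribute and method names have shifted across Microsoft.Extensions.VectorData releases, so treat the exact names as assumptions to verify against your package version.

// Sketch of a provider-agnostic chunk record, assuming the VectorStoreKey/
// VectorStoreData/VectorStoreVector attributes in recent
// Microsoft.Extensions.VectorData releases.
using System;
using Microsoft.Extensions.VectorData;

public sealed class ChunkRecord
{
    [VectorStoreKey]
    public string Id { get; set; } = string.Empty;        // deterministic chunk ID

    [VectorStoreData]
    public string Text { get; set; } = string.Empty;

    [VectorStoreData]
    public string SourceId { get; set; } = string.Empty;  // for filtering and traceability

    [VectorStoreVector(1536)]                             // dimensions must match the embedding model
    public ReadOnlyMemory<float> Embedding { get; set; }
}

// Because keys are deterministic, a batch upsert is idempotent:
// await collection.UpsertAsync(records, cancellationToken);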

Pipeline Pattern

Input → Chunking → Deduplication → Embedding → Indexing → Output

Each stage is independent and can be optimized separately.
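
One way to realize that independence is to model each stage as a swappable delegate, as in this hypothetical sketch:

// Hypothetical sketch: the pipeline as four delegate-valued stages, so any
// stage can be replaced or tuned without touching the others.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class IngestionPipeline
{
    public required Func<string, IReadOnlyList<string>> Chunk { get; init; }
    public required Func<IReadOnlyList<string>, CancellationToken, Task<IReadOnlyList<string>>> Deduplicate { get; init; }
    public required Func<IReadOnlyList<string>, CancellationToken, Task<IReadOnlyList<float[]>>> Embed { get; init; }
    public required Func<IReadOnlyList<string>, IReadOnlyList<float[]>, CancellationToken, Task> Index { get; init; }

    public async Task RunAsync(string content, CancellationToken ct = default)
    {
        var chunks     = Chunk(content);                  // Chunking
        var fresh      = await Deduplicate(chunks, ct);   // Deduplication
        var embeddings = await Embed(fresh, ct);          // Embedding
        await Index(fresh, embeddings, ct);               // Indexing
    }
}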

Batch Processing Pattern

Process chunks in configurable batches:

  • Improves throughput
  • Reduces memory pressure
  • Enables progress tracking
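
A minimal sketch, using .NET's Enumerable.Chunk (available since .NET 6) for batching; ProcessBatchAsync and the surrounding variables are hypothetical:

// Sketch: process chunks in configurable batches with simple progress tracking.
// ProcessBatchAsync stands in for the embed-and-upsert work on one batch.
using System;
using System.Linq;

int processed = 0;
foreach (string[] batch in allChunks.Chunk(batchSize))
{
    await ProcessBatchAsync(batch, ct);
    processed += batch.Length;
    Console.WriteLine($"Ingested {processed}/{allChunks.Count} chunks");
}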

Retry Pattern

Exponential backoff retry for transient failures:

  • Network errors
  • Rate limits
  • Temporary service unavailability
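
A minimal sketch of the pattern, using the backoff parameters recommended later in this document (1 second initial delay, doubling per attempt, capped at 30 seconds):

// Sketch: exponential backoff retry for transient failures. Here only
// HttpRequestException is treated as transient; widen the filter as needed.
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static async Task<T> RetryAsync<T>(
    Func<CancellationToken, Task<T>> operation,
    int maxAttempts = 5,
    CancellationToken ct = default)
{
    var delay = TimeSpan.FromSeconds(1);
    var maxDelay = TimeSpan.FromSeconds(30);

    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await operation(ct);
        }
        catch (HttpRequestException) when (attempt < maxAttempts)
        {
            await Task.Delay(delay, ct);                                          // wait before retrying
            delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks)); // double, then cap
        }
    }
}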

Best Practices

Chunking Strategies

Choose the Right Chunker:

  • Token Chunker: General-purpose, works for most content
  • Markdown Chunker: Documentation, structured content
  • Paragraph Chunker: Long-form content, articles

Chunk Size Guidelines:

  • Small (256 tokens): High precision, many chunks
  • Medium (512 tokens): Balanced precision and context
  • Large (1024 tokens): More context, fewer chunks

Overlap Considerations:

  • 10-20% overlap preserves context across boundaries
  • Too much overlap wastes resources
  • Too little overlap loses context

Batch Sizing

Guidelines:

  • Start with 64 chunks per batch
  • Increase for high-throughput scenarios
  • Decrease if memory is constrained
  • Monitor embedding generator rate limits

Throttling:

  • Add delays between batches to respect rate limits
  • Adjust based on provider capabilities
  • Monitor for rate limit errors

Retry Policies

Transient Failures (Retry):

  • Network timeouts
  • HTTP 503 (Service Unavailable)
  • HTTP 502 (Bad Gateway)
  • Rate limit errors (429)

Permanent Failures (Don't Retry):

  • Authentication errors (401, 403)
  • Invalid content (400)
  • Not found errors (404)
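
A sketch of this classification as a status-code predicate, following the two lists above:

// Sketch: decide whether an HTTP failure is worth retrying.
using System.Net;

public static bool IsTransient(HttpStatusCode status) => status switch
{
    HttpStatusCode.TooManyRequests    => true,   // 429: rate limited
    HttpStatusCode.BadGateway         => true,   // 502
    HttpStatusCode.ServiceUnavailable => true,   // 503
    HttpStatusCode.Unauthorized       => false,  // 401: fix credentials instead
    HttpStatusCode.Forbidden          => false,  // 403
    HttpStatusCode.BadRequest         => false,  // 400: invalid content
    HttpStatusCode.NotFound           => false,  // 404
    _                                 => false,  // default: do not retry
};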

Backoff Strategy:

  • Start with 1 second delay
  • Double delay on each retry
  • Cap at maximum delay (e.g., 30 seconds)

Provider Support

Qdrant

Characteristics:

  • High performance
  • Low latency
  • Good for large-scale deployments

Considerations:

  • Supports gRPC and REST APIs
  • Collection management required
  • Good batch upsert performance

Azure AI Search

Characteristics:

  • Managed service
  • Integrated with Azure ecosystem
  • Index lag may occur

Considerations:

  • Index updates may take time to propagate
  • Rate limits apply
  • Good for enterprise scenarios

pgvector (PostgreSQL)

Characteristics:

  • SQL-based
  • ACID transactions
  • Familiar tooling

Considerations:

  • Requires PostgreSQL with pgvector extension
  • Good for applications already using PostgreSQL
  • Transaction support for consistency

SQL Server

Characteristics:

  • Native SQL Server integration
  • Vector support in SQL Server 2022+
  • Enterprise-grade reliability

Considerations:

  • Requires SQL Server with vector support
  • Good for Microsoft ecosystem
  • Transaction support available

Performance Considerations

Throughput Optimization

Factors Affecting Throughput:

  1. Embedding Generator Latency: Primary bottleneck
  2. Vector Store Performance: Varies by provider
  3. Network Latency: Between services
  4. Batch Size: Larger batches = higher throughput (up to limits)

Optimization Strategies:

  • Increase batch size (within memory limits)
  • Reduce throttle delays (respect rate limits)
  • Use faster embedding generators
  • Choose high-performance vector stores

Latency Reduction

Strategies:

  • Smaller batch sizes for faster feedback
  • Parallel processing for multiple sources
  • Optimize network paths
  • Use local/regional services

Resource Management

Memory:

  • Batch size directly affects memory usage
  • Monitor memory during large ingestions
  • Adjust batch size based on available memory

CPU:

  • Chunking is CPU-intensive
  • Consider parallel chunking for large files
  • Monitor CPU usage during ingestion

Error Handling

Transient vs Permanent Failures

Transient Failures:

  • Network timeouts
  • Service unavailability
  • Rate limits
  • Action: Retry with exponential backoff

Permanent Failures:

  • Invalid content
  • Authentication errors
  • Configuration errors
  • Action: Log error and skip

Error Recovery

Checkpointing:

  • Save progress at batch boundaries
  • Resume from last successful batch
  • Track failed chunks for retry

Partial Success:

  • Continue processing after individual chunk failures
  • Collect errors for review
  • Report success/failure statistics

Deduplication

Hash-Based Identification

Hash Computation:

  • SHA256 hash of chunk text + source metadata
  • Deterministic: same content = same hash
  • Fast computation

Chunk ID Format:

{sourceId}:{chunkIndex}:{hash}
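
A minimal sketch of the computation, hashing the chunk text together with the source identifier as the metadata component:

// Sketch: deterministic chunk ID in the {sourceId}:{chunkIndex}:{hash} format.
using System;
using System.Security.Cryptography;
using System.Text;

public static string ComputeChunkId(string sourceId, int chunkIndex, string chunkText)
{
    // SHA256 over chunk text + source metadata: same content => same hash.
    byte[] input = Encoding.UTF8.GetBytes($"{sourceId}\n{chunkText}");
    string hash = Convert.ToHexString(SHA256.HashData(input)).ToLowerInvariant();
    return $"{sourceId}:{chunkIndex}:{hash}";
}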

Benefits:

  • Prevents duplicate embeddings
  • Enables idempotent re-ingestion
  • Reduces storage costs

Deduplication Strategy

Before Embedding:

  • Check if chunk exists by ID
  • Skip embedding generation if found
  • Reduces API calls and costs

Implementation:

// Inside the ingestion loop: if the chunk's deterministic ID is already
// indexed, skip the (costly) embedding call and move to the next chunk.
if (chunkExists)
{
    chunksSkipped++;
    continue;
}
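
A fuller sketch of the check against the store, assuming the GetAsync-by-key method on VectorStoreCollection returns null for a missing key (verify against your Microsoft.Extensions.VectorData version); chunk, chunks, and toEmbed are hypothetical:

// Sketch: skip the embedding call for chunks whose deterministic ID
// is already present in the collection.
int chunksSkipped = 0;
foreach (var chunk in chunks)
{
    string id = ComputeChunkId(sourceId, chunk.Index, chunk.Text);
    if (await collection.GetAsync(id) is not null)
    {
        chunksSkipped++;
        continue;   // already indexed: no embedding call, no upsert
    }
    toEmbed.Add((id, chunk));
}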

Resumability

Checkpoint Pattern

Checkpoint Data:

  • Run ID
  • Processed source IDs
  • Chunk status (processed, indexed, failed)
  • Timestamp

Checkpoint Storage:

  • File system (development)
  • Database (production)
  • Distributed cache (scaled scenarios)
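
A minimal sketch of the file-system option, serialized with System.Text.Json; all type and member names are hypothetical:

// Sketch: checkpoint record plus file-based save/load.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public sealed record Checkpoint(
    string RunId,
    HashSet<string> ProcessedSourceIds,
    Dictionary<string, string> ChunkStatus,   // chunkId -> processed | indexed | failed
    DateTimeOffset Timestamp);

public static class CheckpointStore
{
    public static Task SaveAsync(string path, Checkpoint cp, CancellationToken ct) =>
        File.WriteAllTextAsync(path, JsonSerializer.Serialize(cp), ct);

    public static async Task<Checkpoint?> LoadAsync(string path, CancellationToken ct) =>
        File.Exists(path)
            ? JsonSerializer.Deserialize<Checkpoint>(await File.ReadAllTextAsync(path, ct))
            : null;
}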

Resume Strategy

On Resume:

  1. Load checkpoint
  2. Identify processed sources
  3. Skip already-processed content
  4. Continue from last checkpoint
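
Using the CheckpointStore sketch above, resuming reduces to loading the checkpoint and filtering out completed work (sources and checkpointPath are hypothetical):

// Sketch: skip sources the checkpoint marks as processed.
using System.Linq;

var checkpoint = await CheckpointStore.LoadAsync(checkpointPath, ct);
var remaining = checkpoint is null
    ? sources
    : sources.Where(s => !checkpoint.ProcessedSourceIds.Contains(s.Id));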

Benefits:

  • No lost progress on failures
  • Supports long-running ingestions
  • Enables incremental updates

Implementation Considerations

Checkpoint Frequency:

  • After each batch (recommended)
  • After each source (for large sources)
  • Periodic (for very long runs)

Checkpoint Cleanup:

  • Remove old checkpoints
  • Archive completed runs
  • Monitor checkpoint storage size

Use Case Patterns

Pattern 1: Batch Ingestion

When to Use:

  • Large-scale data migration
  • Initial knowledge base population
  • Periodic bulk updates

Characteristics:

  • High batch sizes (64-128 chunks)
  • Throttling between batches
  • Progress tracking and checkpointing
  • Error collection and reporting
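
These characteristics map naturally onto an options type; the names below are hypothetical:

// Hypothetical options for Pattern 1, mirroring the characteristics above.
using System;

public sealed record BatchIngestionOptions
{
    public int BatchSize { get; init; } = 128;                                      // high batch sizes (64-128)
    public TimeSpan ThrottleDelay { get; init; } = TimeSpan.FromMilliseconds(500);  // throttling between batches
    public string CheckpointPath { get; init; } = "ingestion.checkpoint.json";      // progress tracking
    public bool CollectErrors { get; init; } = true;                                // error collection and reporting
}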

Example Scenarios:

  • Migrating legacy documentation to vector search
  • Indexing historical product catalogs
  • Populating knowledge bases from archives

Pattern 2: Real-Time Ingestion

When to Use:

  • Live content updates
  • Event-driven systems
  • Time-sensitive information

Characteristics:

  • Small batch sizes (8-16 chunks)
  • Minimal throttling
  • Low-latency processing
  • Immediate searchability

Example Scenarios:

  • News article indexing
  • Social media content ingestion
  • Real-time chat message indexing

Pattern 3: Incremental Updates

When to Use:

  • Content that changes frequently
  • Version-controlled documentation
  • Collaborative content

Characteristics:

  • Deduplication enabled
  • Hash-based change detection
  • Selective re-indexing
  • Version tracking in metadata

Example Scenarios:

  • Documentation updates
  • Wiki page changes
  • Product description updates

Pattern 4: Multi-Source Aggregation

When to Use:

  • Combining data from multiple sources
  • Unified search across systems
  • Data integration scenarios

Characteristics:

  • Multiple source identifiers
  • Rich metadata for filtering
  • Source-specific chunking strategies
  • Unified collection or source-specific collections

Example Scenarios:

  • Aggregating support tickets from multiple systems
  • Combining documentation from multiple repositories
  • Unified search across multiple knowledge bases

Pattern 5: Structured Document Processing

When to Use:

  • Legal documents
  • Technical documentation
  • Academic papers
  • Contracts and agreements

Characteristics:

  • Structure-aware chunking (Markdown, Paragraph)
  • Preserve document hierarchy
  • Metadata-rich indexing
  • Citation and reference tracking

Example Scenarios:

  • Legal case file indexing
  • Research paper databases
  • Technical specification indexing

Pattern 6: Multi-Language Content

When to Use:

  • International applications
  • Global knowledge bases
  • Multi-language support systems

Characteristics:

  • Language detection and tagging
  • Language-specific chunking strategies
  • Cross-language search support
  • Cultural context preservation

Example Scenarios:

  • International customer support
  • Multi-language documentation
  • Global product catalogs

Advanced Topics

Custom Chunking Strategies

While the pipeline provides Token, Markdown, and Paragraph chunkers, you can implement custom chunking strategies:

When to Create Custom Chunkers:

  • Domain-specific content structure
  • Specialized formatting requirements
  • Performance optimizations
  • Unique metadata requirements

Implementation Considerations:

  • Implement IChunker interface
  • Ensure deterministic chunking
  • Preserve relevant metadata
  • Handle edge cases gracefully
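
One plausible shape for that interface is sketched below; the actual definition in your pipeline may differ:

// Hypothetical sketch of an IChunker interface and its supporting types.
using System.Collections.Generic;

public interface IChunker
{
    // Must be deterministic: the same input must always yield the same
    // chunks, or hash-based chunk IDs and deduplication break.
    IReadOnlyList<TextChunk> Chunk(string content, ChunkingOptions options);
}

public sealed record TextChunk(
    int Index,
    string Text,
    IReadOnlyDictionary<string, string> Metadata);

public sealed record ChunkingOptions(int MaxTokens = 512, int OverlapTokens = 50);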

Metadata Strategy

Best Practices:

  • Include source identifiers for traceability
  • Add timestamps for version tracking
  • Store filtering attributes (category, type, status)
  • Preserve relationships (parent-child, references)
  • Keep metadata lightweight (avoid large objects)

Metadata Patterns:

  • Source Tracking: source_id, source_type, source_version
  • Temporal: created_at, updated_at, expires_at
  • Categorization: category, tags, classification
  • Relationships: parent_id, related_ids, references
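
For example, metadata following these patterns might look like this (values are illustrative):

// Sketch: lightweight, scalar-valued metadata for one chunk.
using System.Collections.Generic;

var metadata = new Dictionary<string, string>
{
    ["source_id"]      = "docs-repo",             // source tracking
    ["source_type"]    = "markdown",
    ["source_version"] = "v2.3.1",
    ["created_at"]     = "2024-01-15T09:00:00Z",  // temporal
    ["category"]       = "how-to",                // categorization
    ["parent_id"]      = "docs-repo:readme",      // relationships
};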

Performance Optimization

Embedding Generation:

  • Batch embedding requests when possible
  • Cache embeddings for identical content
  • Use appropriate embedding models for your use case
  • Monitor embedding API rate limits

Vector Store Operations:

  • Use batch upserts for efficiency
  • Optimize collection indexes
  • Consider vector dimensions and their impact
  • Monitor storage growth and cleanup

Chunking Optimization:

  • Choose chunk size based on embedding model limits
  • Balance chunk count vs. context preservation
  • Consider overlap requirements
  • Profile chunking performance for large documents

Monitoring and Observability

Key Metrics to Track:

  • Chunks processed per second
  • Embedding generation latency
  • Vector store upsert latency
  • Deduplication hit rate
  • Error rates by type
  • Batch processing efficiency

Logging Best Practices:

  • Log ingestion start/completion with timing
  • Log chunk counts and statistics
  • Log errors with context (chunk ID, source ID)
  • Use structured logging for analysis
  • Include correlation IDs for tracing
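
A sketch with Microsoft.Extensions.Logging message templates, which keep fields queryable in structured sinks (the logger and variables are assumed to exist in scope):

// Sketch: structured ingestion logs with context and a correlation ID.
using Microsoft.Extensions.Logging;

logger.LogInformation(
    "Ingestion {RunId} completed: {ChunkCount} chunks in {ElapsedMs} ms ({SkippedCount} deduplicated)",
    runId, chunkCount, elapsedMs, skippedCount);

logger.LogError(ex,
    "Failed to index chunk {ChunkId} from source {SourceId} in run {RunId}",
    chunkId, sourceId, runId);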

Health Checks:

  • Monitor vector store connectivity
  • Verify embedding generator availability
  • Check collection existence and health
  • Track ingestion queue depth (if applicable)

Security Considerations

Data Privacy:

  • Sanitize sensitive information before ingestion
  • Use field-level encryption for sensitive metadata
  • Implement access controls on collections
  • Audit ingestion operations

Authentication:

  • Secure embedding generator API keys
  • Use managed identities where possible
  • Rotate credentials regularly
  • Monitor for unauthorized access

Compliance:

  • Track data lineage through metadata
  • Implement data retention policies
  • Support data deletion requests
  • Maintain audit logs

Scaling Strategies

Horizontal Scaling:

  • Distribute ingestion across multiple instances
  • Use message queues for ingestion requests
  • Implement idempotent ingestion operations
  • Coordinate checkpointing across instances

Vertical Scaling:

  • Increase batch sizes for higher throughput
  • Optimize memory usage for large batches
  • Use async/await for I/O operations
  • Profile and optimize hot paths

Resource Management:

  • Monitor memory usage during batch processing
  • Implement backpressure for high-load scenarios
  • Use connection pooling for vector stores
  • Implement circuit breakers for external services

Integration Patterns

Event-Driven Ingestion:

  • Subscribe to content change events
  • Process events asynchronously
  • Handle event ordering and idempotency
  • Implement dead-letter queues for failures

Scheduled Ingestion:

  • Use cron jobs or schedulers for periodic ingestion
  • Implement incremental update detection
  • Track last ingestion timestamps
  • Handle missed schedules gracefully

API-Driven Ingestion:

  • Expose ingestion endpoints
  • Validate input before processing
  • Return ingestion status and results
  • Support bulk ingestion requests

Troubleshooting Common Issues

Issue: Low Deduplication Hit Rate

  • Cause: Chunk IDs not deterministic or cache cleared
  • Solution: Verify hash computation, implement persistent cache

Issue: High Embedding API Costs

  • Cause: Re-generating embeddings for duplicate content
  • Solution: Improve deduplication, cache embeddings

Issue: Slow Ingestion Performance

  • Cause: Small batch sizes, high throttling, network latency
  • Solution: Increase batch size, reduce throttling, optimize network

Issue: Memory Pressure

  • Cause: Large batch sizes, too many concurrent operations
  • Solution: Reduce batch size, limit concurrency, implement streaming

Issue: Inconsistent Chunk Counts

  • Cause: Non-deterministic chunking, configuration changes
  • Solution: Verify chunker determinism, version configuration
