Vector Ingestion¶
Provider-agnostic foundation documentation for vector data ingestion patterns and best practices.
Table of Contents¶
- Concepts
- Architecture Patterns
- Best Practices
- Provider Support
- Performance Considerations
- Error Handling
- Deduplication
- Resumability
- Use Case Patterns
- Advanced Topics
Concepts¶
Chunking¶
Chunking is the process of breaking down large content into smaller, manageable pieces suitable for embedding and vector search.
Why Chunk?
- Embedding models have token limits
- Smaller chunks improve search precision
- Enables granular retrieval of relevant content
Chunking Strategies:
- Token-based: Split by estimated token count
- Structure-aware: Split by document structure (headers, paragraphs)
- Semantic: Split by semantic boundaries (future enhancement)
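For illustration, a simple token-based splitter can be sketched as follows. It assumes the common rough estimate of about four characters per token; `TokenChunker` is an illustrative name, not the pipeline's built-in chunker.

```csharp
using System;
using System.Collections.Generic;

public static class TokenChunker
{
    // Splits text into chunks of roughly maxTokens (assuming ~4 chars per token),
    // with a fixed overlap so context is preserved across chunk boundaries.
    public static IEnumerable<string> Chunk(string text, int maxTokens = 512, int overlapTokens = 64)
    {
        int maxChars = maxTokens * 4;
        int step = Math.Max(1, maxChars - overlapTokens * 4);  // guard against overlap >= size

        for (int start = 0; start < text.Length; start += step)
        {
            int length = Math.Min(maxChars, text.Length - start);
            yield return text.Substring(start, length);
            if (start + length >= text.Length) yield break;
        }
    }
}
```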
Embedding Generation¶
Embeddings are numeric vector representations of text that capture semantic meaning.
Key Considerations:
- Embedding model selection affects quality
- Vector dimensions vary by model
- Batch processing improves throughput
Indexing¶
Indexing stores embeddings in vector stores optimized for similarity search.
Indexing Operations:
- Upsert: Insert or update records
- Batch Upsert: Process multiple records efficiently
- Idempotent Upsert: Safe to retry without duplicates
Architecture Patterns¶
Provider-Agnostic Design¶
All ingestion logic uses abstractions from Microsoft.Extensions.AI and Microsoft.Extensions.VectorData:
- Embeddings: `IEmbeddingGenerator<string, Embedding<float>>`
- Vector Stores: `VectorStore` and `VectorStoreCollection<TKey, TRecord>`
This ensures the same code works across all providers without modification.
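To make this concrete, here is a minimal sketch of ingesting a single chunk through these abstractions. The `DocChunk` record, the `docs` collection name, and the 1536-dimension count are illustrative assumptions, and the attribute and method names (`VectorStoreKey`, `EnsureCollectionExistsAsync`, `UpsertAsync`) follow recent Microsoft.Extensions.VectorData releases and may differ by version.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.VectorData;

// Illustrative record type; attribute names may vary across library versions.
public sealed class DocChunk
{
    [VectorStoreKey]
    public string Id { get; set; } = string.Empty;

    [VectorStoreData]
    public string Text { get; set; } = string.Empty;

    [VectorStoreVector(1536)]  // dimension count is an assumption; match your embedding model
    public ReadOnlyMemory<float> Embedding { get; set; }
}

public static class ProviderAgnosticIngest
{
    // The same code runs against any provider; only the VectorStore
    // construction (Qdrant, Azure AI Search, pgvector, SQL Server) differs.
    public static async Task IngestOneAsync(
        VectorStore store,
        IEmbeddingGenerator<string, Embedding<float>> generator,
        string id,
        string text)
    {
        var collection = store.GetCollection<string, DocChunk>("docs");
        await collection.EnsureCollectionExistsAsync();

        var embeddings = await generator.GenerateAsync(new[] { text });
        await collection.UpsertAsync(new DocChunk { Id = id, Text = text, Embedding = embeddings[0].Vector });
    }
}
```

Swapping providers means changing only how the `VectorStore` instance is constructed; the ingestion code itself is untouched.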
Pipeline Pattern¶
The pipeline runs as discrete stages: chunking, embedding generation, and indexing. Each stage is independent and can be optimized separately.
Batch Processing Pattern¶
Process chunks in configurable batches:
- Improves throughput
- Reduces memory pressure
- Enables progress tracking
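A minimal sketch of the pattern, assuming an in-memory list of chunk texts; `Chunk` is the standard `Enumerable.Chunk` helper from .NET 6+, and the optional delay doubles as the throttling mechanism discussed under Best Practices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

public static class BatchIngest
{
    // Processes chunk texts in fixed-size batches; the optional delay between
    // batches throttles calls to stay under provider rate limits.
    public static async Task EmbedInBatchesAsync(
        IEmbeddingGenerator<string, Embedding<float>> generator,
        IReadOnlyList<string> chunks,
        int batchSize = 64,
        TimeSpan? throttle = null)
    {
        int done = 0;
        foreach (var batch in chunks.Chunk(batchSize))   // Enumerable.Chunk, .NET 6+
        {
            var embeddings = await generator.GenerateAsync(batch);
            // ...hand embeddings off to the indexing stage here...

            done += batch.Length;
            Console.WriteLine($"Processed {done}/{chunks.Count} chunks");  // progress tracking

            if (throttle is { } delay)
                await Task.Delay(delay);  // respect rate limits between batches
        }
    }
}
```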
Retry Pattern¶
Exponential backoff retry for transient failures:
- Network errors
- Rate limits
- Temporary service unavailability
Best Practices¶
Chunking Strategies¶
Choose the Right Chunker:
- Token Chunker: General-purpose, works for most content
- Markdown Chunker: Documentation, structured content
- Paragraph Chunker: Long-form content, articles
Chunk Size Guidelines:
- Small (256 tokens): High precision, many chunks
- Medium (512 tokens): Balanced precision and context
- Large (1024 tokens): More context, fewer chunks
Overlap Considerations:
- 10-20% overlap preserves context across boundaries
- Too much overlap wastes resources
- Too little overlap loses context
Batch Sizing¶
Guidelines:
- Start with 64 chunks per batch
- Increase for high-throughput scenarios
- Decrease if memory is constrained
- Monitor embedding generator rate limits
Throttling:
- Add delays between batches to respect rate limits
- Adjust based on provider capabilities
- Monitor for rate limit errors
Retry Policies¶
Transient Failures (Retry):
- Network timeouts
- HTTP 503 (Service Unavailable)
- HTTP 502 (Bad Gateway)
- Rate limit errors (429)
Permanent Failures (Don't Retry):
- Authentication errors (401, 403)
- Invalid content (400)
- Not found errors (404)
Backoff Strategy:
- Start with 1 second delay
- Double delay on each retry
- Cap at maximum delay (e.g., 30 seconds)
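A minimal sketch of this policy; real pipelines often also treat timeout exceptions as transient or delegate the policy to a resilience library such as Polly.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class Retry
{
    // Exponential backoff: start at 1 second, double per attempt, cap at 30 seconds.
    public static async Task<T> WithBackoffAsync<T>(Func<Task<T>> action, int maxAttempts = 5)
    {
        var delay = TimeSpan.FromSeconds(1);
        var maxDelay = TimeSpan.FromSeconds(30);

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (HttpRequestException ex) when (attempt < maxAttempts && IsTransient(ex))
            {
                await Task.Delay(delay);
                delay = delay * 2 > maxDelay ? maxDelay : delay * 2;
            }
        }
    }

    // 400/401/403/404 are not matched here, so permanent failures surface immediately.
    private static bool IsTransient(HttpRequestException ex) =>
        ex.StatusCode is HttpStatusCode.TooManyRequests    // 429
            or HttpStatusCode.ServiceUnavailable           // 503
            or HttpStatusCode.BadGateway;                  // 502
}
```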
Provider Support¶
Qdrant¶
Characteristics:
- High performance
- Low latency
- Good for large-scale deployments
Considerations:
- Supports gRPC and REST APIs
- Collection management required
- Good batch upsert performance
Azure AI Search¶
Characteristics:
- Managed service
- Integrated with Azure ecosystem
- Index lag may occur
Considerations:
- Index updates may take time to propagate
- Rate limits apply
- Good for enterprise scenarios
pgvector (PostgreSQL)¶
Characteristics:
- SQL-based
- ACID transactions
- Familiar tooling
Considerations:
- Requires PostgreSQL with pgvector extension
- Good for applications already using PostgreSQL
- Transaction support for consistency
SQL Server¶
Characteristics:
- Native SQL Server integration
- Native vector support in newer releases (SQL Server 2025, Azure SQL Database)
- Enterprise-grade reliability
Considerations:
- Requires SQL Server with vector support
- Good for Microsoft ecosystem
- Transaction support available
Performance Considerations¶
Throughput Optimization¶
Factors Affecting Throughput:
- Embedding Generator Latency: Primary bottleneck
- Vector Store Performance: Varies by provider
- Network Latency: Between services
- Batch Size: Larger batches = higher throughput (up to limits)
Optimization Strategies:
- Increase batch size (within memory limits)
- Reduce throttle delays (respect rate limits)
- Use faster embedding generators
- Choose high-performance vector stores
Latency Reduction¶
Strategies:
- Smaller batch sizes for faster feedback
- Parallel processing for multiple sources
- Optimize network paths
- Use local/regional services
Resource Management¶
Memory:
- Batch size directly affects memory usage
- Monitor memory during large ingestions
- Adjust batch size based on available memory
CPU:
- Chunking is CPU-intensive
- Consider parallel chunking for large files
- Monitor CPU usage during ingestion
Error Handling¶
Transient vs Permanent Failures¶
Transient Failures:
- Network timeouts
- Service unavailability
- Rate limits
- Action: Retry with exponential backoff
Permanent Failures:
- Invalid content
- Authentication errors
- Configuration errors
- Action: Log error and skip
Error Recovery¶
Checkpointing:
- Save progress at batch boundaries
- Resume from last successful batch
- Track failed chunks for retry
Partial Success:
- Continue processing after individual chunk failures
- Collect errors for review
- Report success/failure statistics
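A sketch of the partial-success loop; the per-chunk ingest step is passed in as a delegate because its shape is pipeline-specific.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PartialSuccess
{
    public static async Task RunAsync(
        IReadOnlyList<(string Id, string Text)> chunks,
        Func<(string Id, string Text), Task> ingestChunkAsync)  // hypothetical per-chunk step
    {
        var failures = new List<(string ChunkId, Exception Error)>();

        foreach (var chunk in chunks)
        {
            try { await ingestChunkAsync(chunk); }
            catch (Exception ex) { failures.Add((chunk.Id, ex)); }  // collect, don't abort
        }

        // Report success/failure statistics at the end of the run.
        Console.WriteLine($"Indexed {chunks.Count - failures.Count}/{chunks.Count} chunks; {failures.Count} failed");
    }
}
```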
Deduplication¶
Hash-Based Identification¶
Hash Computation:
- SHA256 hash of chunk text + source metadata
- Deterministic: same content = same hash
- Fast computation
Chunk ID Format:
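The concrete ID format is pipeline-specific; one plausible scheme, consistent with the hash computation above, combines the source identifier with a truncated content hash.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ChunkIds
{
    // Illustrative format: "{sourceId}-{first 16 hex chars of SHA256(sourceId + text)}".
    // Deterministic: the same source and text always produce the same ID.
    public static string Compute(string sourceId, string chunkText)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{sourceId}\n{chunkText}"));
        return $"{sourceId}-{Convert.ToHexString(hash)[..16].ToLowerInvariant()}";
    }
}
```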
Benefits:
- Prevents duplicate embeddings
- Enables idempotent re-ingestion
- Reduces storage costs
Deduplication Strategy¶
Before Embedding:
- Check if chunk exists by ID
- Skip embedding generation if found
- Reduces API calls and costs
Implementation:
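A minimal sketch, reusing the illustrative `DocChunk` record from the provider-agnostic example above; the get-by-key call follows Microsoft.Extensions.VectorData and may differ by version.

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.VectorData;

public static class Dedup
{
    // Returns false when the chunk already exists, skipping the embedding call entirely.
    public static async Task<bool> UpsertIfNewAsync(
        VectorStoreCollection<string, DocChunk> collection,
        IEmbeddingGenerator<string, Embedding<float>> generator,
        string chunkId,
        string text)
    {
        var existing = await collection.GetAsync(chunkId);
        if (existing is not null)
            return false;  // duplicate: no embedding API call, no upsert

        var embeddings = await generator.GenerateAsync(new[] { text });
        await collection.UpsertAsync(new DocChunk { Id = chunkId, Text = text, Embedding = embeddings[0].Vector });
        return true;
    }
}
```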
Resumability¶
Checkpoint Pattern¶
Checkpoint Data:
- Run ID
- Processed source IDs
- Chunk status (processed, indexed, failed)
- Timestamp
Checkpoint Storage:
- File system (development)
- Database (production)
- Distributed cache (scaled scenarios)
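An illustrative checkpoint shape and file-system store (the development option above); a production implementation would persist the same data to a database.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Illustrative checkpoint shape matching the fields listed above.
public sealed record IngestionCheckpoint(
    string RunId,
    HashSet<string> ProcessedSourceIds,
    Dictionary<string, string> ChunkStatus,   // chunkId -> "processed" | "indexed" | "failed"
    DateTimeOffset Timestamp);

public static class CheckpointStore
{
    // Save at batch boundaries so a resumed run can skip completed work.
    public static void Save(IngestionCheckpoint cp, string path) =>
        File.WriteAllText(path, JsonSerializer.Serialize(cp));

    public static IngestionCheckpoint? Load(string path) =>
        File.Exists(path)
            ? JsonSerializer.Deserialize<IngestionCheckpoint>(File.ReadAllText(path))
            : null;
}
```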
Resume Strategy¶
On Resume:
- Load checkpoint
- Identify processed sources
- Skip already-processed content
- Continue from last checkpoint
Benefits:
- No lost progress on failures
- Supports long-running ingestions
- Enables incremental updates
Implementation Considerations¶
Checkpoint Frequency:
- After each batch (recommended)
- After each source (for large sources)
- Periodic (for very long runs)
Checkpoint Cleanup:
- Remove old checkpoints
- Archive completed runs
- Monitor checkpoint storage size
Use Case Patterns¶
Pattern 1: Batch Ingestion¶
When to Use:
- Large-scale data migration
- Initial knowledge base population
- Periodic bulk updates
Characteristics:
- High batch sizes (64-128 chunks)
- Throttling between batches
- Progress tracking and checkpointing
- Error collection and reporting
Example Scenarios:
- Migrating legacy documentation to vector search
- Indexing historical product catalogs
- Populating knowledge bases from archives
Pattern 2: Real-Time Ingestion¶
When to Use:
- Live content updates
- Event-driven systems
- Time-sensitive information
Characteristics:
- Small batch sizes (8-16 chunks)
- Minimal throttling
- Low-latency processing
- Immediate searchability
Example Scenarios:
- News article indexing
- Social media content ingestion
- Real-time chat message indexing
Pattern 3: Incremental Updates¶
When to Use:
- Content that changes frequently
- Version-controlled documentation
- Collaborative content
Characteristics:
- Deduplication enabled
- Hash-based change detection
- Selective re-indexing
- Version tracking in metadata
Example Scenarios:
- Documentation updates
- Wiki page changes
- Product description updates
Pattern 4: Multi-Source Aggregation¶
When to Use:
- Combining data from multiple sources
- Unified search across systems
- Data integration scenarios
Characteristics:
- Multiple source identifiers
- Rich metadata for filtering
- Source-specific chunking strategies
- Unified collection or source-specific collections
Example Scenarios:
- Aggregating support tickets from multiple systems
- Combining documentation from multiple repositories
- Unified search across multiple knowledge bases
Pattern 5: Structured Document Processing¶
When to Use:
- Legal documents
- Technical documentation
- Academic papers
- Contracts and agreements
Characteristics:
- Structure-aware chunking (Markdown, Paragraph)
- Preserve document hierarchy
- Metadata-rich indexing
- Citation and reference tracking
Example Scenarios:
- Legal case file indexing
- Research paper databases
- Technical specification indexing
Pattern 6: Multi-Language Content¶
When to Use:
- International applications
- Global knowledge bases
- Multi-language support systems
Characteristics:
- Language detection and tagging
- Language-specific chunking strategies
- Cross-language search support
- Cultural context preservation
Example Scenarios:
- International customer support
- Multi-language documentation
- Global product catalogs
Advanced Topics¶
Custom Chunking Strategies¶
While the pipeline provides Token, Markdown, and Paragraph chunkers, you can implement custom chunking strategies:
When to Create Custom Chunkers:
- Domain-specific content structure
- Specialized formatting requirements
- Performance optimizations
- Unique metadata requirements
Implementation Considerations:
- Implement the `IChunker` interface
- Ensure deterministic chunking
- Preserve relevant metadata
- Handle edge cases gracefully
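The sketch below assumes a minimal `IChunker` shape for illustration (the pipeline's actual interface definition may differ); the point is the deterministic, structure-respecting split.

```csharp
using System;
using System.Collections.Generic;

// Assumed interface shape for illustration only.
public interface IChunker
{
    IEnumerable<string> Chunk(string content);
}

// Example custom chunker: split on sentence boundaries for domain content
// where sentences must never be broken apart. Deterministic by construction.
public sealed class SentenceChunker : IChunker
{
    public IEnumerable<string> Chunk(string content)
    {
        foreach (var sentence in content.Split('.', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries))
            yield return sentence + ".";
    }
}
```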
Metadata Strategy¶
Best Practices:
- Include source identifiers for traceability
- Add timestamps for version tracking
- Store filtering attributes (category, type, status)
- Preserve relationships (parent-child, references)
- Keep metadata lightweight (avoid large objects)
Metadata Patterns:
- Source Tracking: `source_id`, `source_type`, `source_version`
- Temporal: `created_at`, `updated_at`, `expires_at`
- Categorization: `category`, `tags`, `classification`
- Relationships: `parent_id`, `related_ids`, `references`
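Combined into a single lightweight shape, these patterns might look like the following illustrative class.

```csharp
using System;
using System.Collections.Generic;

// Illustrative metadata shape combining the patterns above; keep it small.
public sealed class ChunkMetadata
{
    // Source tracking
    public string SourceId { get; set; } = string.Empty;
    public string SourceType { get; set; } = string.Empty;
    public string? SourceVersion { get; set; }

    // Temporal
    public DateTimeOffset CreatedAt { get; set; }
    public DateTimeOffset? UpdatedAt { get; set; }

    // Categorization
    public string? Category { get; set; }
    public IList<string> Tags { get; set; } = new List<string>();

    // Relationships
    public string? ParentId { get; set; }
}
```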
Performance Optimization¶
Embedding Generation:
- Batch embedding requests when possible
- Cache embeddings for identical content
- Use appropriate embedding models for your use case
- Monitor embedding API rate limits
Vector Store Operations:
- Use batch upserts for efficiency
- Optimize collection indexes
- Consider vector dimensions and their impact
- Monitor storage growth and cleanup
Chunking Optimization:
- Choose chunk size based on embedding model limits
- Balance chunk count vs. context preservation
- Consider overlap requirements
- Profile chunking performance for large documents
Monitoring and Observability¶
Key Metrics to Track:
- Chunks processed per second
- Embedding generation latency
- Vector store upsert latency
- Deduplication hit rate
- Error rates by type
- Batch processing efficiency
Logging Best Practices:
- Log ingestion start/completion with timing
- Log chunk counts and statistics
- Log errors with context (chunk ID, source ID)
- Use structured logging for analysis
- Include correlation IDs for tracing
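A sketch of these practices using `Microsoft.Extensions.Logging`; message templates keep fields queryable, and the run ID serves as a correlation ID across stages.

```csharp
using System;
using Microsoft.Extensions.Logging;

public static class IngestionLog
{
    // Structured completion log: counts and timing as named template fields.
    public static void BatchCompleted(
        ILogger logger, string runId, int batchNumber, int chunkCount, TimeSpan elapsed) =>
        logger.LogInformation(
            "Ingestion run {RunId}: batch {BatchNumber} indexed {ChunkCount} chunks in {ElapsedMs} ms",
            runId, batchNumber, chunkCount, elapsed.TotalMilliseconds);

    // Error log with context: chunk ID and source ID for traceability.
    public static void ChunkFailed(
        ILogger logger, Exception ex, string runId, string sourceId, string chunkId) =>
        logger.LogError(ex,
            "Ingestion run {RunId}: chunk {ChunkId} from source {SourceId} failed",
            runId, chunkId, sourceId);
}
```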
Health Checks:
- Monitor vector store connectivity
- Verify embedding generator availability
- Check collection existence and health
- Track ingestion queue depth (if applicable)
Security Considerations¶
Data Privacy:
- Sanitize sensitive information before ingestion
- Use field-level encryption for sensitive metadata
- Implement access controls on collections
- Audit ingestion operations
Authentication:
- Secure embedding generator API keys
- Use managed identities where possible
- Rotate credentials regularly
- Monitor for unauthorized access
Compliance:
- Track data lineage through metadata
- Implement data retention policies
- Support data deletion requests
- Maintain audit logs
Scaling Strategies¶
Horizontal Scaling:
- Distribute ingestion across multiple instances
- Use message queues for ingestion requests
- Implement idempotent ingestion operations
- Coordinate checkpointing across instances
Vertical Scaling:
- Increase batch sizes for higher throughput
- Optimize memory usage for large batches
- Use async/await for I/O operations
- Profile and optimize hot paths
Resource Management:
- Monitor memory usage during batch processing
- Implement backpressure for high-load scenarios
- Use connection pooling for vector stores
- Implement circuit breakers for external services
Integration Patterns¶
Event-Driven Ingestion:
- Subscribe to content change events
- Process events asynchronously
- Handle event ordering and idempotency
- Implement dead-letter queues for failures
Scheduled Ingestion:
- Use cron jobs or schedulers for periodic ingestion
- Implement incremental update detection
- Track last ingestion timestamps
- Handle missed schedules gracefully
API-Driven Ingestion:
- Expose ingestion endpoints
- Validate input before processing
- Return ingestion status and results
- Support bulk ingestion requests
Troubleshooting Common Issues¶
Issue: Low Deduplication Hit Rate
- Cause: Chunk IDs not deterministic or cache cleared
- Solution: Verify hash computation, implement persistent cache
Issue: High Embedding API Costs
- Cause: Re-generating embeddings for duplicate content
- Solution: Improve deduplication, cache embeddings
Issue: Slow Ingestion Performance
- Cause: Small batch sizes, high throttling, network latency
- Solution: Increase batch size, reduce throttling, optimize network
Issue: Memory Pressure
- Cause: Large batch sizes, too many concurrent operations
- Solution: Reduce batch size, limit concurrency, implement streaming
Issue: Inconsistent Chunk Counts
- Cause: Non-deterministic chunking, configuration changes
- Solution: Verify chunker determinism, version configuration