Vector Ingestion¶
Provider-agnostic foundation documentation for vector data ingestion patterns and best practices.
Table of Contents¶
- Concepts
- Architecture Patterns
- Best Practices
- Provider Support
- Performance Considerations
- Error Handling
- Deduplication
- Resumability
- Use Case Patterns
- Advanced Topics
Concepts¶
Chunking¶
Chunking is the process of breaking down large content into smaller, manageable pieces suitable for embedding and vector search.
Why Chunk?
- Embedding models have token limits
- Smaller chunks improve search precision
- Enables granular retrieval of relevant content
Chunking Strategies:
- Token-based: Split by estimated token count
- Structure-aware: Split by document structure (headers, paragraphs)
- Semantic: Split by semantic boundaries (future enhancement)
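For illustration, a simple token-based splitter can be sketched as follows. It assumes the common rough estimate of about four characters per token; `TokenChunker` is an illustrative name, not the pipeline's built-in chunker.

```csharp
using System;
using System.Collections.Generic;

public static class TokenChunker
{
    // Splits text into chunks of roughly maxTokens (assuming ~4 chars per token),
    // with a fixed overlap so context is preserved across chunk boundaries.
    public static IEnumerable<string> Chunk(string text, int maxTokens = 512, int overlapTokens = 64)
    {
        int maxChars = maxTokens * 4;
        int step = Math.Max(1, maxChars - overlapTokens * 4);  // guard against overlap >= size

        for (int start = 0; start < text.Length; start += step)
        {
            int length = Math.Min(maxChars, text.Length - start);
            yield return text.Substring(start, length);
            if (start + length >= text.Length) yield break;
        }
    }
}
```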
Embedding Generation¶
Embeddings are numeric vector representations of text that capture semantic meaning.
Key Considerations:
- Embedding model selection affects quality
- Vector dimensions vary by model
- Batch processing improves throughput
Indexing¶
Indexing stores embeddings in vector stores optimized for similarity search.
Indexing Operations:
- Upsert: Insert or update records
- Batch Upsert: Process multiple records efficiently
- Idempotent Upsert: Safe to retry without duplicates
Architecture Patterns¶
Provider-Agnostic Design¶
All ingestion logic uses abstractions from Microsoft.Extensions.AI and Microsoft.Extensions.VectorData:
- Embeddings: `IEmbeddingGenerator<string, Embedding<float>>`
- Vector Stores: `VectorStore` and `VectorStoreCollection<TKey, TRecord>`
This ensures the same code works across all providers without modification.
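To make this concrete, here is a minimal sketch of ingesting a single chunk through these abstractions. The `DocChunk` record, the `docs` collection name, and the 1536-dimension count are illustrative assumptions, and the attribute and method names (`VectorStoreKey`, `EnsureCollectionExistsAsync`, `UpsertAsync`) follow recent Microsoft.Extensions.VectorData releases and may differ by version.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.VectorData;

// Illustrative record type; attribute names may vary across library versions.
public sealed class DocChunk
{
    [VectorStoreKey]
    public string Id { get; set; } = string.Empty;

    [VectorStoreData]
    public string Text { get; set; } = string.Empty;

    [VectorStoreVector(1536)]  // dimension count is an assumption; match your embedding model
    public ReadOnlyMemory<float> Embedding { get; set; }
}

public static class ProviderAgnosticIngest
{
    // The same code runs against any provider; only the VectorStore
    // construction (Qdrant, Azure AI Search, pgvector, SQL Server) differs.
    public static async Task IngestOneAsync(
        VectorStore store,
        IEmbeddingGenerator<string, Embedding<float>> generator,
        string id,
        string text)
    {
        var collection = store.GetCollection<string, DocChunk>("docs");
        await collection.EnsureCollectionExistsAsync();

        var embeddings = await generator.GenerateAsync(new[] { text });
        await collection.UpsertAsync(new DocChunk { Id = id, Text = text, Embedding = embeddings[0].Vector });
    }
}
```

Swapping providers means changing only how the `VectorStore` instance is constructed; the ingestion code itself is untouched.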
Pipeline Pattern¶
The pipeline runs as discrete stages: chunking, embedding generation, and indexing. Each stage is independent and can be optimized separately.
Batch Processing Pattern¶
Process chunks in configurable batches:
- Improves throughput
- Reduces memory pressure
- Enables progress tracking
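A minimal sketch of the pattern, assuming an in-memory list of chunk texts; `Chunk` is the standard `Enumerable.Chunk` helper from .NET 6+, and the optional delay doubles as the throttling mechanism discussed under Best Practices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

public static class BatchIngest
{
    // Processes chunk texts in fixed-size batches; the optional delay between
    // batches throttles calls to stay under provider rate limits.
    public static async Task EmbedInBatchesAsync(
        IEmbeddingGenerator<string, Embedding<float>> generator,
        IReadOnlyList<string> chunks,
        int batchSize = 64,
        TimeSpan? throttle = null)
    {
        int done = 0;
        foreach (var batch in chunks.Chunk(batchSize))   // Enumerable.Chunk, .NET 6+
        {
            var embeddings = await generator.GenerateAsync(batch);
            // ...hand embeddings off to the indexing stage here...

            done += batch.Length;
            Console.WriteLine($"Processed {done}/{chunks.Count} chunks");  // progress tracking

            if (throttle is { } delay)
                await Task.Delay(delay);  // respect rate limits between batches
        }
    }
}
```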
Retry Pattern¶
Exponential backoff retry for transient failures:
- Network errors
- Rate limits
- Temporary service unavailability
Best Practices¶
Chunking Strategies¶
Choose the Right Chunker:
- Token Chunker: General-purpose, works for most content
- Markdown Chunker: Documentation, structured content
- Paragraph Chunker: Long-form content, articles
Chunk Size Guidelines:
- Small (256 tokens): High precision, many chunks
- Medium (512 tokens): Balanced precision and context
- Large (1024 tokens): More context, fewer chunks
Overlap Considerations:
- 10-20% overlap preserves context across boundaries
- Too much overlap wastes resources
- Too little overlap loses context
Batch Sizing¶
Guidelines:
- Start with 64 chunks per batch
- Increase for high-throughput scenarios
- Decrease if memory is constrained
- Monitor embedding generator rate limits
Throttling:
- Add delays between batches to respect rate limits
- Adjust based on provider capabilities
- Monitor for rate limit errors
Retry Policies¶
Transient Failures (Retry):
- Network timeouts
- HTTP 503 (Service Unavailable)
- HTTP 502 (Bad Gateway)
- Rate limit errors (429)
Permanent Failures (Don't Retry):
- Authentication errors (401, 403)
- Invalid content (400)
- Not found errors (404)
Backoff Strategy:
- Start with 1 second delay
- Double delay on each retry
- Cap at maximum delay (e.g., 30 seconds)
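A minimal sketch of this policy; real pipelines often also treat timeout exceptions as transient or delegate the policy to a resilience library such as Polly.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class Retry
{
    // Exponential backoff: start at 1 second, double per attempt, cap at 30 seconds.
    public static async Task<T> WithBackoffAsync<T>(Func<Task<T>> action, int maxAttempts = 5)
    {
        var delay = TimeSpan.FromSeconds(1);
        var maxDelay = TimeSpan.FromSeconds(30);

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (HttpRequestException ex) when (attempt < maxAttempts && IsTransient(ex))
            {
                await Task.Delay(delay);
                delay = delay * 2 > maxDelay ? maxDelay : delay * 2;
            }
        }
    }

    // 400/401/403/404 are not matched here, so permanent failures surface immediately.
    private static bool IsTransient(HttpRequestException ex) =>
        ex.StatusCode is HttpStatusCode.TooManyRequests    // 429
            or HttpStatusCode.ServiceUnavailable           // 503
            or HttpStatusCode.BadGateway;                  // 502
}
```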
Provider Support¶
Qdrant¶
Characteristics:
- High performance
- Low latency
- Good for large-scale deployments
Considerations:
- Supports gRPC and REST APIs
- Collection management required
- Good batch upsert performance
Azure AI Search¶
Characteristics:
- Managed service
- Integrated with Azure ecosystem
- Index lag may occur
Considerations:
- Index updates may take time to propagate
- Rate limits apply
- Good for enterprise scenarios
pgvector (PostgreSQL)¶
Characteristics:
- SQL-based
- ACID transactions
- Familiar tooling
Considerations:
- Requires PostgreSQL with pgvector extension
- Good for applications already using PostgreSQL
- Transaction support for consistency
SQL Server¶
Characteristics:
- Native SQL Server integration
- Native vector support in newer releases (SQL Server 2025, Azure SQL Database)
- Enterprise-grade reliability
Considerations:
- Requires SQL Server with vector support
- Good for Microsoft ecosystem
- Transaction support available
Performance Considerations¶
Throughput Optimization¶
Factors Affecting Throughput:
- Embedding Generator Latency: Primary bottleneck
- Vector Store Performance: Varies by provider
- Network Latency: Between services
- Batch Size: Larger batches = higher throughput (up to limits)
Optimization Strategies:
- Increase batch size (within memory limits)
- Reduce throttle delays (respect rate limits)
- Use faster embedding generators
- Choose high-performance vector stores
Latency Reduction¶
Strategies:
- Smaller batch sizes for faster feedback
- Parallel processing for multiple sources
- Optimize network paths
- Use local/regional services
Resource Management¶
Memory:
- Batch size directly affects memory usage
- Monitor memory during large ingestions
- Adjust batch size based on available memory
CPU:
- Chunking is CPU-intensive
- Consider parallel chunking for large files
- Monitor CPU usage during ingestion
Error Handling¶
Transient vs Permanent Failures¶
Transient Failures:
- Network timeouts
- Service unavailability
- Rate limits
- Action: Retry with exponential backoff
Permanent Failures:
- Invalid content
- Authentication errors
- Configuration errors
- Action: Log error and skip
Error Recovery¶
Checkpointing:
- Save progress at batch boundaries
- Resume from last successful batch
- Track failed chunks for retry
Partial Success:
- Continue processing after individual chunk failures
- Collect errors for review
- Report success/failure statistics
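A sketch of the partial-success loop; the per-chunk ingest step is passed in as a delegate because its shape is pipeline-specific.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PartialSuccess
{
    public static async Task RunAsync(
        IReadOnlyList<(string Id, string Text)> chunks,
        Func<(string Id, string Text), Task> ingestChunkAsync)  // hypothetical per-chunk step
    {
        var failures = new List<(string ChunkId, Exception Error)>();

        foreach (var chunk in chunks)
        {
            try { await ingestChunkAsync(chunk); }
            catch (Exception ex) { failures.Add((chunk.Id, ex)); }  // collect, don't abort
        }

        // Report success/failure statistics at the end of the run.
        Console.WriteLine($"Indexed {chunks.Count - failures.Count}/{chunks.Count} chunks; {failures.Count} failed");
    }
}
```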
Deduplication¶
Hash-Based Identification¶
Hash Computation:
- SHA256 hash of chunk text + source metadata
- Deterministic: same content = same hash
- Fast computation
Chunk ID Format:
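The concrete ID format is pipeline-specific; one plausible scheme, consistent with the hash computation above, combines the source identifier with a truncated content hash.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ChunkIds
{
    // Illustrative format: "{sourceId}-{first 16 hex chars of SHA256(sourceId + text)}".
    // Deterministic: the same source and text always produce the same ID.
    public static string Compute(string sourceId, string chunkText)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{sourceId}\n{chunkText}"));
        return $"{sourceId}-{Convert.ToHexString(hash)[..16].ToLowerInvariant()}";
    }
}
```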
Benefits:
- Prevents duplicate embeddings
- Enables idempotent re-ingestion
- Reduces storage costs
Deduplication Strategy¶
Before Embedding:
- Check if chunk exists by ID
- Skip embedding generation if found
- Reduces API calls and costs
Implementation:
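A minimal sketch, reusing the illustrative `DocChunk` record from the provider-agnostic example above; the get-by-key call follows Microsoft.Extensions.VectorData and may differ by version.

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.VectorData;

public static class Dedup
{
    // Returns false when the chunk already exists, skipping the embedding call entirely.
    public static async Task<bool> UpsertIfNewAsync(
        VectorStoreCollection<string, DocChunk> collection,
        IEmbeddingGenerator<string, Embedding<float>> generator,
        string chunkId,
        string text)
    {
        var existing = await collection.GetAsync(chunkId);
        if (existing is not null)
            return false;  // duplicate: no embedding API call, no upsert

        var embeddings = await generator.GenerateAsync(new[] { text });
        await collection.UpsertAsync(new DocChunk { Id = chunkId, Text = text, Embedding = embeddings[0].Vector });
        return true;
    }
}
```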
Resumability¶
Checkpoint Pattern¶
Checkpoint Data:
- Run ID
- Processed source IDs
- Chunk status (processed, indexed, failed)
- Timestamp
Checkpoint Storage:
- File system (development)
- Database (production)
- Distributed cache (scaled scenarios)
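An illustrative checkpoint shape and file-system store (the development option above); a production implementation would persist the same data to a database.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Illustrative checkpoint shape matching the fields listed above.
public sealed record IngestionCheckpoint(
    string RunId,
    HashSet<string> ProcessedSourceIds,
    Dictionary<string, string> ChunkStatus,   // chunkId -> "processed" | "indexed" | "failed"
    DateTimeOffset Timestamp);

public static class CheckpointStore
{
    // Save at batch boundaries so a resumed run can skip completed work.
    public static void Save(IngestionCheckpoint cp, string path) =>
        File.WriteAllText(path, JsonSerializer.Serialize(cp));

    public static IngestionCheckpoint? Load(string path) =>
        File.Exists(path)
            ? JsonSerializer.Deserialize<IngestionCheckpoint>(File.ReadAllText(path))
            : null;
}
```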
Resume Strategy¶
On Resume:
- Load checkpoint
- Identify processed sources
- Skip already-processed content
- Continue from last checkpoint
Benefits:
- No lost progress on failures
- Supports long-running ingestions
- Enables incremental updates
Implementation Considerations¶
Checkpoint Frequency:
- After each batch (recommended)
- After each source (for large sources)
- Periodic (for very long runs)
Checkpoint Cleanup:
- Remove old checkpoints
- Archive completed runs
- Monitor checkpoint storage size
Use Case Patterns¶
Pattern 1: Batch Ingestion¶
When to Use:
- Large-scale data migration
- Initial knowledge base population
- Periodic bulk updates
Characteristics:
- High batch sizes (64-128 chunks)
- Throttling between batches
- Progress tracking and checkpointing
- Error collection and reporting
Example Scenarios:
- Migrating legacy documentation to vector search
- Indexing historical product catalogs
- Populating knowledge bases from archives
Pattern 2: Real-Time Ingestion¶
When to Use:
- Live content updates
- Event-driven systems
- Time-sensitive information
Characteristics:
- Small batch sizes (8-16 chunks)
- Minimal throttling
- Low-latency processing
- Immediate searchability
Example Scenarios:
- News article indexing
- Social media content ingestion
- Real-time chat message indexing
Pattern 3: Incremental Updates¶
When to Use:
- Content that changes frequently
- Version-controlled documentation
- Collaborative content
Characteristics:
- Deduplication enabled
- Hash-based change detection
- Selective re-indexing
- Version tracking in metadata
Example Scenarios:
- Documentation updates
- Wiki page changes
- Product description updates
Pattern 4: Multi-Source Aggregation¶
When to Use:
- Combining data from multiple sources
- Unified search across systems
- Data integration scenarios
Characteristics:
- Multiple source identifiers
- Rich metadata for filtering
- Source-specific chunking strategies
- Unified collection or source-specific collections
Example Scenarios:
- Aggregating support tickets from multiple systems
- Combining documentation from multiple repositories
- Unified search across multiple knowledge bases
Pattern 5: Structured Document Processing¶
When to Use:
- Legal documents
- Technical documentation
- Academic papers
- Contracts and agreements
Characteristics:
- Structure-aware chunking (Markdown, Paragraph)
- Preserve document hierarchy
- Metadata-rich indexing
- Citation and reference tracking
Example Scenarios:
- Legal case file indexing
- Research paper databases
- Technical specification indexing
Pattern 6: Multi-Language Content¶
When to Use:
- International applications
- Global knowledge bases
- Multi-language support systems
Characteristics:
- Language detection and tagging
- Language-specific chunking strategies
- Cross-language search support
- Cultural context preservation
Example Scenarios:
- International customer support
- Multi-language documentation
- Global product catalogs
Advanced Topics¶
Custom Chunking Strategies¶
While the pipeline provides Token, Markdown, and Paragraph chunkers, you can implement custom chunking strategies:
When to Create Custom Chunkers:
- Domain-specific content structure
- Specialized formatting requirements
- Performance optimizations
- Unique metadata requirements
Implementation Considerations:
- Implement the `IChunker` interface
- Ensure deterministic chunking
- Preserve relevant metadata
- Handle edge cases gracefully
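The sketch below assumes a minimal `IChunker` shape for illustration (the pipeline's actual interface definition may differ); the point is the deterministic, structure-respecting split.

```csharp
using System;
using System.Collections.Generic;

// Assumed interface shape for illustration only.
public interface IChunker
{
    IEnumerable<string> Chunk(string content);
}

// Example custom chunker: split on sentence boundaries for domain content
// where sentences must never be broken apart. Deterministic by construction.
public sealed class SentenceChunker : IChunker
{
    public IEnumerable<string> Chunk(string content)
    {
        foreach (var sentence in content.Split('.', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries))
            yield return sentence + ".";
    }
}
```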
Metadata Strategy¶
Best Practices:
- Include source identifiers for traceability
- Add timestamps for version tracking
- Store filtering attributes (category, type, status)
- Preserve relationships (parent-child, references)
- Keep metadata lightweight (avoid large objects)
Metadata Patterns:
- Source Tracking: `source_id`, `source_type`, `source_version`
- Temporal: `created_at`, `updated_at`, `expires_at`
- Categorization: `category`, `tags`, `classification`
- Relationships: `parent_id`, `related_ids`, `references`
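Combined into a single lightweight shape, these patterns might look like the following illustrative class.

```csharp
using System;
using System.Collections.Generic;

// Illustrative metadata shape combining the patterns above; keep it small.
public sealed class ChunkMetadata
{
    // Source tracking
    public string SourceId { get; set; } = string.Empty;
    public string SourceType { get; set; } = string.Empty;
    public string? SourceVersion { get; set; }

    // Temporal
    public DateTimeOffset CreatedAt { get; set; }
    public DateTimeOffset? UpdatedAt { get; set; }

    // Categorization
    public string? Category { get; set; }
    public IList<string> Tags { get; set; } = new List<string>();

    // Relationships
    public string? ParentId { get; set; }
}
```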
Performance Optimization¶
Embedding Generation:
- Batch embedding requests when possible
- Cache embeddings for identical content
- Use appropriate embedding models for your use case
- Monitor embedding API rate limits
Vector Store Operations:
- Use batch upserts for efficiency
- Optimize collection indexes
- Consider vector dimensions and their impact
- Monitor storage growth and cleanup
Chunking Optimization:
- Choose chunk size based on embedding model limits
- Balance chunk count vs. context preservation
- Consider overlap requirements
- Profile chunking performance for large documents
Monitoring and Observability¶
Key Metrics to Track:
- Chunks processed per second
- Embedding generation latency
- Vector store upsert latency
- Deduplication hit rate
- Error rates by type
- Batch processing efficiency
Logging Best Practices:
- Log ingestion start/completion with timing
- Log chunk counts and statistics
- Log errors with context (chunk ID, source ID)
- Use structured logging for analysis
- Include correlation IDs for tracing
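A sketch of these practices using `Microsoft.Extensions.Logging`; message templates keep fields queryable, and the run ID serves as a correlation ID across stages.

```csharp
using System;
using Microsoft.Extensions.Logging;

public static class IngestionLog
{
    // Structured completion log: counts and timing as named template fields.
    public static void BatchCompleted(
        ILogger logger, string runId, int batchNumber, int chunkCount, TimeSpan elapsed) =>
        logger.LogInformation(
            "Ingestion run {RunId}: batch {BatchNumber} indexed {ChunkCount} chunks in {ElapsedMs} ms",
            runId, batchNumber, chunkCount, elapsed.TotalMilliseconds);

    // Error log with context: chunk ID and source ID for traceability.
    public static void ChunkFailed(
        ILogger logger, Exception ex, string runId, string sourceId, string chunkId) =>
        logger.LogError(ex,
            "Ingestion run {RunId}: chunk {ChunkId} from source {SourceId} failed",
            runId, chunkId, sourceId);
}
```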
Health Checks:
- Monitor vector store connectivity
- Verify embedding generator availability
- Check collection existence and health
- Track ingestion queue depth (if applicable)
Security Considerations¶
Data Privacy:
- Sanitize sensitive information before ingestion
- Use field-level encryption for sensitive metadata
- Implement access controls on collections
- Audit ingestion operations
Authentication:
- Secure embedding generator API keys
- Use managed identities where possible
- Rotate credentials regularly
- Monitor for unauthorized access
Compliance:
- Track data lineage through metadata
- Implement data retention policies
- Support data deletion requests
- Maintain audit logs
Scaling Strategies¶
Horizontal Scaling:
- Distribute ingestion across multiple instances
- Use message queues for ingestion requests
- Implement idempotent ingestion operations
- Coordinate checkpointing across instances
Vertical Scaling:
- Increase batch sizes for higher throughput
- Optimize memory usage for large batches
- Use async/await for I/O operations
- Profile and optimize hot paths
Resource Management:
- Monitor memory usage during batch processing
- Implement backpressure for high-load scenarios
- Use connection pooling for vector stores
- Implement circuit breakers for external services
Integration Patterns¶
Event-Driven Ingestion:
- Subscribe to content change events
- Process events asynchronously
- Handle event ordering and idempotency
- Implement dead-letter queues for failures
Scheduled Ingestion:
- Use cron jobs or schedulers for periodic ingestion
- Implement incremental update detection
- Track last ingestion timestamps
- Handle missed schedules gracefully
API-Driven Ingestion:
- Expose ingestion endpoints
- Validate input before processing
- Return ingestion status and results
- Support bulk ingestion requests
Troubleshooting Common Issues¶
Issue: Low Deduplication Hit Rate
- Cause: Chunk IDs not deterministic or cache cleared
- Solution: Verify hash computation, implement persistent cache
Issue: High Embedding API Costs
- Cause: Re-generating embeddings for duplicate content
- Solution: Improve deduplication, cache embeddings
Issue: Slow Ingestion Performance
- Cause: Small batch sizes, high throttling, network latency
- Solution: Increase batch size, reduce throttling, optimize network
Issue: Memory Pressure
- Cause: Large batch sizes, too many concurrent operations
- Solution: Reduce batch size, limit concurrency, implement streaming
Issue: Inconsistent Chunk Counts
- Cause: Non-deterministic chunking, configuration changes
- Solution: Verify chunker determinism, version configuration