Vector Ingestion

Foundational documentation for vector data ingestion patterns and best practices.

Official references

In the ConnectSoft Base Template, ConnectSoft.Extensions.DataIngestion registers the Microsoft chunkers and metrics. VectorIngestionHost composes an IngestionPipeline with MarkdownReader and VectorStoreWriter against your configured VectorStore.

ChunkerType to Microsoft chunker mapping (Base Template library)

ConnectSoft.Extensions.DataIngestion maps the configuration enum ChunkerType to Microsoft IngestionChunker<string> implementations (see DataIngestionExtensions):

| ChunkerType | Microsoft implementation | Notes |
| ----------- | ------------------------ | ----- |
| Token | HeaderChunker | Header-aware splitting with token limits |
| Markdown | DocumentTokenChunker | Tokenized chunks; suitable for markdown-heavy content |
| Paragraph | HeaderChunker | Same as Token mapping in current registration |

SectionChunker is available in the Microsoft SDK alongside the above; extend registration if you need section-per-entity behavior.

Concepts

Chunking

Chunking is the process of breaking down large content into smaller, manageable pieces suitable for embedding and vector search.

Why Chunk?

  • Embedding models have token limits
  • Smaller chunks improve search precision
  • Enables granular retrieval of relevant content

Chunking Strategies:

  • Token-based: Split by estimated token count
  • Structure-aware: Split by document structure (headers, paragraphs)
  • Semantic: Split by semantic boundaries (future enhancement)

Embedding Generation

Embeddings are numeric vector representations of text that capture semantic meaning.

Key Considerations:

  • Embedding model selection affects quality
  • Vector dimensions vary by model
  • Batch processing improves throughput

Indexing

Indexing stores embeddings in vector stores optimized for similarity search.

Indexing Operations:

  • Upsert: Insert or update records
  • Batch Upsert: Process multiple records efficiently
  • Idempotent Upsert: Safe to retry without duplicates

Architecture Patterns

Provider-Agnostic Design

All ingestion logic uses abstractions from Microsoft.Extensions.AI and Microsoft.Extensions.VectorData:

  • Embeddings: IEmbeddingGenerator<string, Embedding<float>>
  • Vector Stores: VectorStore and VectorStoreCollection<TKey, TRecord>

This ensures the same code works across all providers without modification.
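As a minimal sketch of coding against these abstractions (ChunkRecord and the "docs" collection name are hypothetical; the record attributes required by Microsoft.Extensions.VectorData are omitted, and method names follow recent package versions and may differ in yours):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.VectorData;

// Hypothetical record type; real records carry vector-store attributes
// such as [VectorStoreKey] and [VectorStoreVector], omitted here.
public sealed class ChunkRecord
{
    public string Id { get; set; } = "";
    public string Text { get; set; } = "";
    public ReadOnlyMemory<float> Vector { get; set; }
}

public static class IngestionSketch
{
    public static async Task IngestAsync(
        IEmbeddingGenerator<string, Embedding<float>> generator,
        VectorStore store,
        IReadOnlyList<ChunkRecord> chunks)
    {
        var collection = store.GetCollection<string, ChunkRecord>("docs");
        await collection.EnsureCollectionExistsAsync();

        // One batched embedding call for all chunk texts.
        var embeddings = await generator.GenerateAsync(chunks.Select(c => c.Text).ToList());
        for (var i = 0; i < chunks.Count; i++)
            chunks[i].Vector = embeddings[i].Vector;

        await collection.UpsertAsync(chunks);
    }
}
```

Because only the abstractions appear in the signature, the same method works whether the VectorStore is backed by Qdrant, pgvector, or any other provider.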

Pipeline Pattern

Input → Chunking → Deduplication → Embedding → Indexing → Output

Each stage is independent and can be optimized separately.
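The stage independence can be illustrated with stubs (all stage bodies here are hypothetical placeholders, not the Base Template's API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Each stage is a separate function that can be tuned or replaced
// independently; swap any stub without touching the others.
static IEnumerable<string> Chunk(string input) =>
    input.Split("\n\n", StringSplitOptions.RemoveEmptyEntries);

static IEnumerable<string> Deduplicate(IEnumerable<string> chunks) =>
    chunks.Distinct(); // real code keys on content hashes

static IEnumerable<float[]> Embed(IEnumerable<string> chunks) =>
    chunks.Select(_ => new float[384]); // placeholder vectors

static int Index(IEnumerable<float[]> vectors) =>
    vectors.Count(); // real code upserts into a vector store

// Input → Chunking → Deduplication → Embedding → Indexing
var indexed = Index(Embed(Deduplicate(Chunk("intro\n\nbody"))));
```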

Batch Processing Pattern

Process chunks in configurable batches:

  • Improves throughput
  • Reduces memory pressure
  • Enables progress tracking

Retry Pattern

Exponential backoff retry for transient failures:

  • Network errors
  • Rate limits
  • Temporary service unavailability

Best Practices

Chunking Strategies

Choose the Right Chunker:

  • Token Chunker: General-purpose, works for most content
  • Markdown Chunker: Documentation, structured content
  • Paragraph Chunker: Long-form content, articles

Chunk Size Guidelines:

  • Small (256 tokens): High precision, many chunks
  • Medium (512 tokens): Balanced precision and context
  • Large (1024 tokens): More context, fewer chunks

Overlap Considerations:

  • 10-20% overlap preserves context across boundaries
  • Too much overlap wastes resources
  • Too little overlap loses context

Batch Sizing

Guidelines:

  • Start with 64 chunks per batch
  • Increase for high-throughput scenarios
  • Decrease if memory is constrained
  • Monitor embedding generator rate limits

Throttling:

  • Add delays between batches to respect rate limits
  • Adjust based on provider capabilities
  • Monitor for rate limit errors
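The batch-plus-throttle loop can be sketched as follows (EmbedAndIndexAsync and the 250 ms delay are illustrative; tune both to your provider):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

string[] allChunks = Enumerable.Range(0, 200).Select(i => $"chunk {i}").ToArray();
const int BatchSize = 64; // starting point from the guidelines above
var throttle = TimeSpan.FromMilliseconds(250); // tune per provider rate limits

foreach (var batch in allChunks.Chunk(BatchSize)) // Enumerable.Chunk, .NET 6+
{
    await EmbedAndIndexAsync(batch); // hypothetical stage stub
    await Task.Delay(throttle);      // back off between batches
}

Task EmbedAndIndexAsync(string[] batch) => Task.CompletedTask;
```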

Retry Policies

Transient Failures (Retry):

  • Network timeouts
  • HTTP 503 (Service Unavailable)
  • HTTP 502 (Bad Gateway)
  • Rate limit errors (429)

Permanent Failures (Don't Retry):

  • Authentication errors (401, 403)
  • Invalid content (400)
  • Not found errors (404)

Backoff Strategy:

  • Start with 1 second delay
  • Double delay on each retry
  • Cap at maximum delay (e.g., 30 seconds)
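The backoff strategy above (1 s initial delay, doubled per retry, capped at 30 s) can be sketched like this; real code should catch only transient exceptions (timeouts, 429/502/503), not all exceptions as this sketch does:

```csharp
using System;
using System.Threading.Tasks;

static async Task RetryAsync(Func<Task> operation, int maxAttempts = 5)
{
    TimeSpan delay = TimeSpan.FromSeconds(1);  // start with 1 second
    TimeSpan cap = TimeSpan.FromSeconds(30);   // maximum delay

    for (int attempt = 1; ; attempt++)
    {
        try
        {
            await operation();
            return;
        }
        catch (Exception) when (attempt < maxAttempts)
        {
            await Task.Delay(delay);
            // Double the delay on each retry, up to the cap.
            delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, cap.Ticks));
        }
    }
}

// Usage: wrap an embedding or upsert call.
await RetryAsync(() => Task.CompletedTask);
```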

Provider Support

Qdrant

Characteristics:

  • High performance
  • Low latency
  • Good for large-scale deployments

Considerations:

  • Supports gRPC and REST APIs
  • Collection management required
  • Good batch upsert performance

Azure AI Search

Characteristics:

  • Managed service
  • Integrated with Azure ecosystem
  • Index lag may occur

Considerations:

  • Index updates may take time to propagate
  • Rate limits apply
  • Good for enterprise scenarios

pgvector (PostgreSQL)

Characteristics:

  • SQL-based
  • ACID transactions
  • Familiar tooling

Considerations:

  • Requires PostgreSQL with pgvector extension
  • Good for applications already using PostgreSQL
  • Transaction support for consistency

SQL Server

Characteristics:

  • Native SQL Server integration
  • Vector support in SQL Server 2022+
  • Enterprise-grade reliability

Considerations:

  • Requires SQL Server with vector support
  • Good for Microsoft ecosystem
  • Transaction support available

Performance Considerations

Throughput Optimization

Factors Affecting Throughput:

  1. Embedding Generator Latency: Primary bottleneck
  2. Vector Store Performance: Varies by provider
  3. Network Latency: Between services
  4. Batch Size: Larger batches = higher throughput (up to limits)

Optimization Strategies:

  • Increase batch size (within memory limits)
  • Reduce throttle delays (respect rate limits)
  • Use faster embedding generators
  • Choose high-performance vector stores

Latency Reduction

Strategies:

  • Smaller batch sizes for faster feedback
  • Parallel processing for multiple sources
  • Optimize network paths
  • Use local/regional services

Resource Management

Memory:

  • Batch size directly affects memory usage
  • Monitor memory during large ingestions
  • Adjust batch size based on available memory

CPU:

  • Chunking is CPU-intensive
  • Consider parallel chunking for large files
  • Monitor CPU usage during ingestion

Error Handling

Transient vs Permanent Failures

Transient Failures:

  • Network timeouts
  • Service unavailability
  • Rate limits
  • Action: Retry with exponential backoff

Permanent Failures:

  • Invalid content
  • Authentication errors
  • Configuration errors
  • Action: Log error and skip

Error Recovery

Checkpointing:

  • Save progress at batch boundaries
  • Resume from last successful batch
  • Track failed chunks for retry

Partial Success:

  • Continue processing after individual chunk failures
  • Collect errors for review
  • Report success/failure statistics

Deduplication

Hash-Based Identification

Hash Computation:

  • SHA256 hash of chunk text + source metadata
  • Deterministic: same content = same hash
  • Fast computation

Chunk ID Format:

{sourceId}:{chunkIndex}:{hash}

Benefits:

  • Prevents duplicate embeddings
  • Enables idempotent re-ingestion
  • Reduces storage costs
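A deterministic ID in the {sourceId}:{chunkIndex}:{hash} format can be computed as in this sketch (the separator and hex encoding are illustrative choices, not a Base Template contract):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// SHA256 over chunk text plus source metadata: identical content always
// yields the same ID, which is what makes re-ingestion idempotent.
static string ComputeChunkId(string sourceId, int chunkIndex, string chunkText)
{
    var bytes = Encoding.UTF8.GetBytes($"{sourceId}\n{chunkText}");
    var hash = Convert.ToHexString(SHA256.HashData(bytes)); // .NET 5+
    return $"{sourceId}:{chunkIndex}:{hash}";
}

var id = ComputeChunkId("docs/readme.md", 0, "Some chunk text.");
```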

Deduplication Strategy

Before Embedding:

  • Check if chunk exists by ID
  • Skip embedding generation if found
  • Reduces API calls and costs

Implementation:

// 'chunkExists' is the result of a lookup by deterministic chunk ID
// (e.g. a get on the collection) inside the batch loop.
if (chunkExists)
{
    // Skip embedding generation for already-indexed content.
    chunksSkipped++;
    continue;
}

Resumability

Checkpoint Pattern

Checkpoint Data:

  • Run ID
  • Processed source IDs
  • Chunk status (processed, indexed, failed)
  • Timestamp

Checkpoint Storage:

  • File system (development)
  • Database (production)
  • Distributed cache (scaled scenarios)

Resume Strategy

On Resume:

  1. Load checkpoint
  2. Identify processed sources
  3. Skip already-processed content
  4. Continue from last checkpoint

Benefits:

  • No lost progress on failures
  • Supports long-running ingestions
  • Enables incremental updates
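A minimal file-backed checkpoint (the development-tier storage option above) might look like this sketch; the field names mirror the checkpoint data listed earlier and are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

public sealed record Checkpoint(
    string RunId,
    HashSet<string> ProcessedSourceIds,
    Dictionary<string, string> ChunkStatus, // chunkId -> processed/indexed/failed
    DateTimeOffset Timestamp);

public static class CheckpointStore
{
    public static void Save(string path, Checkpoint cp) =>
        File.WriteAllText(path, JsonSerializer.Serialize(cp));

    // Returns null when no checkpoint exists yet (fresh run).
    public static Checkpoint? Load(string path) =>
        File.Exists(path)
            ? JsonSerializer.Deserialize<Checkpoint>(File.ReadAllText(path))
            : null;
}
```

On resume, load the checkpoint, skip sources already in ProcessedSourceIds, and re-queue chunks whose status is "failed".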

Implementation Considerations

Checkpoint Frequency:

  • After each batch (recommended)
  • After each source (for large sources)
  • Periodic (for very long runs)

Checkpoint Cleanup:

  • Remove old checkpoints
  • Archive completed runs
  • Monitor checkpoint storage size

Use Case Patterns

Pattern 1: Batch Ingestion

When to Use:

  • Large-scale data migration
  • Initial knowledge base population
  • Periodic bulk updates

Characteristics:

  • High batch sizes (64-128 chunks)
  • Throttling between batches
  • Progress tracking and checkpointing
  • Error collection and reporting

Example Scenarios:

  • Migrating legacy documentation to vector search
  • Indexing historical product catalogs
  • Populating knowledge bases from archives

Pattern 2: Real-Time Ingestion

When to Use:

  • Live content updates
  • Event-driven systems
  • Time-sensitive information

Characteristics:

  • Small batch sizes (8-16 chunks)
  • Minimal throttling
  • Low-latency processing
  • Immediate searchability

Example Scenarios:

  • News article indexing
  • Social media content ingestion
  • Real-time chat message indexing

Pattern 3: Incremental Updates

When to Use:

  • Content that changes frequently
  • Version-controlled documentation
  • Collaborative content

Characteristics:

  • Deduplication enabled
  • Hash-based change detection
  • Selective re-indexing
  • Version tracking in metadata

Example Scenarios:

  • Documentation updates
  • Wiki page changes
  • Product description updates

Pattern 4: Multi-Source Aggregation

When to Use:

  • Combining data from multiple sources
  • Unified search across systems
  • Data integration scenarios

Characteristics:

  • Multiple source identifiers
  • Rich metadata for filtering
  • Source-specific chunking strategies
  • Unified collection or source-specific collections

Example Scenarios:

  • Aggregating support tickets from multiple systems
  • Combining documentation from multiple repositories
  • Unified search across multiple knowledge bases

Pattern 5: Structured Document Processing

When to Use:

  • Legal documents
  • Technical documentation
  • Academic papers
  • Contracts and agreements

Characteristics:

  • Structure-aware chunking (Markdown, Paragraph)
  • Preserve document hierarchy
  • Metadata-rich indexing
  • Citation and reference tracking

Example Scenarios:

  • Legal case file indexing
  • Research paper databases
  • Technical specification indexing

Pattern 6: Multi-Language Content

When to Use:

  • International applications
  • Global knowledge bases
  • Multi-language support systems

Characteristics:

  • Language detection and tagging
  • Language-specific chunking strategies
  • Cross-language search support
  • Cultural context preservation

Example Scenarios:

  • International customer support
  • Multi-language documentation
  • Global product catalogs

Advanced Topics

Custom Chunking Strategies

While the pipeline provides Token, Markdown, and Paragraph chunkers, you can implement custom chunking strategies:

When to Create Custom Chunkers:

  • Domain-specific content structure
  • Specialized formatting requirements
  • Performance optimizations
  • Unique metadata requirements

Implementation Considerations:

  • Implement IChunker interface
  • Ensure deterministic chunking
  • Preserve relevant metadata
  • Handle edge cases gracefully
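A custom chunker might be sketched as below. The interface shape here is illustrative only; match it to the chunker abstraction your pipeline actually uses (e.g. Microsoft's IngestionChunker<string> in the Base Template):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface IChunker
{
    IEnumerable<string> Chunk(string content);
}

// Example: split on a domain-specific "---" delimiter. The split is
// deterministic, so re-running ingestion yields identical chunks.
public sealed class SectionDelimiterChunker : IChunker
{
    public IEnumerable<string> Chunk(string content) =>
        content.Split("---", StringSplitOptions.RemoveEmptyEntries
                           | StringSplitOptions.TrimEntries)
               .Where(s => s.Length > 0); // edge case: delimiter-only input
}
```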

Metadata Strategy

Best Practices:

  • Include source identifiers for traceability
  • Add timestamps for version tracking
  • Store filtering attributes (category, type, status)
  • Preserve relationships (parent-child, references)
  • Keep metadata lightweight (avoid large objects)

Metadata Patterns:

  • Source Tracking: source_id, source_type, source_version
  • Temporal: created_at, updated_at, expires_at
  • Categorization: category, tags, classification
  • Relationships: parent_id, related_ids, references
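The patterns above could be captured in a lightweight metadata type such as this sketch (all names are illustrative, not a Base Template contract):

```csharp
using System;

// Keep metadata small: scalar fields and short arrays, no large objects.
public sealed record ChunkMetadata(
    string SourceId,            // source tracking
    string SourceType,
    DateTimeOffset CreatedAt,   // temporal
    DateTimeOffset UpdatedAt,
    string Category,            // categorization
    string[] Tags,
    string? ParentId);          // relationships
```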

Performance Optimization

Embedding Generation:

  • Batch embedding requests when possible
  • Cache embeddings for identical content
  • Use appropriate embedding models for your use case
  • Monitor embedding API rate limits

Vector Store Operations:

  • Use batch upserts for efficiency
  • Optimize collection indexes
  • Consider vector dimensions and their impact
  • Monitor storage growth and cleanup

Chunking Optimization:

  • Choose chunk size based on embedding model limits
  • Balance chunk count vs. context preservation
  • Consider overlap requirements
  • Profile chunking performance for large documents

Monitoring and Observability

Key Metrics to Track:

  • Chunks processed per second
  • Embedding generation latency
  • Vector store upsert latency
  • Deduplication hit rate
  • Error rates by type
  • Batch processing efficiency

Logging Best Practices:

  • Log ingestion start/completion with timing
  • Log chunk counts and statistics
  • Log errors with context (chunk ID, source ID)
  • Use structured logging for analysis
  • Include correlation IDs for tracing
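With Microsoft.Extensions.Logging, structured entries along these lines keep the fields queryable in log backends (field names are illustrative):

```csharp
using System;
using Microsoft.Extensions.Logging;

public static class IngestionLogging
{
    // Completion log with timing and chunk statistics; {RunId} doubles
    // as a correlation ID across the run's log entries.
    public static void LogCompletion(
        ILogger logger, string runId, int processed, int skipped, TimeSpan elapsed)
    {
        logger.LogInformation(
            "Ingestion {RunId} completed: {Processed} processed, {Skipped} skipped in {ElapsedMs} ms",
            runId, processed, skipped, elapsed.TotalMilliseconds);
    }

    // Error log with context (chunk ID, source ID) as recommended above.
    public static void LogChunkError(
        ILogger logger, string chunkId, string sourceId, Exception ex) =>
        logger.LogError(ex, "Chunk {ChunkId} from {SourceId} failed", chunkId, sourceId);
}
```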

Health Checks:

  • Monitor vector store connectivity
  • Verify embedding generator availability
  • Check collection existence and health
  • Track ingestion queue depth (if applicable)

Security Considerations

Data Privacy:

  • Sanitize sensitive information before ingestion
  • Use field-level encryption for sensitive metadata
  • Implement access controls on collections
  • Audit ingestion operations

Authentication:

  • Secure embedding generator API keys
  • Use managed identities where possible
  • Rotate credentials regularly
  • Monitor for unauthorized access

Compliance:

  • Track data lineage through metadata
  • Implement data retention policies
  • Support data deletion requests
  • Maintain audit logs

Scaling Strategies

Horizontal Scaling:

  • Distribute ingestion across multiple instances
  • Use message queues for ingestion requests
  • Implement idempotent ingestion operations
  • Coordinate checkpointing across instances

Vertical Scaling:

  • Increase batch sizes for higher throughput
  • Optimize memory usage for large batches
  • Use async/await for I/O operations
  • Profile and optimize hot paths

Resource Management:

  • Monitor memory usage during batch processing
  • Implement backpressure for high-load scenarios
  • Use connection pooling for vector stores
  • Implement circuit breakers for external services

Integration Patterns

Event-Driven Ingestion:

  • Subscribe to content change events
  • Process events asynchronously
  • Handle event ordering and idempotency
  • Implement dead-letter queues for failures

Scheduled Ingestion:

  • Use cron jobs or schedulers for periodic ingestion
  • Implement incremental update detection
  • Track last ingestion timestamps
  • Handle missed schedules gracefully

API-Driven Ingestion:

  • Expose ingestion endpoints
  • Validate input before processing
  • Return ingestion status and results
  • Support bulk ingestion requests

Troubleshooting Common Issues

Issue: Low Deduplication Hit Rate

  • Cause: Chunk IDs not deterministic or cache cleared
  • Solution: Verify hash computation, implement persistent cache

Issue: High Embedding API Costs

  • Cause: Re-generating embeddings for duplicate content
  • Solution: Improve deduplication, cache embeddings

Issue: Slow Ingestion Performance

  • Cause: Small batch sizes, high throttling, network latency
  • Solution: Increase batch size, reduce throttling, optimize network

Issue: Memory Pressure

  • Cause: Large batch sizes, too many concurrent operations
  • Solution: Reduce batch size, limit concurrency, implement streaming

Issue: Inconsistent Chunk Counts

  • Cause: Non-deterministic chunking, configuration changes
  • Solution: Verify chunker determinism, version configuration
