Index Lifecycle Management

A vector index is not static. Documents are added, updated, and deleted over time. Index lifecycle management covers incremental updates, stale chunk detection, and when and how to re-index.

Author: Benedict Thekkel

1. The Three Operations

Every index lifecycle strategy reduces to three operations:

| Operation | Trigger | Action |
|---|---|---|
| Add | New document ingested | Chunk → embed → upsert chunks into index |
| Update | Existing document changed | Delete old chunks by doc ID → re-chunk → embed → insert |
| Delete | Document removed or expired | Delete all chunks associated with that doc ID |

Key design requirement: each chunk must store its parent document ID as metadata so you can efficiently delete/update all chunks for a given document.

# Metadata on each chunk
{
  "chunk_id": "doc42_chunk3",
  "doc_id": "doc42",
  "source_url": "https://...",
  "last_modified": "2024-11-01T10:00:00Z",
  "checksum": "sha256:abc123..."
}
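
To illustrate why the parent document ID matters, here is a minimal in-memory sketch. `InMemoryChunkIndex` and its methods are hypothetical names, not part of any vector database client. It keeps a secondary lookup from `doc_id` to chunk IDs, so removing a document touches only that document's chunks instead of scanning the whole index.

```python
from collections import defaultdict


class InMemoryChunkIndex:
    """Toy stand-in for a vector index (hypothetical, for illustration only)."""

    def __init__(self):
        self.chunks = {}                 # chunk_id -> metadata dict
        self.by_doc = defaultdict(set)   # doc_id -> set of chunk_ids

    def upsert(self, metadata: dict) -> None:
        chunk_id, doc_id = metadata["chunk_id"], metadata["doc_id"]
        # If the chunk already exists, drop its old doc_id mapping first
        old = self.chunks.get(chunk_id)
        if old is not None:
            self.by_doc[old["doc_id"]].discard(chunk_id)
        self.chunks[chunk_id] = metadata
        self.by_doc[doc_id].add(chunk_id)

    def delete_document(self, doc_id: str) -> int:
        """Remove every chunk of one document; return how many were removed."""
        chunk_ids = self.by_doc.pop(doc_id, set())
        for chunk_id in chunk_ids:
            del self.chunks[chunk_id]
        return len(chunk_ids)
```

Real vector databases usually expose the same idea as a metadata filter on delete; whether that filter is backed by an index depends on the database.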

2. Incremental Ingestion Pipeline

A full re-index on every change is expensive. Incremental pipelines process only changed documents.

Pattern: change detection via checksum or last_modified

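import hashlib

# metadata_store, vector_index, chunk, and embed_batch are placeholders for
# your own metadata store, vector DB client, chunker, and embedding call.

def sync_document(doc_id: str, content: str, last_modified: str) -> None:
    existing = metadata_store.get(doc_id)

    # Hash the raw content so unchanged documents can be skipped cheaply
    new_checksum = "sha256:" + hashlib.sha256(content.encode("utf-8")).hexdigest()

    if existing and existing["checksum"] == new_checksum:
        return  # No change: skip

    # Delete old chunks (this also clears leftovers if the document shrank)
    vector_index.delete(filter={"doc_id": doc_id})

    # Re-chunk, embed, and insert
    chunks = chunk(content)
    embeddings = embed_batch(chunks)
    vector_index.upsert([
        {
            "id": f"{doc_id}_chunk{i}",
            "vector": emb,
            # Remaining chunk metadata fields are elided here, as in the original
            "metadata": {"doc_id": doc_id, "last_modified": last_modified, "checksum": new_checksum},
        }
        for i, emb in enumerate(embeddings)
    ])

    # Update metadata store
    metadata_store.set(doc_id, {"checksum": new_checksum, "last_modified": last_modified})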

Source connectors with built-in change detection: Confluence, Notion, SharePoint, S3 event notifications, database CDC (Debezium).


3. Stale Chunk Detection

Chunks become stale when the source document changes but the index hasn’t been updated.

Detection strategies:

| Strategy | How | Latency |
|---|---|---|
| Periodic crawl | Scheduler re-fetches all source URLs, compares checksums | Hours to days |
| Webhook / event-driven | Source system pushes change notifications | Near real-time |
| TTL-based | Chunks expire after N days and must be re-ingested | Bounded: staleness of at most N days |
| Retrieval-time validation | On retrieval, fetch source and verify chunk still exists | Real-time, but adds latency to every query |

Best practice: combine event-driven for high-change sources with periodic crawl as a safety net for missed events.
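
As a rough sketch of the periodic-crawl safety net, the function below compares stored checksums against freshly fetched content. `find_stale_docs` and the `fetch_source` callback are assumed names for illustration; `fetch_source` is expected to return the current content for a doc ID, or `None` if the source no longer exists.

```python
import hashlib
from typing import Callable, Dict, List, Optional


def checksum(content: str) -> str:
    """Content hash in the same 'sha256:...' form used in the chunk metadata."""
    return "sha256:" + hashlib.sha256(content.encode("utf-8")).hexdigest()


def find_stale_docs(
    indexed_checksums: Dict[str, str],
    fetch_source: Callable[[str], Optional[str]],
) -> Dict[str, List[str]]:
    """Classify indexed documents as changed or deleted at the source."""
    changed: List[str] = []
    deleted: List[str] = []
    for doc_id, old_checksum in indexed_checksums.items():
        content = fetch_source(doc_id)
        if content is None:
            deleted.append(doc_id)      # source is gone: delete its chunks
        elif checksum(content) != old_checksum:
            changed.append(doc_id)      # source differs: re-ingest
    return {"changed": changed, "deleted": deleted}
```

Documents in `changed` would go back through the incremental pipeline, and those in `deleted` would have their chunks removed by doc ID.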


4. Re-Indexing Triggers

A full re-index (re-embed and re-insert all documents) is expensive but sometimes unavoidable.

When to trigger a full re-index:

| Trigger | Reason |
|---|---|
| Embedding model upgrade | New model produces incompatible vector space |
| Chunking strategy change | Old chunks don’t reflect new boundaries |
| Metadata schema change | Old chunks are missing new required fields |
| Vector DB migration | Moving to a different database or collection |
| Corpus quality audit | Found widespread errors (broken PDFs, encoding issues) |

Zero-downtime re-index pattern:

1. Create a new shadow index (index_v2)
2. Ingest all documents into index_v2 while index_v1 continues to serve traffic
3. Once ingestion completes, run evaluation to confirm new index quality ≥ old
4. Atomic alias swap: alias "production" → index_v2
5. Keep index_v1 for a rollback window, then delete it
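
The five steps above could be sketched as follows, using a toy in-memory registry rather than a real vector database. `IndexRegistry`, `reindex`, and `passes_evaluation` are hypothetical names; in a real system the alias swap would be whatever atomic alias operation your database provides.

```python
from typing import Callable, Dict, Optional


class IndexRegistry:
    """Toy registry of named indexes plus aliases (illustration only)."""

    def __init__(self):
        self.indexes: Dict[str, Dict[str, str]] = {}  # index name -> {doc_id: content}
        self.aliases: Dict[str, str] = {}             # alias -> index name

    def create_index(self, name: str) -> None:
        if name in self.indexes:
            raise ValueError(f"index {name!r} already exists")
        self.indexes[name] = {}

    def point_alias(self, alias: str, name: str) -> None:
        if name not in self.indexes:
            raise KeyError(name)
        self.aliases[alias] = name  # one assignment stands in for the atomic swap

    def resolve(self, alias: str) -> Dict[str, str]:
        return self.indexes[self.aliases[alias]]

    def drop_index(self, name: str) -> None:
        if name in self.aliases.values():
            raise ValueError(f"index {name!r} is still aliased")
        del self.indexes[name]


def reindex(
    registry: IndexRegistry,
    alias: str,
    new_name: str,
    documents: Dict[str, str],
    passes_evaluation: Callable[[Dict[str, str]], bool],
) -> Optional[str]:
    """Build a shadow index, evaluate it, then swap the alias. Returns the serving index."""
    old_name = registry.aliases.get(alias)
    registry.create_index(new_name)                   # step 1
    shadow = registry.indexes[new_name]
    for doc_id, content in documents.items():         # step 2 (stands in for chunk → embed → insert)
        shadow[doc_id] = content
    if not passes_evaluation(shadow):                 # step 3
        registry.drop_index(new_name)
        return old_name                               # old index keeps serving
    registry.point_alias(alias, new_name)             # step 4
    return new_name                                   # step 5 (deleting the old index) is left to the caller
```

Leaving the old index in place after the swap is deliberate: rollback is then just pointing the alias back.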

5. Index Versioning

Treat your index like a software artifact: version it so you can roll back.

Versioning approaches:

| Approach | Description |
|---|---|
| Aliased collections | Some vector databases (e.g., Qdrant, Milvus) support named collection aliases; point the alias at a versioned collection. Check whether yours does. |
| Namespace versioning | Use namespaces (e.g., corpus_v3) and route queries to the correct one |
| Snapshot backup | Periodic export of index state to S3 for disaster recovery |

Metadata to track per version:

{
  "index_version": "v3",
  "embedding_model": "text-embedding-3-large",
  "chunk_strategy": "recursive_512_64",
  "created_at": "2024-11-01",
  "doc_count": 12450,
  "chunk_count": 89320
}
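
One use for this record is deciding whether a configuration change forces a full re-index. The sketch below is an assumption about how that check might look: `IndexManifest` mirrors the fields above, and `reindex_reasons` is a hypothetical helper that compares the stored version against the configuration you intend to run.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IndexManifest:
    """Per-version metadata, mirroring the fields tracked above."""
    index_version: str
    embedding_model: str
    chunk_strategy: str
    created_at: str
    doc_count: int
    chunk_count: int


def reindex_reasons(current: IndexManifest, embedding_model: str, chunk_strategy: str) -> List[str]:
    """Return the manifest fields that differ from the desired configuration.

    An empty list means the existing index is compatible and incremental
    updates are enough; any entry means a full re-index is needed.
    """
    reasons: List[str] = []
    if current.embedding_model != embedding_model:
        reasons.append("embedding_model")   # vectors would live in a different space
    if current.chunk_strategy != chunk_strategy:
        reasons.append("chunk_strategy")    # chunk boundaries would no longer match
    return reasons
```

The same comparison is worth running at query time: embedding a query with a model other than the one recorded in the manifest would silently degrade retrieval.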

6. Deletion and Retention Policies

Uncontrolled growth bloats index size and degrades search performance.

Retention policy types:

| Policy | Use Case |
|---|---|
| TTL expiry | News, support tickets: auto-delete after N days |
| Source-driven | Delete when source document is deleted (event or crawl) |
| Hard delete | GDPR / right to erasure: delete by doc ID immediately |
| Soft delete | Mark as deleted in metadata; exclude via filter; physically purge in batch |
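
The soft-delete policy can be sketched with plain dictionaries, assuming a `deleted_at` metadata field (a name chosen here for illustration). Chunks are first marked, then filtered out at query time, and finally purged in batch once a grace period has passed.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def soft_delete(chunks: Dict[str, dict], doc_id: str, now: datetime) -> int:
    """Mark every live chunk of a document as deleted; return how many were marked."""
    marked = 0
    for meta in chunks.values():
        if meta["doc_id"] == doc_id and meta.get("deleted_at") is None:
            meta["deleted_at"] = now
            marked += 1
    return marked


def visible_chunk_ids(chunks: Dict[str, dict]) -> List[str]:
    """The query-time filter: only chunks not marked as deleted."""
    return [cid for cid, meta in chunks.items() if meta.get("deleted_at") is None]


def purge_soft_deleted(chunks: Dict[str, dict], now: datetime, grace: timedelta) -> int:
    """Physically remove chunks marked at least `grace` ago; return how many were removed."""
    doomed = [
        cid for cid, meta in chunks.items()
        if meta.get("deleted_at") is not None and now - meta["deleted_at"] >= grace
    ]
    for cid in doomed:
        del chunks[cid]
    return len(doomed)
```

A grace period makes accidental deletions recoverable, which is why soft delete does not fit erasure requests: those need the hard-delete path.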

GDPR compliance: vector databases store embeddings, which may be considered personal data if the source is personal. Ensure hard delete is possible by doc ID and test it.

Index compaction: Many vector DBs mark deleted vectors as tombstones instead of removing them immediately, so deleted data keeps occupying space. Periodically trigger compaction or vacuum to reclaim it.


Summary

Document change detected
        │
        ▼
   Has checksum changed? ──NO──→ Skip
        │YES
        ▼
 Delete old chunks (by doc_id)
        │
        ▼
 Chunk → Embed → Insert new chunks
        │
        ▼
 Update metadata store (checksum, last_modified)

Rule of thumb: Store doc ID on every chunk. Without it, update and delete operations become full index scans.
