
Index Types & Creation

Skill: databricks-vector-search

You can stand up a vector index backed by a Delta table in under five minutes. The three index types cover the full spectrum: let Databricks handle embeddings automatically, bring your own pre-computed vectors, or manage everything through a real-time CRUD API. Your choice depends on how much control you need over the embedding lifecycle.

“Create a Vector Search endpoint and a Delta Sync index with managed embeddings over my knowledge base table. Use Python and the Databricks SDK.”

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create the compute endpoint first
endpoint = w.vector_search_endpoints.create_endpoint(
    name="docs-search-endpoint",
    endpoint_type="STORAGE_OPTIMIZED",
)

# Create a Delta Sync index with managed embeddings
index = w.vector_search_indexes.create_index(
    name="catalog.schema.docs_index",
    endpoint_name="docs-search-endpoint",
    primary_key="doc_id",
    index_type="DELTA_SYNC",
    delta_sync_index_spec={
        "source_table": "catalog.schema.documents",
        "embedding_source_columns": [{
            "name": "content",
            "embedding_model_endpoint_name": "databricks-gte-large-en",
        }],
        "pipeline_type": "TRIGGERED",
        "columns_to_sync": ["doc_id", "content", "title", "category"],
    },
)

Key decisions:

  • STORAGE_OPTIMIZED endpoint handles 100M+ vectors at lower cost; choose STANDARD when you need sub-100ms latency
  • embedding_source_columns tells Databricks to compute embeddings from your text column automatically
  • TRIGGERED pipeline syncs on demand via API call, keeping costs predictable for batch workflows
  • columns_to_sync limits what gets indexed — only include columns you’ll need in query results
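The spec dict above repeats for every managed-embeddings index you create, so it can help to factor it out. A minimal sketch, assuming a hypothetical build_delta_sync_spec helper (not part of the SDK) that simply assembles the dict passed as delta_sync_index_spec:

```python
def build_delta_sync_spec(
    source_table,
    text_column,
    model_endpoint="databricks-gte-large-en",
    pipeline_type="TRIGGERED",
    columns_to_sync=None,
):
    """Assemble a delta_sync_index_spec dict for managed embeddings."""
    spec = {
        "source_table": source_table,
        "embedding_source_columns": [{
            "name": text_column,
            "embedding_model_endpoint_name": model_endpoint,
        }],
        "pipeline_type": pipeline_type,
    }
    # Only include columns_to_sync when the caller narrows the columns;
    # omitting it falls back to syncing every source column.
    if columns_to_sync is not None:
        spec["columns_to_sync"] = columns_to_sync
    return spec

spec = build_delta_sync_spec(
    "catalog.schema.documents", "content",
    columns_to_sync=["doc_id", "content", "title"],
)
```

The result plugs straight into the delta_sync_index_spec argument of create_index.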

“I already have embeddings from a fine-tuned model. Create a Delta Sync index using my pre-computed vectors in Python.”

index = w.vector_search_indexes.create_index(
    name="catalog.schema.custom_embed_index",
    endpoint_name="docs-search-endpoint",
    primary_key="id",
    index_type="DELTA_SYNC",
    delta_sync_index_spec={
        "source_table": "catalog.schema.embedded_docs",
        "embedding_vector_columns": [{
            "name": "embedding",
            "embedding_dimension": 768,
        }],
        "pipeline_type": "TRIGGERED",
    },
)

Use embedding_vector_columns instead of embedding_source_columns when you pre-compute vectors. The dimension must match what your model actually produces: 1024 for databricks-gte-large-en, 1536 for OpenAI text-embedding-3-small, and so on.
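Since a dimension mismatch shows up only as bad search results, it is worth failing fast before any vectors are written. A minimal guard, using a hypothetical validate_embedding_dim helper and a small lookup table of common model dimensions (both are illustrations, not SDK features):

```python
# Known embedding sizes for a few common models; extend as needed.
MODEL_DIMS = {
    "databricks-gte-large-en": 1024,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def validate_embedding_dim(vectors, expected_dim):
    """Raise early if any pre-computed vector doesn't match the index dimension."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"Vector {i} has dimension {len(vec)}, expected {expected_dim}"
            )

# Passes: both vectors match the declared 768-dim index
validate_embedding_dim([[0.1] * 768, [0.2] * 768], expected_dim=768)
```

Run the check against the embedding column before writing to the source table, using the same embedding_dimension you declared in the index spec.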

“I need an index I can update in real time from my application without a Delta table. Use Python with the Databricks SDK.”

import json

index = w.vector_search_indexes.create_index(
    name="catalog.schema.realtime_index",
    endpoint_name="docs-search-endpoint",
    primary_key="id",
    index_type="DIRECT_ACCESS",
    direct_access_index_spec={
        "embedding_vector_columns": [
            {"name": "embedding", "embedding_dimension": 768},
        ],
        "schema_json": json.dumps({
            "id": "string",
            "text": "string",
            "embedding": "array<float>",
            "category": "string",
        }),
    },
)

# Upsert vectors directly via API
w.vector_search_indexes.upsert_data_vector_index(
    index_name="catalog.schema.realtime_index",
    inputs_json=json.dumps([
        {"id": "doc-001", "text": "ML basics", "embedding": [0.1, 0.2, ...], "category": "ml"},
        {"id": "doc-002", "text": "Deep learning", "embedding": [0.4, 0.5, ...], "category": "dl"},
    ]),
)

Direct Access skips Delta table sync entirely. You manage inserts, updates, and deletes through the API. Use it when your data arrives from an application layer rather than a lakehouse pipeline.
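Because upsert_data_vector_index takes a single JSON string, large Direct Access loads are easier to manage in batches. A sketch with a hypothetical batch_upsert_payloads helper (the batch size of 100 is illustrative, not an API limit):

```python
import json

def batch_upsert_payloads(records, batch_size=100):
    """Split records into JSON payloads, one per upsert_data_vector_index call."""
    return [
        json.dumps(records[i:i + batch_size])
        for i in range(0, len(records), batch_size)
    ]

payloads = batch_upsert_payloads(
    [
        {"id": f"doc-{n:03d}", "text": "...", "embedding": [0.0] * 768}
        for n in range(250)
    ],
    batch_size=100,
)
# 250 records split into payloads of 100, 100, and 50
```

Each payload is then passed as inputs_json in its own upsert_data_vector_index call.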

“My TRIGGERED index needs to pick up new rows I just inserted into the source table.”

w.vector_search_indexes.sync_index(index_name="catalog.schema.docs_index")

For TRIGGERED pipelines, new data sits in the source table until you explicitly call sync_index(). The CONTINUOUS pipeline type auto-syncs on every table change, but costs more due to always-on compute.
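sync_index() returns before the sync actually completes, so batch jobs typically poll until the index reports ready. A generic sketch, assuming a get_status callable you supply yourself (for example, a wrapper around w.vector_search_indexes.get_index that extracts the index's current state; the state names below are illustrative):

```python
import time

def wait_until_ready(get_status, timeout_s=600.0, poll_s=5.0):
    """Poll get_status() until it reports a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("ONLINE", "FAILED"):
            return status
        time.sleep(poll_s)
    raise TimeoutError("index sync did not finish within the timeout")
```

Keeping the status lookup injectable makes the wait loop trivial to test and reusable across indexes.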

Common pitfalls:

  • Embedding dimension mismatch — if your query embeddings are 768-dimensional but the index was created with 1536, queries return garbage results silently. Always confirm the dimension matches your model output.
  • Wrong filter syntax for your endpoint type — Standard endpoints use filters_json (dictionary format), while Storage-Optimized endpoints use filter_string (SQL syntax). Using the wrong one returns an error that doesn’t clearly explain the issue.
  • Forgetting columns_to_sync — without it, every column in the source table gets synced. That increases storage costs and slows indexing, especially on wide tables.
  • Choosing CONTINUOUS by default — it’s tempting for freshness, but the always-on compute adds up fast. Start with TRIGGERED and move to CONTINUOUS only when your use case requires near-real-time updates.
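To keep the filter-syntax distinction in one place, query code can branch on endpoint type when building arguments. A sketch with a hypothetical build_filter helper, following the filters_json / filter_string split described above and assuming simple equality conditions only:

```python
import json

def build_filter(endpoint_type, conditions):
    """Return the query kwarg matching the endpoint type's filter syntax."""
    if endpoint_type == "STANDARD":
        # Standard endpoints take filters as a JSON dictionary.
        return {"filters_json": json.dumps(conditions)}
    # Storage-optimized endpoints take a SQL-style predicate string.
    clause = " AND ".join(f"{col} = '{val}'" for col, val in conditions.items())
    return {"filter_string": clause}

build_filter("STANDARD", {"category": "ml"})
# {'filters_json': '{"category": "ml"}'}
build_filter("STORAGE_OPTIMIZED", {"category": "ml"})
# {'filter_string': "category = 'ml'"}
```

The returned dict can then be spread into the query call with **kwargs, so the calling code never hard-codes either parameter name.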