# Index Types & Creation

Skill: `databricks-vector-search`
## What You Can Build

You can stand up a vector index backed by a Delta table in under five minutes. The three index types cover the full spectrum: let Databricks handle embeddings automatically, bring your own pre-computed vectors, or manage everything through a real-time CRUD API. Your choice depends on how much control you need over the embedding lifecycle.
## In Action

"Create a Vector Search endpoint and a Delta Sync index with managed embeddings over my knowledge base table. Use Python and the Databricks SDK."
```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create the compute endpoint first
endpoint = w.vector_search_endpoints.create_endpoint(
    name="docs-search-endpoint",
    endpoint_type="STORAGE_OPTIMIZED",
)

# Create a Delta Sync index with managed embeddings
index = w.vector_search_indexes.create_index(
    name="catalog.schema.docs_index",
    endpoint_name="docs-search-endpoint",
    primary_key="doc_id",
    index_type="DELTA_SYNC",
    delta_sync_index_spec={
        "source_table": "catalog.schema.documents",
        "embedding_source_columns": [{
            "name": "content",
            "embedding_model_endpoint_name": "databricks-gte-large-en",
        }],
        "pipeline_type": "TRIGGERED",
        "columns_to_sync": ["doc_id", "content", "title", "category"],
    },
)
```

Key decisions:
- A `STORAGE_OPTIMIZED` endpoint handles 100M+ vectors at lower cost; choose `STANDARD` when you need sub-100ms latency.
- `embedding_source_columns` tells Databricks to compute embeddings from your text column automatically.
- The `TRIGGERED` pipeline syncs on demand via API call, keeping costs predictable for batch workflows.
- `columns_to_sync` limits what gets indexed: only include columns you'll need in query results.
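The two choices above (endpoint type and pipeline type) can be condensed into a pair of tiny decision helpers. These function names are illustrative, not part of the Databricks SDK; the criteria mirror the bullets:

```python
def choose_endpoint_type(needs_sub_100ms_latency: bool) -> str:
    """STANDARD trades cost for latency; STORAGE_OPTIMIZED scales to 100M+ vectors cheaply."""
    return "STANDARD" if needs_sub_100ms_latency else "STORAGE_OPTIMIZED"


def choose_pipeline_type(needs_near_real_time: bool) -> str:
    """CONTINUOUS auto-syncs on table changes (always-on compute); TRIGGERED syncs on demand."""
    return "CONTINUOUS" if needs_near_real_time else "TRIGGERED"
```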
## More Patterns

### Bring Your Own Embeddings

"I already have embeddings from a fine-tuned model. Create a Delta Sync index using my pre-computed vectors in Python."
```python
index = w.vector_search_indexes.create_index(
    name="catalog.schema.custom_embed_index",
    endpoint_name="docs-search-endpoint",
    primary_key="id",
    index_type="DELTA_SYNC",
    delta_sync_index_spec={
        "source_table": "catalog.schema.embedded_docs",
        "embedding_vector_columns": [{
            "name": "embedding",
            "embedding_dimension": 768,
        }],
        "pipeline_type": "TRIGGERED",
    },
)
```

Use `embedding_vector_columns` instead of `embedding_source_columns` when you pre-compute vectors. The dimension must match what your model produces: 768 for GTE-Large, 1536 for OpenAI text-embedding-3-small, and so on.
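Because the dimension is fixed at index creation, it's worth checking your pre-computed vectors before the first sync. A minimal sketch, assuming rows are dicts with an `embedding` column (the helper itself is hypothetical, not an SDK call):

```python
def validate_embedding_dims(rows, column="embedding", expected_dim=768):
    """Raise if any pre-computed vector doesn't match the index's declared dimension."""
    for i, row in enumerate(rows):
        actual = len(row[column])
        if actual != expected_dim:
            raise ValueError(
                f"row {i}: expected {expected_dim}-dim vector, got {actual}"
            )
    return True
```

Running this over a sample of the source table before `create_index` is far cheaper than discovering a mismatch through bad query results later.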
### Real-Time CRUD with Direct Access

"I need an index I can update in real time from my application without a Delta table. Use Python with the Databricks SDK."
```python
import json

index = w.vector_search_indexes.create_index(
    name="catalog.schema.realtime_index",
    endpoint_name="docs-search-endpoint",
    primary_key="id",
    index_type="DIRECT_ACCESS",
    direct_access_index_spec={
        "embedding_vector_columns": [
            {"name": "embedding", "embedding_dimension": 768},
        ],
        "schema_json": json.dumps({
            "id": "string",
            "text": "string",
            "embedding": "array<float>",
            "category": "string",
        }),
    },
)
```
```python
# Upsert vectors directly via API
w.vector_search_indexes.upsert_data_vector_index(
    index_name="catalog.schema.realtime_index",
    inputs_json=json.dumps([
        {"id": "doc-001", "text": "ML basics", "embedding": [0.1, 0.2, ...], "category": "ml"},
        {"id": "doc-002", "text": "Deep learning", "embedding": [0.4, 0.5, ...], "category": "dl"},
    ]),
)
```

Direct Access skips Delta table sync entirely. You manage inserts, updates, and deletes through the API. Use it when your data arrives from an application layer rather than a lakehouse pipeline.
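Since `upsert_data_vector_index` takes one JSON array per call, applications writing many records typically chunk them client-side. A hedged sketch of that batching; `upsert_in_batches` is a hypothetical helper (pass `w.vector_search_indexes` as `indexes_api`), and the batch size of 100 is an arbitrary assumption:

```python
import json


def upsert_in_batches(indexes_api, index_name, records, batch_size=100):
    """Split records into fixed-size batches and upsert each one.

    indexes_api must expose upsert_data_vector_index, e.g.
    w.vector_search_indexes from the Databricks SDK.
    """
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        indexes_api.upsert_data_vector_index(
            index_name=index_name,
            inputs_json=json.dumps(batch),
        )
```

Batching keeps individual request payloads small and makes partial-failure retries tractable: only the failed batch needs to be resent.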
### Trigger a Manual Sync

"My TRIGGERED index needs to pick up new rows I just inserted into the source table."
```python
w.vector_search_indexes.sync_index(index_name="catalog.schema.docs_index")
```

For `TRIGGERED` pipelines, new data sits in the source table until you explicitly call `sync_index()`. The `CONTINUOUS` pipeline type auto-syncs on every table change, but costs more due to always-on compute.
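Note that `sync_index` kicks off the sync but does not wait for it to finish, so downstream steps that need fresh results usually poll until the index reports ready. A generic polling sketch; `is_ready` is a stand-in for whatever status check you use (for example, inspecting the index status returned by the SDK):

```python
import time


def wait_until_ready(is_ready, timeout_s=600, poll_s=10):
    """Poll is_ready() until it returns True, or raise after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_s)
    raise TimeoutError("index did not become ready within the timeout")
```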
## Watch Out For

- **Embedding dimension mismatch.** If your query embeddings are 768-dimensional but the index was created with 1536, queries silently return garbage results. Always confirm the dimension matches your model output.
- **Wrong filter syntax for your endpoint type.** Standard endpoints use `filters_json` (dictionary format), while Storage-Optimized endpoints use `filter_string` (SQL syntax). Using the wrong one returns an error that doesn't clearly explain the issue.
- **Forgetting `columns_to_sync`.** Without it, every column in the source table gets synced, which increases storage costs and slows indexing, especially on wide tables.
- **Choosing `CONTINUOUS` by default.** It's tempting for freshness, but the always-on compute adds up fast. Start with `TRIGGERED` and move to `CONTINUOUS` only when your use case requires near-real-time updates.
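One way to sidestep the filter-syntax pitfall is to keep filters as a dict everywhere and render the SQL form only when querying a Storage-Optimized endpoint. A hypothetical converter, as a sketch only: it handles just equality and IN filters, and its naive quoting is not injection-safe for untrusted input:

```python
def to_filter_string(filters: dict) -> str:
    """Render a simple dict of equality/IN filters as a filter_string SQL clause."""
    clauses = []
    for column, value in filters.items():
        if isinstance(value, (list, tuple)):
            rendered = ", ".join(f"'{v}'" for v in value)
            clauses.append(f"{column} IN ({rendered})")
        else:
            clauses.append(f"{column} = '{value}'")
    return " AND ".join(clauses)
```

With this shape, the same `{"category": "ml"}` dict can be passed as `filters_json` on a Standard endpoint or converted to `"category = 'ml'"` for `filter_string` on a Storage-Optimized one.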