
Reverse ETL with Lakebase Provisioned

Skill: databricks-lakebase-provisioned

Reverse ETL bridges the gap between your analytical lakehouse and application-tier databases. You define a synced table that mirrors a Unity Catalog Delta table into Lakebase Provisioned as a PostgreSQL table. Applications then query Postgres directly for sub-millisecond point lookups — product catalogs, user profiles, feature stores — without hitting your lakehouse for every request.

“Using Python and the Databricks SDK, create a synced table that copies analytics.gold.user_profiles into Lakebase Provisioned with triggered scheduling.”

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.database import (
    SyncedDatabaseTable,
    SyncedTableSpec,
    SyncedTableSchedulingPolicy,
)

w = WorkspaceClient()

synced_table = w.database.create_synced_database_table(
    SyncedDatabaseTable(
        name="lakebase_catalog.public.user_profiles",
        database_instance_name="my-lakebase-instance",
        spec=SyncedTableSpec(
            source_table_full_name="analytics.gold.user_profiles",
            primary_key_columns=["user_id"],
            scheduling_policy=SyncedTableSchedulingPolicy.TRIGGERED,
        ),
    )
)
print(f"Created synced table: {synced_table.name}")

Key decisions:

  • TRIGGERED scheduling syncs on demand — you control when data flows. Best for batch pipelines where you sync after an ETL job completes.
  • CONTINUOUS scheduling syncs automatically as the source table changes. Best for real-time use cases like user profiles or feature stores.
  • SNAPSHOT scheduling takes a full copy each sync. Use this when Change Data Feed is not enabled on the source table.
  • The source Delta table must have Change Data Feed enabled for TRIGGERED and CONTINUOUS modes. Run ALTER TABLE ... SET TBLPROPERTIES (delta.enableChangeDataFeed = true) before creating the sync.
  • primary_key_columns is required — it defines how rows are upserted into Postgres during incremental syncs.
ALTER TABLE analytics.gold.user_profiles
SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

Without this property, TRIGGERED and CONTINUOUS syncs either fail or silently fall back to expensive full snapshots.
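It can be worth checking the property programmatically before creating the sync. A minimal sketch, assuming a Databricks notebook or Databricks Connect session where `spark` is available; `cdf_enabled` is a hypothetical helper over the `SHOW TBLPROPERTIES` output:

```python
def cdf_enabled(tblproperties: dict) -> bool:
    """True if delta.enableChangeDataFeed is set to true (case-insensitive)."""
    return tblproperties.get("delta.enableChangeDataFeed", "false").lower() == "true"

# In a notebook or Databricks Connect session:
# rows = spark.sql("SHOW TBLPROPERTIES analytics.gold.user_profiles").collect()
# props = {r["key"]: r["value"] for r in rows}
# if not cdf_enabled(props):
#     spark.sql("ALTER TABLE analytics.gold.user_profiles "
#               "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")
```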

“Using Python, sync a product catalog into Lakebase Provisioned for low-latency lookups from a storefront app.”

w.database.create_synced_database_table(
    SyncedDatabaseTable(
        name="ecommerce_catalog.public.products",
        database_instance_name="ecommerce-db",
        spec=SyncedTableSpec(
            source_table_full_name="gold.products.catalog",
            primary_key_columns=["product_id"],
            scheduling_policy=SyncedTableSchedulingPolicy.TRIGGERED,
        ),
    )
)

TRIGGERED scheduling works well here because the product catalog updates on a known schedule — after the nightly ETL job. Your orchestration job syncs the table as its final step, and the storefront queries Postgres with single-digit millisecond latency.
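Once the sync is live, the storefront reads through any standard Postgres client, since Lakebase speaks the standard Postgres protocol. A sketch of the point lookup; the column names, connection details, and the psycopg driver choice are all assumptions, and `get_product` is a hypothetical helper:

```python
# Point lookup against the synced Postgres table. Host and credentials
# are placeholders; adapt them to your Lakebase instance.
LOOKUP_SQL = "SELECT name, price FROM public.products WHERE product_id = %s"

def get_product(cur, product_id: str):
    """Fetch one product row through an open DB-API cursor."""
    cur.execute(LOOKUP_SQL, (product_id,))
    return cur.fetchone()

# Hypothetical wiring with psycopg:
# import psycopg
# with psycopg.connect(host="<instance-host>", dbname="ecommerce_catalog",
#                      user="<client-id>", password="<oauth-token>") as conn:
#     with conn.cursor() as cur:
#         print(get_product(cur, "SKU-123"))
```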

Real-time user profiles with continuous sync


“Using Python, set up continuous sync for user profiles so the auth service always sees fresh data.”

w.database.create_synced_database_table(
    SyncedDatabaseTable(
        name="auth_catalog.public.user_profiles",
        database_instance_name="auth-db",
        spec=SyncedTableSpec(
            source_table_full_name="gold.users.profiles",
            primary_key_columns=["user_id"],
            scheduling_policy=SyncedTableSchedulingPolicy.CONTINUOUS,
        ),
    )
)

CONTINUOUS mode watches the source Delta table and pushes changes as they arrive. The auth service always reads near-real-time data without any manual sync step. Use this when stale data causes visible user impact.

“Using Python, sync feature vectors into Lakebase so the ML model can query them at inference time.”

w.database.create_synced_database_table(
    SyncedDatabaseTable(
        name="ml_catalog.public.user_features",
        database_instance_name="feature-store-db",
        spec=SyncedTableSpec(
            source_table_full_name="ml.features.user_features",
            primary_key_columns=["user_id"],
            scheduling_policy=SyncedTableSchedulingPolicy.CONTINUOUS,
        ),
    )
)

Feature serving is a classic reverse ETL use case. The feature engineering pipeline writes to Delta. The synced table pushes those features into Postgres. The model serving endpoint queries Postgres for point lookups at inference time — no Spark overhead for a single-row read.
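That single-row read can be sketched as follows. The feature column names are illustrative assumptions, and `fetch_features` is a hypothetical helper that hands the model a column-to-value dict:

```python
# Inference-time feature lookup. FEATURE_COLS is illustrative; substitute
# your pipeline's actual column names.
FEATURE_COLS = ["clicks_7d", "purchases_30d", "avg_session_len"]
FEATURE_SQL = (
    f"SELECT {', '.join(FEATURE_COLS)} "
    "FROM public.user_features WHERE user_id = %s"
)

def fetch_features(cur, user_id: str) -> dict:
    """Return the user's feature vector as a column -> value dict."""
    cur.execute(FEATURE_SQL, (user_id,))
    row = cur.fetchone()
    return dict(zip(FEATURE_COLS, row)) if row else {}
```

The model serving endpoint holds a pooled connection and calls this once per request, avoiding any Spark overhead for a single-row read.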

“Using Python, check if my synced table is healthy and see the latest sync state.”

status = w.database.get_synced_database_table(
    name="lakebase_catalog.public.user_profiles"
)
print(f"State: {status.data_synchronization_status.detailed_state}")
print(f"Message: {status.data_synchronization_status.message}")

Check detailed_state after creating or modifying a synced table. Common states: ACTIVE means syncing normally, FAILED includes a message explaining why, and PROVISIONING means the initial sync is still running.
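For scripted pipelines, the check above can be wrapped in a polling loop. `wait_for_sync` is a hypothetical convenience that follows the state names listed here; the getter is injected so the SDK call can be passed in as a lambda:

```python
import time

def wait_for_sync(get_status, timeout_s: int = 600, poll_s: int = 10) -> str:
    """Poll an injected status getter until the table leaves PROVISIONING.

    Returns the final state name (e.g. ACTIVE or FAILED); raises on timeout.
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        state = get_status().data_synchronization_status.detailed_state
        name = getattr(state, "name", str(state))  # SDK may return an enum
        if name != "PROVISIONING":
            return name
        time.sleep(poll_s)
    raise TimeoutError("synced table still provisioning after timeout")

# Hypothetical wiring with the SDK call above:
# state = wait_for_sync(lambda: w.database.get_synced_database_table(
#     name="lakebase_catalog.public.user_profiles"))
```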

“Using the Databricks CLI, create a synced table for reverse ETL.”

databricks database create-synced-database-table \
  --json '{
    "name": "lakebase_catalog.public.synced_table",
    "database_instance_name": "my-lakebase-instance",
    "spec": {
      "source_table_full_name": "analytics.gold.user_profiles",
      "primary_key_columns": ["user_id"],
      "scheduling_policy": "TRIGGERED"
    }
  }'

“Using Python, delete the synced table and drop the Postgres table.”

# Remove the sync definition (Python)
w.database.delete_synced_database_table(
    name="lakebase_catalog.public.user_profiles"
)

-- Then drop the Postgres table (SQL, run through a Postgres client) to free storage
DROP TABLE lakebase_catalog.public.user_profiles;

Deleting the synced table stops the sync but does not drop the Postgres table. You need to drop it separately if you want to reclaim storage.

  • Source table missing Change Data Feed — TRIGGERED and CONTINUOUS modes require delta.enableChangeDataFeed = true on the source table. Without it, the sync either fails or silently falls back to expensive full snapshots.
  • Omitting primary_key_columns — this field is required. It defines how incremental updates are applied. Without a primary key, the sync cannot determine which rows to upsert versus insert.
  • Confusing delete behaviors: delete_synced_database_table removes the sync definition but leaves the Postgres table intact. You must DROP TABLE separately to free storage. Conversely, dropping the Postgres table without deleting the sync definition causes the sync to fail on its next run.
  • Using CONTINUOUS for large, infrequently updated tables — continuous sync maintains an always-on process. For tables that update once a day, TRIGGERED scheduling is cheaper and produces the same result.