Reverse ETL
Skill: databricks-lakebase-autoscale
What You Can Build
You can sync data from Unity Catalog Delta tables into Lakebase Autoscaling as PostgreSQL tables, enabling low-latency OLTP access patterns on data processed in the lakehouse. Product catalogs, user profiles, feature stores — any Gold-layer table you need to serve at sub-second latency can flow through a managed sync pipeline.
In Action
“Using Python, create a triggered synced table that copies user profiles from a Delta table to Lakebase PostgreSQL.”
```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.database import (
    SyncedDatabaseTable,
    SyncedTableSpec,
    NewPipelineSpec,
    SyncedTableSchedulingPolicy,
)

w = WorkspaceClient()

synced_table = w.database.create_synced_database_table(
    SyncedDatabaseTable(
        name="lakebase_catalog.schema.user_profiles",
        spec=SyncedTableSpec(
            source_table_full_name="analytics.gold.user_profiles",
            primary_key_columns=["user_id"],
            scheduling_policy=SyncedTableSchedulingPolicy.TRIGGERED,
            new_pipeline_spec=NewPipelineSpec(
                storage_catalog="lakebase_catalog",
                storage_schema="staging",
            ),
        ),
    )
)
```

Key decisions:
- Triggered mode syncs on demand — run it hourly or daily for dashboards. Continuous mode streams with seconds of latency for live apps. Snapshot does a one-time full copy for initial setup.
- Triggered and Continuous modes require Change Data Feed (CDF) to be enabled on the source Delta table.
- Each synced table uses up to 16 connections and creates a managed Spark Declarative Pipeline under the hood
- Capacity limits: 2 TB total across all synced tables, recommended less than 1 TB per table
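These limits lend themselves to a quick back-of-the-envelope sizing check before you create a fleet of synced tables. A minimal sketch — the function and its names are illustrative, not part of the SDK; only the 16-connection, 2 TB, and 1 TB figures come from the limits above:

```python
# Figures from the documented limits: up to 16 connections per synced table,
# 2 TB total across all synced tables, <1 TB recommended per table.
CONNECTIONS_PER_SYNCED_TABLE = 16
TOTAL_CAPACITY_TB = 2.0
RECOMMENDED_MAX_PER_TABLE_TB = 1.0


def check_sync_plan(table_sizes_tb, compute_max_connections):
    """Return a list of warnings for a planned set of synced tables."""
    warnings = []
    total = sum(table_sizes_tb)
    if total > TOTAL_CAPACITY_TB:
        warnings.append(f"total {total:.2f} TB exceeds the 2 TB capacity limit")
    for i, size in enumerate(table_sizes_tb):
        if size > RECOMMENDED_MAX_PER_TABLE_TB:
            warnings.append(f"table {i} is {size:.2f} TB; keep each table under 1 TB")
    needed = len(table_sizes_tb) * CONNECTIONS_PER_SYNCED_TABLE
    if needed > compute_max_connections:
        warnings.append(
            f"{needed} connections needed but compute allows {compute_max_connections}"
        )
    return warnings
```

For example, two tables of 0.5 TB and 1.2 TB fit the total budget but trip the per-table recommendation, while five small tables against a 64-connection compute trip the connection budget.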
More Patterns
Enabling Change Data Feed on the source table
“Using SQL, enable CDF on a Delta table before setting up triggered or continuous sync.”
```sql
ALTER TABLE analytics.gold.user_profiles
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
```

CDF tracks row-level changes (inserts, updates, deletes) so the sync pipeline can process only the delta instead of re-copying the full table. Without CDF, triggered and continuous modes fail at creation time.
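Since a missing CDF property only surfaces as an error when the synced table is created, it can be worth validating up front. A small sketch, assuming you have already fetched the source table's properties into a dict (for example by parsing SHOW TBLPROPERTIES output); the helper name is hypothetical:

```python
def cdf_enabled(tbl_properties: dict) -> bool:
    """True if delta.enableChangeDataFeed is set to true in the given properties.

    `tbl_properties` is assumed to be a plain dict of property name -> value,
    e.g. built from SHOW TBLPROPERTIES results.
    """
    value = tbl_properties.get("delta.enableChangeDataFeed", "false")
    return str(value).strip().lower() == "true"
```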
Real-time feature serving with continuous sync
“Using Python, set up a continuous synced table for serving ML features at low latency.”
```python
w.database.create_synced_database_table(
    SyncedDatabaseTable(
        name="ml_catalog.public.user_features",
        spec=SyncedTableSpec(
            source_table_full_name="ml.features.user_features",
            primary_key_columns=["user_id"],
            scheduling_policy=SyncedTableSchedulingPolicy.CONTINUOUS,
        ),
    )
)
```

Continuous mode delivers changes with seconds of latency, making it suitable for feature stores that serve real-time inference requests. The trade-off is higher cost and a minimum sync interval of 15 seconds.
Checking sync status
“Using Python, verify a synced table is healthy and processing changes.”
```python
status = w.database.get_synced_database_table(
    name="lakebase_catalog.schema.user_profiles"
)
print(f"State: {status.data_synchronization_status.detailed_state}")
print(f"Message: {status.data_synchronization_status.message}")
```

Monitor sync status after creation and periodically in production. The detailed_state field tells you whether the pipeline is initializing, running, or failed.
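For automation, you often want to block until a newly created synced table leaves its provisioning state. A minimal polling sketch: fetch_state stands in for a callable returning the detailed_state shown above, and the ONLINE/FAILED state names are assumptions to adjust for your workspace:

```python
import time


def wait_for_state(fetch_state, done_states=("ONLINE",), failed_states=("FAILED",),
                   timeout_s=600, poll_s=10):
    """Poll fetch_state() until it returns a terminal state or the timeout expires.

    fetch_state is any zero-argument callable returning the current detailed
    state as a string, e.g.:
        lambda: w.database.get_synced_database_table(name=...)
                 .data_synchronization_status.detailed_state
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = fetch_state()
        if state in done_states:
            return state
        if state in failed_states:
            raise RuntimeError(f"sync entered failed state: {state}")
        time.sleep(poll_s)
    raise TimeoutError(f"sync did not reach {done_states} within {timeout_s}s")
```

Injecting the fetch function keeps the helper decoupled from the SDK and easy to test with canned states.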
Choosing the right sync mode
“Which mode fits my workload?”
```python
# Snapshot -- one-time full copy
# Best for: initial data load, small reference tables, historical backfills
# 10x more efficient than incremental if modifying >10% of data

# Triggered -- on-demand scheduled sync
# Best for: dashboards refreshed hourly/daily, batch reporting tables
# Requires CDF on source table

# Continuous -- real-time streaming
# Best for: live apps, feature stores, real-time dashboards
# Highest cost, minimum 15s intervals, requires CDF on source table
```

Start with Triggered for most use cases. Move to Continuous only when your application genuinely needs sub-minute latency from the lakehouse to the serving layer.
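The decision tree above can be captured in a small helper. Purely illustrative — the function, its sub-minute threshold, and the returned strings mirror the comparison, not any SDK enum:

```python
def choose_sync_mode(one_time: bool, max_staleness_s: int) -> str:
    """Map workload requirements onto the three sync modes described above."""
    if one_time:
        return "SNAPSHOT"    # one-time full copy: initial load or backfill
    if max_staleness_s < 60:
        return "CONTINUOUS"  # sub-minute latency; highest cost, needs CDF
    return "TRIGGERED"       # scheduled/on-demand refresh; needs CDF
```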
Watch Out For
- Forgetting to enable CDF — creating a Triggered or Continuous synced table against a source without Change Data Feed produces an immediate error. Enable CDF on the source table first.
- Non-additive schema changes on streaming syncs — Triggered and Continuous modes support only additive schema evolution (adding columns). Renaming or removing columns requires deleting and recreating the synced table.
- Exceeding connection limits — each synced table uses up to 16 connections. If your Lakebase compute has a low max-connection limit, a handful of synced tables can exhaust it. Factor synced table connections into your compute sizing.
- Trying to write to synced tables — synced tables are read-only in Lakebase. The sync pipeline owns the data. If your app needs to write additional columns or rows, create a separate table and join at query time.