Reverse ETL

Skill: databricks-lakebase-autoscale

You can sync data from Unity Catalog Delta tables into Lakebase Autoscaling as PostgreSQL tables, enabling low-latency OLTP access patterns on data processed in the lakehouse. Product catalogs, user profiles, feature stores — any Gold-layer table you need to serve at sub-second latency can flow through a managed sync pipeline.

“Using Python, create a triggered synced table that copies user profiles from a Delta table to Lakebase PostgreSQL.”

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.database import (
    SyncedDatabaseTable,
    SyncedTableSpec,
    NewPipelineSpec,
    SyncedTableSchedulingPolicy,
)

w = WorkspaceClient()

synced_table = w.database.create_synced_database_table(
    SyncedDatabaseTable(
        name="lakebase_catalog.schema.user_profiles",
        spec=SyncedTableSpec(
            source_table_full_name="analytics.gold.user_profiles",
            primary_key_columns=["user_id"],
            scheduling_policy=SyncedTableSchedulingPolicy.TRIGGERED,
            new_pipeline_spec=NewPipelineSpec(
                storage_catalog="lakebase_catalog",
                storage_schema="staging",
            ),
        ),
    )
)

Key decisions:

  • Triggered mode syncs on demand — run it hourly or daily for dashboards. Continuous mode streams with seconds of latency for live apps. Snapshot does a one-time full copy for initial setup.
  • Triggered and Continuous modes require Change Data Feed (CDF) to be enabled on the source Delta table.
  • Each synced table uses up to 16 connections and creates a managed Spark Declarative Pipeline under the hood.
  • Capacity limits: 2 TB total across all synced tables, with no more than 1 TB recommended per table.
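The capacity limits above are easy to check before provisioning. A minimal sketch — the 2 TB / 1 TB figures come from the limits listed here, but the helper itself is ours, not part of the SDK:

```python
# Hypothetical planning helper: validate planned synced-table sizes
# against the documented limits (2 TB total, ~1 TB recommended per table).
TOTAL_LIMIT_TB = 2.0
PER_TABLE_RECOMMENDED_TB = 1.0

def check_capacity(planned_sizes_tb: dict[str, float]) -> list[str]:
    """Return a list of warnings for a planned set of synced tables."""
    warnings = []
    for name, size in planned_sizes_tb.items():
        if size > PER_TABLE_RECOMMENDED_TB:
            warnings.append(f"{name}: {size} TB exceeds the ~1 TB per-table recommendation")
    total = sum(planned_sizes_tb.values())
    if total > TOTAL_LIMIT_TB:
        warnings.append(f"total {total} TB exceeds the 2 TB limit across all synced tables")
    return warnings

print(check_capacity({"user_profiles": 0.4, "product_catalog": 1.2, "features": 0.6}))
```

Running this over your planned tables before creating them is cheaper than hitting the limit mid-rollout.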

Enabling Change Data Feed on the source table

“Using SQL, enable CDF on a Delta table before setting up triggered or continuous sync.”

ALTER TABLE analytics.gold.user_profiles
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

CDF tracks row-level changes (inserts, updates, deletes) so the sync pipeline can process only the delta instead of re-copying the full table. Without CDF, triggered and continuous modes fail at creation time.
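Since creation fails without CDF, it is worth confirming the property before setting up the sync. A minimal sketch over an already-fetched properties map (for example, the key/value pairs returned by `SHOW TBLPROPERTIES`) — the helper is illustrative, not part of the SDK:

```python
def cdf_enabled(tbl_properties: dict[str, str]) -> bool:
    """Check a Delta table's properties map for Change Data Feed.

    `tbl_properties` is assumed to hold the key/value pairs of
    SHOW TBLPROPERTIES for the source table (illustrative).
    """
    return tbl_properties.get("delta.enableChangeDataFeed", "false").lower() == "true"

props = {"delta.enableChangeDataFeed": "true", "delta.minReaderVersion": "1"}
print(cdf_enabled(props))  # True
```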

Real-time feature serving with continuous sync

“Using Python, set up a continuous synced table for serving ML features at low latency.”

w.database.create_synced_database_table(
    SyncedDatabaseTable(
        name="ml_catalog.public.user_features",
        spec=SyncedTableSpec(
            source_table_full_name="ml.features.user_features",
            primary_key_columns=["user_id"],
            scheduling_policy=SyncedTableSchedulingPolicy.CONTINUOUS,
        ),
    )
)

Continuous mode delivers changes with seconds of latency, making it suitable for feature stores that serve real-time inference requests. The trade-off is higher cost and a minimum sync interval of 15 seconds.

“Using Python, verify a synced table is healthy and processing changes.”

status = w.database.get_synced_database_table(
    name="lakebase_catalog.schema.user_profiles"
)
print(f"State: {status.data_synchronization_status.detailed_state}")
print(f"Message: {status.data_synchronization_status.message}")

Monitor sync status after creation and periodically in production. The detailed_state field tells you whether the pipeline is initializing, running, or failed.
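For automation, that status check can be wrapped in a polling loop. A sketch — the fetcher is injected so the loop stays generic, and the state-name matching ("ONLINE", "FAILED") is an assumption about the detailed_state strings, not an exhaustive list of the API's states:

```python
import time

def wait_until_synced(fetch_state, timeout_s=600, poll_s=10):
    """Poll a state-returning callable until the sync looks healthy.

    `fetch_state` is any zero-arg callable returning a state string
    (e.g. a wrapper around get_synced_database_table). The terminal
    state names checked here are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = fetch_state()
        if "FAILED" in state:
            raise RuntimeError(f"sync failed: {state}")
        if "ONLINE" in state:
            return state
        time.sleep(poll_s)
    raise TimeoutError("synced table did not become healthy in time")

# Example with a stubbed fetcher that becomes healthy on the third poll:
states = iter(["PROVISIONING", "PROVISIONING", "ONLINE_NO_PENDING_UPDATE"])
print(wait_until_synced(lambda: next(states), timeout_s=5, poll_s=0))
```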

“Which mode fits my workload?”

# Snapshot -- one-time full copy
#   Best for: initial data load, small reference tables, historical backfills
#   ~10x more efficient than incremental sync when >10% of the data changes
#
# Triggered -- on-demand scheduled sync
#   Best for: dashboards refreshed hourly/daily, batch reporting tables
#   Requires CDF on the source table
#
# Continuous -- real-time streaming
#   Best for: live apps, feature stores, real-time dashboards
#   Highest cost, minimum 15 s sync interval, requires CDF on the source table

Start with Triggered for most use cases. Move to Continuous only when your application genuinely needs sub-minute latency from the lakehouse to the serving layer.

  • Forgetting to enable CDF — creating a Triggered or Continuous synced table against a source without Change Data Feed produces an immediate error. Enable CDF on the source table first.
  • Non-additive schema changes on streaming syncs — Triggered and Continuous modes support only additive schema evolution (adding columns). Renaming or removing columns requires deleting and recreating the synced table.
  • Exceeding connection limits — each synced table uses up to 16 connections. If your Lakebase compute has a low max-connection limit, a handful of synced tables can exhaust it. Factor synced table connections into your compute sizing.
  • Trying to write to synced tables — synced tables are read-only in Lakebase. The sync pipeline owns the data. If your app needs to write additional columns or rows, create a separate table and join at query time.
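Sizing for the connection pitfall above is simple arithmetic: each synced table can hold up to 16 connections, so budget them against your instance's maximum. A sketch — the idea of reserving a fixed share for application traffic is our assumption, not a platform requirement:

```python
CONNECTIONS_PER_SYNCED_TABLE = 16  # upper bound per synced table (from the limits above)

def max_synced_tables(max_connections: int, reserved_for_apps: int) -> int:
    """How many synced tables fit before sync pipelines starve app connections."""
    available = max_connections - reserved_for_apps
    return max(0, available // CONNECTIONS_PER_SYNCED_TABLE)

# e.g. a 100-connection instance keeping 40 connections for application traffic:
print(max_synced_tables(100, 40))  # 3
```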