
Troubleshooting

Skill: databricks-synthetic-data-gen

You can diagnose and fix the errors that come up during synthetic data generation without trial and error. The most common issues fall into three categories: serverless compute limitations (.cache() is not supported), dependency installation failures (ModuleNotFoundError), and performance problems (slow UDFs, out-of-memory errors). Each has a specific cause and a specific fix.

“My generation script fails with ‘PERSIST TABLE is not supported on serverless compute’ when I try to cache the customers table for FK joins. Fix it. Use Python with PySpark.”

# BAD -- fails on serverless
customers_df = spark.range(0, N_CUSTOMERS)  # ... generate columns
customers_df.cache()  # AnalysisException: PERSIST TABLE is not supported on serverless

# GOOD -- write to Delta, then read back
customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")
customer_lookup = spark.table(f"{CATALOG}.{SCHEMA}.customers")

Key decisions:

  • Never use .cache() or .persist() on serverless — these operations are fundamentally not supported. This is not a bug, it is a serverless compute limitation.
  • Write to Delta, read back is the only reliable pattern for reusing a DataFrame across multiple downstream joins on serverless. It materializes the data once and makes it available for all subsequent reads.
  • Master tables first, child tables second — generate customers, write to Delta, read back the lookup columns, then generate orders with FK joins to the lookup.

“My Pandas UDF fails with ModuleNotFoundError: No module named ‘faker’ on the executor. Fix it. Use Python.”

# For Databricks Connect 16.4+
from databricks.connect import DatabricksSession, DatabricksEnv
env = DatabricksEnv().withDependencies("faker", "pandas", "numpy", "holidays")
spark = DatabricksSession.builder.withEnvironment(env).serverless(True).getOrCreate()

The ModuleNotFoundError happens because Faker is installed on the driver but not on the executor. DatabricksEnv().withDependencies() pushes dependencies to the remote compute. For serverless jobs, use the environments spec with "client": "4". For classic clusters, install libraries via the CLI.
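For a serverless job, the dependency list lives in the job's environments spec rather than in code. A minimal sketch of that spec is below; the exact field names and shape are an assumption here, so verify them against the Jobs API documentation for your workspace before relying on this:

```yaml
# Sketch of a serverless job environments spec -- field names assumed,
# check your workspace's Jobs API docs.
environments:
  - environment_key: default
    spec:
      client: "4"
      dependencies:
        - faker
        - holidays
```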

“I get ImportError: cannot import name ‘DatabricksEnv’ from databricks.connect. Fix it. Use bash.”

# Check your version
uv run python -c "import importlib.metadata; print(importlib.metadata.version('databricks-connect'))"
# Upgrade to 16.4+
uv pip install "databricks-connect>=16.4,<17.4"

DatabricksEnv was introduced in Databricks Connect 16.4. Older versions do not have it. Upgrade, or fall back to installing libraries on a classic cluster.
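Since the import itself fails on older versions, a script can gate on the installed version instead of catching ImportError blindly. supports_databricks_env below is a hypothetical helper, and the 16.4 cutoff comes from the note above:

```python
def supports_databricks_env(version: str) -> bool:
    """True when a databricks-connect version string is 16.4 or newer,
    i.e. new enough to ship DatabricksEnv (per the note above)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (16, 4)

# Typical use, gating the import on the installed version:
#   import importlib.metadata
#   if supports_databricks_env(importlib.metadata.version("databricks-connect")):
#       from databricks.connect import DatabricksEnv
```

Pre-release version strings (e.g. "16.4rc1") would need extra parsing; for standard releases the two-component comparison is enough.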

“My data generation takes forever to produce 1M rows. Speed it up. Use Python with PySpark.”

# SLOW -- scalar UDF, one Faker call per row
@F.udf(returnType=StringType())
def slow_fake_name():
    return Faker().name()

# FAST -- Pandas UDF, batch processing
@F.pandas_udf(StringType())
def fast_fake_name(ids: pd.Series) -> pd.Series:
    fake = Faker()  # constructed once per batch, not once per row
    return pd.Series([fake.name() for _ in range(len(ids))])

Scalar UDFs (@F.udf) invoke Python once per row with serialization overhead on each call. Pandas UDFs (@F.pandas_udf) process entire batches (typically thousands of rows) in a single call. For Faker-heavy generation, this is a 10-50x speedup.
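The arithmetic behind that speedup is easy to sketch. The 10,000-row batch size below matches the default of spark.sql.execution.arrow.maxRecordsPerBatch, but treat it as an assumption since the setting is configurable:

```python
import math

def udf_call_counts(n_rows: int, batch_size: int = 10_000) -> tuple[int, int]:
    """Python invocations needed by a scalar UDF vs a pandas UDF."""
    scalar_calls = n_rows                          # one call + serialization round-trip per row
    pandas_calls = math.ceil(n_rows / batch_size)  # one call per Arrow batch
    return scalar_calls, pandas_calls

print(udf_call_counts(1_000_000))  # (1000000, 100)
```

Four orders of magnitude fewer Python round-trips is where the 10-50x wall-clock speedup comes from; the remaining cost is the Faker calls themselves.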

“My generation script runs out of memory at 5M rows. Fix it. Use Python with PySpark.”

# Increase partitions to reduce per-partition memory
customers_df = spark.range(0, 5_000_000, numPartitions=64)

More partitions means smaller batches per executor. For datasets over 1M rows, use 64+ partitions. For 10M+, go to 128. The tradeoff is more task scheduling overhead, but that is negligible compared to OOM failures.
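As a quick sanity check on those numbers (pure arithmetic, no Spark needed):

```python
def rows_per_partition(n_rows: int, n_partitions: int) -> int:
    """Upper bound on rows a single task holds, assuming an even split."""
    return -(-n_rows // n_partitions)  # ceiling division

print(rows_per_partition(5_000_000, 64))    # 78125 rows per task
print(rows_per_partition(10_000_000, 128))  # 78125 rows per task
```

Doubling partitions alongside row count keeps per-task memory flat, which is why the 64-to-128 guidance scales.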

“I get ‘AttributeError: function object has no attribute partitionBy’ when trying to use a window function. Fix it. Use Python with PySpark.”

# WRONG -- F.window buckets timestamps into tumbling/sliding time windows
window_spec = F.window.partitionBy("account_id").orderBy("contact_id")
# AttributeError: 'function' object has no attribute 'partitionBy'

# CORRECT -- Window defines analytical window specs
from pyspark.sql.window import Window

window_spec = Window.partitionBy("account_id").orderBy("contact_id")
contacts_df = contacts_df.withColumn(
    "is_primary",
    F.row_number().over(window_spec) == 1,
)

F.window is a grouping function that buckets a timestamp column into time-based tumbling/sliding windows, most often seen in streaming aggregations. Window from pyspark.sql.window defines analytical window specs for operations like row_number(), rank(), lead(), and lag(). They are completely different things with confusingly similar names.

“My orders table has customer_id values that don’t exist in the customers table. Fix the generation approach. Use Python with PySpark.”

# 1. Generate and WRITE master table first
customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")

# 2. Read back for FK lookups, adding a dense 0..N-1 row index
from pyspark.sql.window import Window

customer_lookup = spark.table(f"{CATALOG}.{SCHEMA}.customers").select("customer_id", "tier")
n_customers = customer_lookup.count()
indexed_customers = customer_lookup.withColumn(
    "cust_idx", F.row_number().over(Window.orderBy("customer_id")) - 1
)

# 3. Generate child table with valid FKs by joining on a random index
orders_df = (
    spark.range(0, N_ORDERS, numPartitions=PARTITIONS)
    .select(
        F.concat(F.lit("ORD-"), F.lpad(F.col("id").cast("string"), 6, "0")).alias("order_id"),
        (F.rand(seed=42) * n_customers).cast("long").alias("cust_idx"),
    )
    .join(indexed_customers, "cust_idx")
    .drop("cust_idx")
)

Generating random customer_id strings for orders guarantees orphan records. Instead, write the customers table first, read it back, and join each order to a customer through a random index into the lookup. Because the index is drawn with replacement, customers can have multiple orders while every FK stays valid.

Common pitfalls:

  • Using .cache() or .persist() on serverless — always fails. Write to Delta and read back. This is the single most common serverless error in data generation.
  • Uniform distributions for financial data — np.random.uniform(10, 1000) produces flat data that looks nothing like real transactions. Use np.random.lognormal for amounts, np.random.pareto for heavy-tailed distributions, and np.random.exponential for time-to-event data.
  • Missing weekend and holiday patterns — real business data drops on weekends and holidays. Without time-based weighting, your synthetic data has perfectly even daily counts, which is a clear tell.
  • Skipping verification — always verify after generation. Check row counts, distributions (groupBy("tier").count()), and referential integrity (left_anti join to find orphans). A 30-second check saves hours of debugging downstream.
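The distribution and calendar pitfalls above are easy to demonstrate with plain NumPy. In this sketch the $80 median and the 0.3 weekend weight are illustrative assumptions, not calibrated values:

```python
import datetime as dt

import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Flat, unrealistic amounts vs a heavy-tailed lognormal (median ~ exp(log 80) = $80)
uniform_amounts = rng.uniform(10, 1000, n)
lognormal_amounts = rng.lognormal(mean=np.log(80), sigma=1.0, size=n)

# Weight weekdays ~3x heavier than weekends so daily counts dip on Sat/Sun
days = [dt.date(2024, 1, 1) + dt.timedelta(d) for d in range(28)]
weights = np.array([0.3 if d.weekday() >= 5 else 1.0 for d in days])
weights /= weights.sum()
order_days = rng.choice(len(days), size=n, p=weights)
```

A quick count by day of order_days shows weekend volume at roughly a third of weekday volume, and a histogram of lognormal_amounts shows the long right tail that uniform data lacks — exactly the checks the verification step should confirm.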