Troubleshooting
Skill: databricks-synthetic-data-gen
What You Can Build
You can diagnose and fix the errors that come up during synthetic data generation without trial-and-error. The most common issues fall into three categories: serverless compute limitations (`.cache()` not supported), dependency installation failures (`ModuleNotFoundError`), and performance problems (slow UDFs, out-of-memory). Each has a specific cause and a specific fix.
In Action
“My generation script fails with ‘PERSIST TABLE is not supported on serverless compute’ when I try to cache the customers table for FK joins. Fix it. Use Python with PySpark.”
```python
# BAD -- fails on serverless
customers_df = spark.range(0, N_CUSTOMERS)
# ... generate columns
customers_df.cache()  # AnalysisException: PERSIST TABLE is not supported on serverless
```

```python
# GOOD -- write to Delta, then read back
customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")
customer_lookup = spark.table(f"{CATALOG}.{SCHEMA}.customers")
```

Key decisions:
- Never use `.cache()` or `.persist()` on serverless — these operations are fundamentally not supported. This is not a bug; it is a serverless compute limitation.
- Write to Delta, read back is the only reliable pattern for reusing a DataFrame across multiple downstream joins on serverless. It materializes the data once and makes it available for all subsequent reads.
- Master tables first, child tables second — generate customers, write to Delta, read back the lookup columns, then generate orders with FK joins to the lookup.
More Patterns
Fix ModuleNotFoundError for Faker
“My Pandas UDF fails with ModuleNotFoundError: No module named ‘faker’ on the executor. Fix it. Use Python.”
```python
# For Databricks Connect 16.4+
from databricks.connect import DatabricksSession, DatabricksEnv

env = DatabricksEnv().withDependencies("faker", "pandas", "numpy", "holidays")
spark = DatabricksSession.builder.withEnvironment(env).serverless(True).getOrCreate()
```

The `ModuleNotFoundError` happens because Faker is installed on the driver but not on the executor. `DatabricksEnv().withDependencies()` pushes dependencies to the remote compute. For serverless jobs, use the environments spec with `"client": "4"`. For classic clusters, install libraries via the CLI.
Fix DatabricksEnv Import Error
“I get ImportError: cannot import name ‘DatabricksEnv’ from databricks.connect. Fix it. Use bash.”
```bash
# Check your version
uv run python -c "import importlib.metadata; print(importlib.metadata.version('databricks-connect'))"

# Upgrade to 16.4+
uv pip install "databricks-connect>=16.4,<17.4"
```

`DatabricksEnv` was introduced in Databricks Connect 16.4. Older versions do not have it. Upgrade, or fall back to installing libraries on a classic cluster.
Fix Slow Faker UDFs
“My data generation takes forever to produce 1M rows. Speed it up. Use Python with PySpark.”
```python
import pandas as pd
from faker import Faker
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# SLOW -- scalar UDF, one Faker call per row
@F.udf(returnType=StringType())
def slow_fake_name():
    return Faker().name()

# FAST -- Pandas UDF, batch processing
@F.pandas_udf(StringType())
def fast_fake_name(ids: pd.Series) -> pd.Series:
    fake = Faker()  # one Faker instance per batch, not per row
    return pd.Series([fake.name() for _ in range(len(ids))])
```

Scalar UDFs (`@F.udf`) invoke Python once per row, with serialization overhead on each call. Pandas UDFs (`@F.pandas_udf`) process entire batches (typically thousands of rows) in a single call. For Faker-heavy generation, this is a 10-50x speedup.
Fix Out-of-Memory on Large Datasets
“My generation script runs out of memory at 5M rows. Fix it. Use Python with PySpark.”
```python
# Increase partitions to reduce per-partition memory
customers_df = spark.range(0, 5_000_000, numPartitions=64)
```

More partitions mean smaller batches per executor. For datasets over 1M rows, use 64+ partitions; for 10M+, go to 128. The tradeoff is more task-scheduling overhead, but that is negligible compared to OOM failures.
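These sizing rules can be wrapped in a small helper. The ~100k-rows-per-partition target, the floor of 64, and the power-of-two rounding are rules of thumb assumed here, not an official Databricks formula:

```python
import math

def pick_partitions(n_rows: int,
                    target_rows_per_partition: int = 100_000,
                    floor: int = 64) -> int:
    # Rule of thumb (an assumption, not an official formula): aim for roughly
    # 100k rows per partition, never drop below 64 partitions for large
    # generations, and round up to the next power of two.
    raw = max(floor, math.ceil(n_rows / target_rows_per_partition))
    return 2 ** math.ceil(math.log2(raw))

# pick_partitions(5_000_000)  -> 64
# pick_partitions(10_000_000) -> 128
```

Use it at generation time, e.g. `spark.range(0, n_rows, numPartitions=pick_partitions(n_rows))`, so the partition count scales with the dataset instead of being hard-coded.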
Fix F.window vs Window Confusion
“I get ‘AttributeError: function object has no attribute partitionBy’ when trying to use a window function. Fix it. Use Python with PySpark.”
```python
# WRONG -- F.window is for streaming tumbling/sliding windows
window_spec = F.window.partitionBy("account_id").orderBy("contact_id")
# AttributeError: 'function' object has no attribute 'partitionBy'
```

```python
# CORRECT -- Window is for analytical window functions
from pyspark.sql.window import Window

window_spec = Window.partitionBy("account_id").orderBy("contact_id")

contacts_df = contacts_df.withColumn(
    "is_primary",
    F.row_number().over(window_spec) == 1,
)
```

`F.window` is a streaming function for time-based tumbling/sliding windows. `Window` from `pyspark.sql.window` is for analytical operations like `row_number()`, `rank()`, `lead()`, and `lag()`. They are completely different things with similar names.
Fix Referential Integrity Issues
“My orders table has customer_id values that don’t exist in the customers table. Fix the generation approach. Use Python with PySpark.”
```python
from pyspark.sql.window import Window

# 1. Generate and WRITE master table first
customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")

# 2. Read back for FK lookups
customer_lookup = spark.table(f"{CATALOG}.{SCHEMA}.customers").select("customer_id", "tier")

# 3. Generate child table with valid FKs: draw a random customer index per
#    order (sampling with replacement), then join back to the lookup
n_customers = customer_lookup.count()
indexed_customers = customer_lookup.withColumn(
    "cust_idx", F.row_number().over(Window.orderBy("customer_id")) - 1
)
orders_df = (
    spark.range(0, N_ORDERS, numPartitions=PARTITIONS)
    .select(
        F.concat(F.lit("ORD-"), F.lpad(F.col("id").cast("string"), 6, "0")).alias("order_id"),
        (F.rand(seed=42) * n_customers).cast("int").alias("cust_idx"),
    )
    .join(indexed_customers, on="cust_idx")
    .drop("cust_idx")
)
```

Generating random customer_id strings for orders guarantees orphan records. Instead, write the customers table first, read it back, and join. Drawing the index with `F.rand` samples customers with replacement, so customers can have multiple orders while every FK stays valid.
Watch Out For
- Using `.cache()` or `.persist()` on serverless — always fails. Write to Delta and read back. This is the single most common serverless error in data generation.
- Uniform distributions for financial data — `np.random.uniform(10, 1000)` produces flat data that looks nothing like real transactions. Use `np.random.lognormal` for amounts, `np.random.pareto` for heavy-tailed distributions, and `np.random.exponential` for time-to-event data.
- Missing weekend and holiday patterns — real business data drops on weekends and holidays. Without time-based weighting, your synthetic data has perfectly even daily counts, which is a clear tell.
- Skipping verification — always verify after generation. Check row counts, distributions (`groupBy("tier").count()`), and referential integrity (a `left_anti` join to find orphans). A 30-second check saves hours of debugging downstream.
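The distribution advice above can be sketched with NumPy. The parameters here (log-normal mean/sigma, Pareto shape, exponential scale) are illustrative assumptions, to be calibrated against your real data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Transaction amounts: log-normal gives the right skew of real spending
# (many small purchases, a long tail of large ones).
amounts = rng.lognormal(mean=3.5, sigma=1.0, size=n)

# Heavy-tailed counts (e.g. items per order): shifted Pareto, minimum 1.
quantities = 1 + rng.pareto(a=2.5, size=n).astype(int)

# Time-to-event (e.g. hours until the next login): exponential.
hours_between = rng.exponential(scale=12.0, size=n)
```

A quick sanity check on the result: for the log-normal amounts, the mean should sit well above the median, a skew that a uniform distribution can never produce.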
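The verification step above can be run as a short script. This is a minimal local sketch with toy tables and a local SparkSession; on Databricks the `spark` session already exists and you would read your real catalog tables instead:

```python
from pyspark.sql import SparkSession

# Local SparkSession for illustration only; on Databricks, `spark` is provided.
spark = SparkSession.builder.master("local[2]").appName("verify").getOrCreate()

# Toy stand-ins for the generated tables (hypothetical data).
customers = spark.createDataFrame(
    [("C-001", "gold"), ("C-002", "silver")], ["customer_id", "tier"]
)
orders = spark.createDataFrame(
    [("ORD-000001", "C-001"), ("ORD-000002", "C-999")], ["order_id", "customer_id"]
)

# 1. Row counts
print(customers.count(), orders.count())

# 2. Distribution check: one row per tier with its count
customers.groupBy("tier").count().show()

# 3. Referential integrity: left_anti keeps orders with no matching customer
orphans = orders.join(customers, on="customer_id", how="left_anti")
print(orphans.count())  # ORD-000002 references the missing C-999
```

The `left_anti` join is the key check: any row it returns is an orphan child record, so a count of zero means every FK is valid.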