Synthetic Data
Skill: databricks-synthetic-data-gen
What You Can Build
You can generate realistic, story-driven synthetic datasets — from thousands to millions of rows — using Spark, Faker, and Pandas UDFs on serverless compute. Your AI coding assistant designs the table schema with realistic distributions, referential integrity, and time patterns, then generates a complete script that creates infrastructure, produces data, and writes to Unity Catalog in one execution. The data is ready for downstream pipelines, Genie Spaces, or Knowledge Assistant demos.
In Action
“Generate 10,000 customers and 50,000 orders for an e-commerce demo. Enterprise customers should generate 5x more orders with higher amounts. Write to my Unity Catalog.”
```python
from databricks.connect import DatabricksSession, DatabricksEnv
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, DoubleType
import pandas as pd
import numpy as np

CATALOG = "my_catalog"
SCHEMA = "ecommerce_demo"

# Serverless with managed dependencies
env = DatabricksEnv().withDependencies("faker", "pandas", "numpy")
spark = (
    DatabricksSession.builder
    .withEnvironment(env)
    .serverless(True)
    .getOrCreate()
)

# Create infrastructure
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")

# Pandas UDFs for realistic data
@F.pandas_udf(StringType())
def fake_name(ids: pd.Series) -> pd.Series:
    from faker import Faker  # re-import inside the UDF: each partition is its own process
    fake = Faker()
    return pd.Series([fake.name() for _ in range(len(ids))])

@F.pandas_udf(DoubleType())
def generate_amount(tiers: pd.Series) -> pd.Series:
    amounts = []
    for tier in tiers:
        if tier == "Enterprise":
            amounts.append(float(np.random.lognormal(7.5, 0.8)))
        elif tier == "Pro":
            amounts.append(float(np.random.lognormal(5.5, 0.7)))
        else:
            amounts.append(float(np.random.lognormal(4.0, 0.6)))
    return pd.Series(amounts)

# Generate customers with weighted tier distribution
customers_df = (
    spark.range(0, 10000, numPartitions=16)
    .withColumn("r", F.rand())  # one draw per row so the tier mix is exact
    .select(
        F.concat(F.lit("CUST-"), F.lpad(F.col("id").cast("string"), 5, "0"))
            .alias("customer_id"),
        fake_name(F.col("id")).alias("name"),
        F.when(F.col("r") < 0.60, "Free")
         .when(F.col("r") < 0.75, "Pro")
         .otherwise("Enterprise").alias("tier"),
    )
)

# Write master table first — FK joins require it
customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")
```

Key decisions:
- Spark + Faker + Pandas UDFs — Faker runs inside Pandas UDFs for parallel execution across partitions. Pure Python Faker is single-threaded and painfully slow above a few thousand rows.
- Serverless with `DatabricksEnv().withDependencies()` — installs Faker and NumPy on serverless compute without managing a cluster. Available on databricks-connect 16.4+.
- Log-normal distributions — `np.random.lognormal()` produces realistic price/amount distributions with a long right tail. Enterprise customers center around $1,800, Pro around $245, Free around $55 (sanity-checked in the sketch after this list).
- Weighted tiers via chained `F.when()` over a single `F.rand()` draw — 60% Free, 15% Pro, 25% Enterprise. Comparing one draw against cumulative thresholds keeps the mix exact; independent `F.rand()` calls per branch would skew it. Adjust the thresholds to shift the mix.
- Master table written to Delta first — child tables join against this. Never use `.cache()` or `.persist()` on serverless; it throws `AnalysisException`.
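Those dollar centers are just the log-normal medians, exp(mu). A minimal standalone sketch to verify them (the sample size is arbitrary):

```python
import numpy as np

# Median of lognormal(mu, sigma) is exp(mu); sigma only widens the right tail
for tier, mu, sigma in [("Enterprise", 7.5, 0.8), ("Pro", 5.5, 0.7), ("Free", 4.0, 0.6)]:
    draws = np.random.lognormal(mu, sigma, size=100_000)
    print(f"{tier}: exp(mu) = ${np.exp(mu):,.0f}, sampled median = ${np.median(draws):,.0f}")
```

exp(7.5) ≈ 1808, exp(5.5) ≈ 245, and exp(4.0) ≈ 55, matching the figures above.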
More Patterns
Referential integrity with FK joins
“Generate 50,000 orders that reference the customer table with realistic date spreads.”
```python
from datetime import datetime, timedelta
from pyspark.sql.window import Window

END_DATE = datetime.now()
START_DATE = END_DATE - timedelta(days=180)

# Read back the master table for FK joins
customers = spark.read.table(f"{CATALOG}.{SCHEMA}.customers")
n_customers = customers.count()

# Give each customer a contiguous 0..N-1 index for round-robin assignment
customers_indexed = (
    customers.select("customer_id", "tier")
    .withColumn("cust_row", F.row_number().over(Window.orderBy("customer_id")) - 1)
)

# Generate orders with valid customer_id references
orders_df = (
    spark.range(0, 50000, numPartitions=32)
    .withColumn("r", F.rand())  # one draw per row so status thresholds are cumulative
    .select(
        F.concat(F.lit("ORD-"), F.lpad(F.col("id").cast("string"), 6, "0"))
            .alias("order_id"),
        F.date_add(F.lit(START_DATE.date()), (F.rand() * 180).cast("int"))
            .alias("order_date"),
        F.when(F.col("r") < 0.65, "delivered")
         .when(F.col("r") < 0.80, "shipped")
         .when(F.col("r") < 0.90, "processing")
         .when(F.col("r") < 0.95, "pending")
         .otherwise("cancelled").alias("status"),
        # Join key to get valid customer_ids — round-robin assignment
        (F.col("id") % n_customers).alias("cust_row"),
    )
    .join(customers_indexed, "cust_row")
    .withColumn("amount", generate_amount(F.col("tier")))
    .drop("cust_row", "tier")
)

orders_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.orders")
```

Write the master table to Delta first, then read it back for FK joins. This pattern avoids `.cache()` (which fails on serverless) and guarantees referential integrity across tables.
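To confirm the integrity actually holds after the write, a quick anti-join check (a minimal sketch, assuming the two tables written above):

```python
# Orders whose customer_id has no match in customers; should be zero
orphans = (
    spark.read.table(f"{CATALOG}.{SCHEMA}.orders")
    .join(
        spark.read.table(f"{CATALOG}.{SCHEMA}.customers"),
        "customer_id",
        "left_anti",
    )
    .count()
)
assert orphans == 0, f"{orphans} orders reference missing customers"
```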
Time-series IoT sensor data
“Generate 100,000 rows of IoT sensor readings with realistic timestamp patterns and occasional anomalies.”
```python
@F.pandas_udf(DoubleType())
def sensor_reading(timestamps: pd.Series) -> pd.Series:
    readings = []
    for ts in timestamps:
        base = 22.0 + 3.0 * np.sin(ts.hour * np.pi / 12)  # daily cycle
        noise = np.random.normal(0, 0.5)
        anomaly = np.random.choice([0, 15], p=[0.98, 0.02])  # 2% anomalies
        readings.append(base + noise + anomaly)
    return pd.Series(readings)

sensors_df = (
    spark.range(0, 100000, numPartitions=32)
    .select(
        F.concat(F.lit("SENSOR-"), (F.col("id") % 50).cast("string"))
            .alias("sensor_id"),
        F.from_unixtime(
            F.lit(int(START_DATE.timestamp())) + F.col("id") * 60
        ).cast("timestamp").alias("reading_time"),
    )
    .withColumn("temperature_c", sensor_reading(F.col("reading_time")))
)
```

Sinusoidal base patterns with Gaussian noise produce realistic time-series data. The 2% anomaly injection gives data quality tools something to catch.
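To see that the injected anomalies are catchable, a minimal detector sketch (the 5-degree threshold is an assumption that sits between the 0.5-degree noise and the 15-degree spikes):

```python
import math

# Recompute the expected daily-cycle baseline and flag large residuals
flagged = (
    sensors_df
    .withColumn(
        "expected",
        F.lit(22.0) + 3.0 * F.sin(F.hour("reading_time") * F.lit(math.pi) / 12),
    )
    .withColumn("is_anomaly", F.abs(F.col("temperature_c") - F.col("expected")) > 5)
)
flagged.groupBy("is_anomaly").count().show()  # roughly 2% flagged
```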
Output to Volumes as Parquet
“Write the generated data to a Volume as Parquet files for pipeline ingestion.”
```python
# Parquet to a Volume — ready for SDP pipeline ingestion
customers_df.write.mode("overwrite").parquet(
    f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/customers"
)

orders_df.write.mode("overwrite").parquet(
    f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/orders"
)
```

Parquet in Volumes is the standard landing zone for SDP pipeline ingestion via Auto Loader. Delta tables are better when downstream consumers query directly.
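On the consuming side, a sketch of what the Auto Loader read might look like in the downstream pipeline (the `_schema` location is an arbitrary choice, and this belongs in the pipeline, not in the generation script):

```python
# Hypothetical downstream ingestion: Auto Loader streaming from the Volume
raw_customers = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/_schema")
    .load(f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/customers")
)
```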
Watch Out For
- Never use `.cache()` or `.persist()` on serverless — throws `AnalysisException: PERSIST TABLE is not supported on serverless compute`. Write master tables to Delta first, then read them back for joins.
- Always ask for catalog and schema — never default to any catalog value. The user must confirm the target location before generation begins.
- `F.window` vs `Window` — `F.window()` is for streaming time windows. For analytical functions like `row_number()` or `rank()`, import `from pyspark.sql.window import Window` (see the sketch after this list).
- Faker inside UDFs must re-import — each Spark partition runs in its own Python process. Put `from faker import Faker` inside the UDF body, not at module level.
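For instance, ranking orders per customer uses the analytical `Window`, not `F.window()`. A minimal sketch against the orders table above:

```python
from pyspark.sql.window import Window

# Most recent order per customer via row_number over an analytical window
w = Window.partitionBy("customer_id").orderBy(F.col("order_date").desc())
latest_orders = (
    spark.read.table(f"{CATALOG}.{SCHEMA}.orders")
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)
```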