
Synthetic Data

Skill: databricks-synthetic-data-gen

You can generate realistic, story-driven synthetic datasets — from thousands to millions of rows — using Spark, Faker, and Pandas UDFs on serverless compute. Your AI coding assistant designs the table schema with realistic distributions, referential integrity, and time patterns, then generates a complete script that creates infrastructure, produces data, and writes to Unity Catalog in one execution. The data is ready for downstream pipelines, Genie Spaces, or Knowledge Assistant demos.

“Generate 10,000 customers and 50,000 orders for an e-commerce demo. Enterprise customers should generate 5x more orders with higher amounts. Write to my Unity Catalog.”

from databricks.connect import DatabricksSession, DatabricksEnv
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, DoubleType
import pandas as pd
import numpy as np
CATALOG = "my_catalog"
SCHEMA = "ecommerce_demo"
# Serverless with managed dependencies
env = DatabricksEnv().withDependencies("faker", "pandas", "numpy")
spark = (
    DatabricksSession.builder
    .withEnvironment(env)
    .serverless(True)
    .getOrCreate()
)
# Create infrastructure
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
# Pandas UDFs for realistic data
@F.pandas_udf(StringType())
def fake_name(ids: pd.Series) -> pd.Series:
    # Import inside the UDF body — each partition runs in its own Python process
    from faker import Faker
    fake = Faker()
    return pd.Series([fake.name() for _ in range(len(ids))])

@F.pandas_udf(DoubleType())
def generate_amount(tiers: pd.Series) -> pd.Series:
    amounts = []
    for tier in tiers:
        if tier == "Enterprise":
            amounts.append(float(np.random.lognormal(7.5, 0.8)))
        elif tier == "Pro":
            amounts.append(float(np.random.lognormal(5.5, 0.7)))
        else:
            amounts.append(float(np.random.lognormal(4.0, 0.6)))
    return pd.Series(amounts)
# Generate customers with weighted tier distribution
customers_df = (
    spark.range(0, 10000, numPartitions=16)
    # Draw the tier from a single rand() column — a fresh F.rand() per
    # when() branch would re-roll the draw and skew the 60/15/25 split
    .withColumn("r", F.rand(seed=42))
    .select(
        F.concat(F.lit("CUST-"), F.lpad(F.col("id").cast("string"), 5, "0"))
            .alias("customer_id"),
        fake_name(F.col("id")).alias("name"),
        F.when(F.col("r") < 0.60, "Free")
            .when(F.col("r") < 0.75, "Pro")
            .otherwise("Enterprise").alias("tier"),
    )
    .drop("r")
)
# Write master table first — FK joins require it
customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")

Key decisions:

  • Spark + Faker + Pandas UDFs — Faker runs inside Pandas UDFs for parallel execution across partitions. Pure Python Faker is single-threaded and painfully slow above a few thousand rows.
  • Serverless with DatabricksEnv().withDependencies() — installs Faker and NumPy on serverless compute without managing a cluster. Available on databricks-connect 16.4+.
  • Log-normal distributions — np.random.lognormal() produces realistic price/amount distributions with a long right tail. Enterprise customers center around $1,800, Pro around $245, Free around $55.
  • Weighted tiers via chained F.when() against one F.rand() draw — 60% Free, 15% Pro, 25% Enterprise. Compare every branch to the same random column (a fresh F.rand() per when() re-rolls the draw and skews the mix); adjust the thresholds to shift it.
  • Master table written to Delta first — child tables join against this. Never use .cache() or .persist() on serverless; it throws AnalysisException.
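The dollar figures in the log-normal bullet follow directly from the parameters: the median of lognormal(μ, σ) is e^μ, and σ only widens the right tail. A quick local check in plain NumPy (no Spark needed):

```python
import numpy as np

# Median of a log-normal distribution is exp(mu); sigma controls tail width
for tier, mu, sigma in [("Enterprise", 7.5, 0.8), ("Pro", 5.5, 0.7), ("Free", 4.0, 0.6)]:
    theoretical = np.exp(mu)
    sampled = np.median(np.random.lognormal(mu, sigma, 200_000))
    print(f"{tier}: theoretical median ${theoretical:,.0f}, sampled ${sampled:,.0f}")
```

Enterprise lands near $1,808, Pro near $245, Free near $55 — matching the stated targets. To hit a different price point, pick μ = ln(desired median).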

“Generate 50,000 orders that reference the customer table with realistic date spreads.”

from datetime import datetime, timedelta
END_DATE = datetime.now()
START_DATE = END_DATE - timedelta(days=180)
# Read back the master table for FK joins
customers = spark.read.table(f"{CATALOG}.{SCHEMA}.customers")
# Generate orders with valid customer_id references
orders_df = (
    spark.range(0, 50000, numPartitions=32)
    # One shared rand() draw for the status thresholds, same as the tiers
    .withColumn("r", F.rand(seed=7))
    .select(
        F.col("id"),
        F.concat(F.lit("ORD-"), F.lpad(F.col("id").cast("string"), 6, "0"))
            .alias("order_id"),
        F.date_add(F.lit(START_DATE.date()), (F.rand() * 180).cast("int"))
            .alias("order_date"),
        F.when(F.col("r") < 0.65, "delivered")
            .when(F.col("r") < 0.80, "shipped")
            .when(F.col("r") < 0.90, "processing")
            .when(F.col("r") < 0.95, "pending")
            .otherwise("cancelled").alias("status"),
    )
    # Round-robin assignment keyed on the contiguous range id —
    # monotonically_increasing_id() is NOT contiguous, so modulo
    # arithmetic against it would silently drop most rows
    .join(
        customers.select(
            "customer_id",
            "tier",
            F.substring("customer_id", 6, 5).cast("long").alias("cust_row"),
        ),
        F.col("id") % customers.count() == F.col("cust_row"),
    )
    .withColumn("amount", generate_amount(F.col("tier")))
    .drop("id", "cust_row", "tier")
)
orders_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.orders")

Write the master table to Delta first, then read it back for FK joins. This pattern avoids .cache() (which fails on serverless) and guarantees referential integrity across tables.
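The round-robin arithmetic behind the join can be sanity-checked without Spark. A minimal plain-Python sketch with illustrative sizes: order i references customer i mod n, so every foreign key resolves and load spreads evenly.

```python
from collections import Counter

# Round-robin FK assignment: order i -> customer (i % n_customers)
n_customers, n_orders = 10, 35
customer_ids = [f"CUST-{i:05d}" for i in range(n_customers)]
assignments = [customer_ids[i % n_customers] for i in range(n_orders)]

# Referential integrity: every assigned key exists in the customer table
assert set(assignments) <= set(customer_ids)

# Even spread: each customer gets floor or ceil of n_orders / n_customers
print(sorted(Counter(assignments).values()))  # → [3, 3, 3, 3, 3, 4, 4, 4, 4, 4]
```

If you want a skewed rather than uniform assignment (e.g. heavy customers), replace the modulo with a weighted random pick — but then verify integrity the same way.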

“Generate 100,000 rows of IoT sensor readings with realistic timestamp patterns and occasional anomalies.”

@F.pandas_udf(DoubleType())
def sensor_reading(timestamps: pd.Series) -> pd.Series:
    readings = []
    for ts in timestamps:
        base = 22.0 + 3.0 * np.sin(ts.hour * np.pi / 12)  # daily cycle
        noise = np.random.normal(0, 0.5)
        anomaly = np.random.choice([0, 15], p=[0.98, 0.02])  # 2% anomalies
        readings.append(base + noise + anomaly)
    return pd.Series(readings)

sensors_df = (
    spark.range(0, 100000, numPartitions=32)
    .select(
        F.concat(F.lit("SENSOR-"), (F.col("id") % 50).cast("string"))
            .alias("sensor_id"),
        F.from_unixtime(
            F.lit(int(START_DATE.timestamp())) + F.col("id") * 60
        ).cast("timestamp").alias("reading_time"),
    )
    .withColumn("temperature_c", sensor_reading(F.col("reading_time")))
)

Sinusoidal base patterns with Gaussian noise produce realistic time-series data. The 2% anomaly injection gives data quality tools something to catch.
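The same signal model can be previewed locally before spending serverless compute. A NumPy-only sketch (sizes and seed are illustrative) showing the base stays within the 19–25 °C daily band while anomalies spike well outside it:

```python
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(100_000) / 60 % 24            # one reading per minute, wrapped to hour-of-day
base = 22.0 + 3.0 * np.sin(hours * np.pi / 12)  # daily cycle: 19–25 °C
noise = rng.normal(0, 0.5, hours.size)
anomaly = rng.choice([0.0, 15.0], size=hours.size, p=[0.98, 0.02])
temps = base + noise + anomaly

print(f"anomaly rate: {(anomaly > 0).mean():.3f}")  # ~2% of readings
print(f"normal range: {temps[anomaly == 0].min():.1f}–{temps[anomaly == 0].max():.1f} °C")
```

Anomalous readings land around 34–40 °C, far enough from the normal band that expectation-based data quality rules will flag them reliably.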

“Write the generated data to a Volume as Parquet files for pipeline ingestion.”

# Parquet to a Volume — ready for SDP pipeline ingestion
customers_df.write.mode("overwrite").parquet(
    f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/customers"
)
orders_df.write.mode("overwrite").parquet(
    f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/orders"
)

Parquet in Volumes is the standard landing zone for SDP pipeline ingestion via Auto Loader. Delta tables are better when downstream consumers query directly.

  • Never use .cache() or .persist() on serverless — throws AnalysisException: PERSIST TABLE is not supported on serverless compute. Write master tables to Delta first, then read them back for joins.
  • Always ask for catalog and schema — never default to any catalog value. The user must confirm the target location before generation begins.
  • F.window vs Window — F.window() is for streaming time windows. For analytical functions like row_number() or rank(), use from pyspark.sql.window import Window.
  • Faker inside UDFs must re-import — each Spark partition runs in its own Python process. Put from faker import Faker inside the UDF body, not at module level.