Data Profiling

Skill: databricks-unity-catalog

Data Profiling attaches quality monitors to your Unity Catalog tables. A monitor automatically computes column-level statistics, detects drift between time windows, and tracks ML model performance over time. The output lands in two Delta tables (profile metrics and drift metrics) that you can query, alert on, or visualize in dashboards. Instead of writing custom data quality checks, you configure a monitor and let it run on a schedule.

“Using Python, create a TimeSeries quality monitor on an orders table that tracks daily and weekly drift, sliced by region.”

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.dataquality import (
    Monitor, DataProfilingConfig, TimeSeriesConfig, AggregationGranularity,
)

w = WorkspaceClient()

# The API takes UUIDs, not names: resolve the table and its schema first.
table_info = w.tables.get("analytics.gold.orders")
schema_info = w.schemas.get(f"{table_info.catalog_name}.{table_info.schema_name}")

monitor = w.data_quality.create_monitor(
    monitor=Monitor(
        object_type="table",
        object_id=table_info.table_id,
        data_profiling_config=DataProfilingConfig(
            assets_dir="/Workspace/Users/you@company.com/monitoring/orders",
            output_schema_id=schema_info.schema_id,
            # TimeSeries profile: compare statistics across time windows.
            time_series=TimeSeriesConfig(
                timestamp_column="order_date",
                granularities=[
                    AggregationGranularity.AGGREGATION_GRANULARITY_1_DAY,
                    AggregationGranularity.AGGREGATION_GRANULARITY_1_WEEK,
                ],
            ),
        ),
    ),
)

Key decisions:

  • TimeSeries monitors need a timestamp column and at least one granularity — the granularity controls the time window for statistical comparisons
  • The monitor creates two output tables: orders_profile_metrics and orders_drift_metrics in the output schema
  • Profile types: Snapshot (no time column needed, general-purpose), TimeSeries (drift over time), InferenceLog (ML model quality)
  • The API uses object_id (UUID), not table names — look up UUIDs via w.tables.get() first
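The UUID lookup in the example above follows a simple pattern: split the table's three-level name to derive the schema's two-level name, then resolve both to UUIDs. A minimal sketch of the name handling (the `w.tables.get` / `w.schemas.get` calls themselves need a live workspace):

```python
# Derive the two-level schema name from a three-level table name.
full_name = "analytics.gold.orders"
catalog, schema, table = full_name.split(".")
schema_full_name = f"{catalog}.{schema}"

# With a live workspace, the UUIDs then come from:
#   w.tables.get(full_name).table_id           -> object_id
#   w.schemas.get(schema_full_name).schema_id  -> output_schema_id
print(schema_full_name)  # analytics.gold
```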

Snapshot monitor for a general-purpose table

“Using Python, create a snapshot monitor for a table that doesn’t have a timestamp column.”

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.dataquality import (
    Monitor, DataProfilingConfig, SnapshotConfig,
)

w = WorkspaceClient()
table_info = w.tables.get("analytics.gold.products")
schema_info = w.schemas.get(f"{table_info.catalog_name}.{table_info.schema_name}")

monitor = w.data_quality.create_monitor(
    monitor=Monitor(
        object_type="table",
        object_id=table_info.table_id,
        data_profiling_config=DataProfilingConfig(
            assets_dir="/Workspace/Users/you@company.com/monitoring/products",
            output_schema_id=schema_info.schema_id,
            snapshot=SnapshotConfig(),
        ),
    ),
)

Snapshot monitors compute point-in-time statistics — null counts, distinct values, distributions — without requiring a time column. Good for dimension tables and reference data where you want to catch unexpected changes.
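To make those statistics concrete, here is a toy illustration (not the monitor's implementation) of two columns the profile metrics table reports, percent_null and distinct_count:

```python
# Toy column of values, including nulls.
values = ["a", None, "b", "a", None]

# percent_null: share of rows where the value is missing.
percent_null = 100 * sum(v is None for v in values) / len(values)

# distinct_count: number of unique non-null values.
distinct_count = len({v for v in values if v is not None})

print(percent_null, distinct_count)  # 40.0 2
```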

InferenceLog monitor for a classification model

“Using Python, monitor a classification model’s prediction quality, tracking drift by model version.”

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.dataquality import (
    Monitor, DataProfilingConfig, InferenceLogConfig,
    AggregationGranularity, InferenceProblemType,
)

w = WorkspaceClient()
table_info = w.tables.get("ml.serving.churn_predictions")
schema_info = w.schemas.get(f"{table_info.catalog_name}.{table_info.schema_name}")

monitor = w.data_quality.create_monitor(
    monitor=Monitor(
        object_type="table",
        object_id=table_info.table_id,
        data_profiling_config=DataProfilingConfig(
            assets_dir="/Workspace/Users/you@company.com/monitoring/churn_model",
            output_schema_id=schema_info.schema_id,
            inference_log=InferenceLogConfig(
                timestamp_column="prediction_timestamp",
                granularities=[
                    AggregationGranularity.AGGREGATION_GRANULARITY_1_HOUR,
                ],
                model_id_column="model_version",
                problem_type=InferenceProblemType.INFERENCE_PROBLEM_TYPE_CLASSIFICATION,
                prediction_column="prediction",
                label_column="actual_label",
            ),
        ),
    ),
)

InferenceLog monitors track prediction distributions, label drift, and per-model-version quality. The model_id_column lets you compare metrics across model versions side by side.
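Conceptually, the per-version comparison is a group-by on the model_id_column. A toy sketch (not the monitor's actual computation) of classification accuracy per model version:

```python
from collections import defaultdict

# (model_version, prediction, actual_label) rows from an inference log.
rows = [
    ("v1", 1, 1), ("v1", 0, 1),
    ("v2", 1, 1), ("v2", 0, 0),
]

# Tally correct predictions and totals per model version.
tally = defaultdict(lambda: [0, 0])
for version, pred, label in rows:
    tally[version][0] += int(pred == label)
    tally[version][1] += 1

accuracy = {v: correct / total for v, (correct, total) in tally.items()}
print(accuracy)  # {'v1': 0.5, 'v2': 1.0}
```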

Scheduled monitor with drift queries

“Using Python, create a scheduled monitor that runs daily at noon, then query the drift metrics.”

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.dataquality import (
    Monitor, DataProfilingConfig, SnapshotConfig, CronSchedule,
)

w = WorkspaceClient()
table_info = w.tables.get("analytics.gold.customers")
schema_info = w.schemas.get(f"{table_info.catalog_name}.{table_info.schema_name}")

monitor = w.data_quality.create_monitor(
    monitor=Monitor(
        object_type="table",
        object_id=table_info.table_id,
        data_profiling_config=DataProfilingConfig(
            assets_dir="/Workspace/Users/you@company.com/monitoring/customers",
            output_schema_id=schema_info.schema_id,
            snapshot=SnapshotConfig(),
            schedule=CronSchedule(
                quartz_cron_expression="0 0 12 * * ?",
                timezone_id="UTC",
            ),
        ),
    ),
)
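The schedule uses Quartz cron syntax, which has six required fields (seconds first) plus an optional year, with ? meaning "no specific value". A quick sketch of how "0 0 12 * * ?" decomposes:

```python
expr = "0 0 12 * * ?"
labels = ["second", "minute", "hour", "day_of_month", "month", "day_of_week"]
schedule = dict(zip(labels, expr.split()))
print(schedule["hour"])  # 12 -> fires daily at 12:00 in the configured timezone
```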

After the first refresh completes, query the output tables directly.

-- Latest profile metrics
SELECT column_name, percent_null, distinct_count, mean, stddev
FROM analytics.gold.customers_profile_metrics
ORDER BY window_end DESC
LIMIT 50;

-- Drift scores between consecutive windows
SELECT column_name, drift_type, drift_statistic, drift_score
FROM analytics.gold.customers_drift_metrics
ORDER BY window_end DESC
LIMIT 50;

The drift metrics table compares consecutive time windows using statistical tests (KS test, chi-squared, etc.) and flags significant changes.
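As an illustration of what a drift statistic measures (not the monitor's implementation), the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical CDFs of two time windows:

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: max gap between empirical CDFs."""
    def ecdf(sample, x):
        # Fraction of sample values <= x.
        return sum(v <= x for v in sample) / len(sample)
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Identical windows: no drift.
print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0
# Disjoint windows: maximal drift.
print(ks_statistic([1, 2], [3, 4]))  # 1.0
```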

  • Snapshot monitors have a 4TB table size limit — for tables larger than 4TB, switch to a TimeSeries profile type that processes data in time windows instead of scanning the entire table.
  • TimeSeries and InferenceLog only process the last 30 days — older data is not included in metric calculations. If you need longer history, contact your account team about adjusting the window.
  • Monitor refresh gets stuck in PENDING — this usually means no SQL warehouse is available to run the profiling queries. Ensure a warehouse is running or configure warehouse_id in the monitor settings.
  • Anomaly detection is schema-level, not per-table — enabling anomaly detection monitors all tables in the schema at their update frequency. This requires MANAGE SCHEMA or MANAGE CATALOG privileges and serverless compute.
  • Deleting a monitor does not delete output tables — the profile and drift metrics tables persist after monitor deletion. Clean them up manually if you no longer need the historical data.
  • The Python SDK package is still databricks.lakehouse_monitoring for legacy operations — the current API uses w.data_quality. Both work, but new code should use w.data_quality.
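For the cleanup point above: when the output schema is the monitored table's own schema (as in these examples), the leftover table names are predictable, so the DROP statements can be built mechanically. A small sketch:

```python
monitored = "analytics.gold.customers"
catalog, schema, table = monitored.split(".")

# Output tables follow the <table>_profile_metrics / <table>_drift_metrics pattern.
drops = [
    f"DROP TABLE IF EXISTS {catalog}.{schema}.{table}_{suffix};"
    for suffix in ("profile_metrics", "drift_metrics")
]
print(drops[0])  # DROP TABLE IF EXISTS analytics.gold.customers_profile_metrics;
```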