Data Profiling
Skill: databricks-unity-catalog
What You Can Build
Data Profiling attaches quality monitors to your Unity Catalog tables. A monitor automatically computes column-level statistics, detects drift between time windows, and tracks ML model performance over time. The output lands in two Delta tables (profile metrics and drift metrics) that you can query, alert on, or visualize in dashboards. Instead of writing custom data quality checks, you configure a monitor and let it run on a schedule.
In Action
“Using Python, create a TimeSeries quality monitor on an orders table that tracks daily and weekly drift, sliced by region.”
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.dataquality import (
    Monitor,
    DataProfilingConfig,
    TimeSeriesConfig,
    AggregationGranularity,
)
w = WorkspaceClient()
table_info = w.tables.get("analytics.gold.orders")
schema_info = w.schemas.get(f"{table_info.catalog_name}.{table_info.schema_name}")
monitor = w.data_quality.create_monitor(
    monitor=Monitor(
        object_type="table",
        object_id=table_info.table_id,
        data_profiling_config=DataProfilingConfig(
            assets_dir="/Workspace/Users/you@company.com/monitoring/orders",
            output_schema_id=schema_info.schema_id,
            time_series=TimeSeriesConfig(
                timestamp_column="order_date",
                granularities=[
                    AggregationGranularity.AGGREGATION_GRANULARITY_1_DAY,
                    AggregationGranularity.AGGREGATION_GRANULARITY_1_WEEK,
                ],
            ),
        ),
    ),
)

Key decisions:
- TimeSeries monitors need a timestamp column and at least one granularity; the granularity controls the time window for statistical comparisons.
- The monitor creates two output tables, orders_profile_metrics and orders_drift_metrics, in the output schema.
- Profile types: Snapshot (no time column needed, general-purpose), TimeSeries (drift over time), InferenceLog (ML model quality).
- The API uses object_id (a UUID), not table names; look up the UUID via w.tables.get() first.
More Patterns
Snapshot monitor for a general-purpose table
“Using Python, create a snapshot monitor for a table that doesn’t have a timestamp column.”
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.dataquality import (
    Monitor,
    DataProfilingConfig,
    SnapshotConfig,
)
w = WorkspaceClient()
table_info = w.tables.get("analytics.gold.products")
schema_info = w.schemas.get(f"{table_info.catalog_name}.{table_info.schema_name}")
monitor = w.data_quality.create_monitor(
    monitor=Monitor(
        object_type="table",
        object_id=table_info.table_id,
        data_profiling_config=DataProfilingConfig(
            assets_dir="/Workspace/Users/you@company.com/monitoring/products",
            output_schema_id=schema_info.schema_id,
            snapshot=SnapshotConfig(),
        ),
    ),
)

Snapshot monitors compute point-in-time statistics (null counts, distinct values, distributions) without requiring a time column. They are a good fit for dimension tables and reference data where you want to catch unexpected changes.
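To make the snapshot statistics concrete, here is a minimal pure-Python sketch of the kind of per-column statistics a snapshot refresh computes. The percent_null and distinct_count names mirror columns in the profile metrics output table; the function itself is illustrative, not part of the SDK.

```python
def profile_column(values):
    """Illustrative point-in-time profile of one column's values.

    Mirrors the percent_null and distinct_count columns that appear
    in the monitor's profile metrics output table.
    """
    n = len(values)
    nulls = sum(v is None for v in values)
    return {
        "percent_null": 100.0 * nulls / n if n else 0.0,
        "distinct_count": len({v for v in values if v is not None}),
    }
```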
ML model monitoring with InferenceLog
“Using Python, monitor a classification model’s prediction quality, tracking drift by model version.”
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.dataquality import (
    Monitor,
    DataProfilingConfig,
    InferenceLogConfig,
    AggregationGranularity,
    InferenceProblemType,
)
w = WorkspaceClient()
table_info = w.tables.get("ml.serving.churn_predictions")
schema_info = w.schemas.get(f"{table_info.catalog_name}.{table_info.schema_name}")
monitor = w.data_quality.create_monitor(
    monitor=Monitor(
        object_type="table",
        object_id=table_info.table_id,
        data_profiling_config=DataProfilingConfig(
            assets_dir="/Workspace/Users/you@company.com/monitoring/churn_model",
            output_schema_id=schema_info.schema_id,
            inference_log=InferenceLogConfig(
                timestamp_column="prediction_timestamp",
                granularities=[
                    AggregationGranularity.AGGREGATION_GRANULARITY_1_HOUR,
                ],
                model_id_column="model_version",
                problem_type=InferenceProblemType.INFERENCE_PROBLEM_TYPE_CLASSIFICATION,
                prediction_column="prediction",
                label_column="actual_label",
            ),
        ),
    ),
)

InferenceLog monitors track prediction distributions, label drift, and per-model-version quality. The model_id_column lets you compare metrics across model versions side by side.
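The side-by-side comparison that model_id_column enables can be sketched in plain Python. This is a hypothetical helper showing the shape of a per-version classification metric, not code from the SDK or the monitor's actual implementation:

```python
from collections import defaultdict


def accuracy_by_model_version(rows):
    """rows: iterable of (model_version, prediction, actual_label) tuples.

    Returns per-version accuracy -- the kind of metric the monitor
    computes separately for each distinct model_id_column value.
    """
    counts = defaultdict(lambda: [0, 0])  # version -> [correct, total]
    for version, pred, label in rows:
        counts[version][0] += int(pred == label)
        counts[version][1] += 1
    return {v: correct / total for v, (correct, total) in counts.items()}
```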
Schedule a monitor and query its output
“Using Python, create a scheduled monitor that runs daily at noon, then query the drift metrics.”
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.dataquality import (
    Monitor,
    DataProfilingConfig,
    SnapshotConfig,
    CronSchedule,
)
w = WorkspaceClient()
table_info = w.tables.get("analytics.gold.customers")
schema_info = w.schemas.get(f"{table_info.catalog_name}.{table_info.schema_name}")
monitor = w.data_quality.create_monitor(
    monitor=Monitor(
        object_type="table",
        object_id=table_info.table_id,
        data_profiling_config=DataProfilingConfig(
            assets_dir="/Workspace/Users/you@company.com/monitoring/customers",
            output_schema_id=schema_info.schema_id,
            snapshot=SnapshotConfig(),
            schedule=CronSchedule(
                quartz_cron_expression="0 0 12 * * ?",
                timezone_id="UTC",
            ),
        ),
    ),
)

After the first refresh completes, query the output tables directly.
-- Latest profile metrics
SELECT column_name, percent_null, distinct_count, mean, stddev
FROM analytics.gold.customers_profile_metrics
ORDER BY window_end DESC
LIMIT 50;
-- Drift scores between consecutive windows
SELECT column_name, drift_type, drift_statistic, drift_score
FROM analytics.gold.customers_drift_metrics
ORDER BY window_end DESC
LIMIT 50;

The drift metrics table compares consecutive time windows using statistical tests (KS test, chi-squared, etc.) and flags significant changes.
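As an illustration of the kind of test behind drift_statistic, here is a minimal two-sample Kolmogorov-Smirnov statistic in pure Python. This sketches the statistic only; it is not the monitor's implementation, which also applies significance thresholds:

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between the two
    empirical CDFs. Larger values indicate more distributional drift
    between the two time windows being compared."""
    points = sorted(set(sample_a) | set(sample_b))
    max_gap = 0.0
    for x in points:
        cdf_a = sum(v <= x for v in sample_a) / len(sample_a)
        cdf_b = sum(v <= x for v in sample_b) / len(sample_b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap
```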
Watch Out For
- Snapshot monitors have a 4 TB table size limit. For larger tables, switch to a TimeSeries profile type, which processes data in time windows instead of scanning the entire table.
- TimeSeries and InferenceLog only process the last 30 days; older data is not included in metric calculations. If you need a longer history, contact your account team about adjusting the window.
- A monitor refresh stuck in PENDING usually means no SQL warehouse is available to run the profiling queries. Ensure a warehouse is running or configure warehouse_id in the monitor settings.
- Anomaly detection is schema-level, not per-table. Enabling it monitors all tables in the schema at their update frequency, and it requires MANAGE SCHEMA or MANAGE CATALOG privileges plus serverless compute.
- Deleting a monitor does not delete its output tables. The profile and drift metrics tables persist after monitor deletion; clean them up manually if you no longer need the historical data.
- The legacy Python package databricks.lakehouse_monitoring still works for older monitors, but the current API is w.data_quality; use w.data_quality in new code.
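Since the output tables persist after monitor deletion, one cleanup option is to issue DROP statements yourself. The helper below is a hypothetical sketch that builds those statements, assuming the default <table>_profile_metrics / <table>_drift_metrics naming shown earlier; run the resulting SQL with whatever client you normally use.

```python
def cleanup_statements(output_schema, table_name):
    """Build DROP statements for a deleted monitor's leftover output
    tables. Assumes the default _profile_metrics / _drift_metrics
    suffixes in the monitor's output schema."""
    return [
        f"DROP TABLE IF EXISTS {output_schema}.{table_name}_profile_metrics",
        f"DROP TABLE IF EXISTS {output_schema}.{table_name}_drift_metrics",
    ]
```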