Data Modeling & Best Practices
Skill: databricks-dbsql
What You Can Build
Good data modeling and clustering configuration are the difference between sub-second and multi-minute dashboard loads. Star schemas in the Gold layer, Liquid Clustering on the right columns, and a few optimization habits give you consistently fast analytics without manual tuning. The patterns here apply to any Databricks SQL workload — dashboards, ad-hoc queries, and materialized views.
In Action
“Write SQL that creates a star schema with a fact table and a customer dimension table, both using Liquid Clustering and proper constraints.”
```sql
CREATE TABLE gold.sales.fact_orders (
  order_key BIGINT GENERATED ALWAYS AS IDENTITY,
  customer_key BIGINT NOT NULL,
  product_key BIGINT NOT NULL,
  date_key INT NOT NULL,
  order_amount DECIMAL(18,2),
  quantity INT,
  CONSTRAINT fk_customer FOREIGN KEY (customer_key) REFERENCES gold.sales.dim_customer(customer_key),
  CONSTRAINT fk_product FOREIGN KEY (product_key) REFERENCES gold.sales.dim_product(product_key)
) CLUSTER BY (date_key, customer_key);
```
```sql
CREATE TABLE gold.sales.dim_customer (
  customer_key BIGINT GENERATED ALWAYS AS IDENTITY,
  customer_id STRING NOT NULL COMMENT 'Natural key from source system',
  full_name STRING,
  email STRING,
  city STRING,
  state STRING,
  segment STRING,
  effective_start_date TIMESTAMP,
  effective_end_date TIMESTAMP,
  is_current BOOLEAN,
  CONSTRAINT pk_customer PRIMARY KEY (customer_key)
) CLUSTER BY (customer_key, segment);
```

Key decisions:
- `GENERATED ALWAYS AS IDENTITY` creates integer surrogate keys — faster joins than string keys
- `DECIMAL(18,2)` for financial amounts avoids floating-point precision errors
- `FOREIGN KEY` constraints inform the optimizer about table relationships (they’re not enforced at write time)
- `CLUSTER BY` replaces both partitioning and Z-ORDER with a single, incremental mechanism
- Fact tables cluster by the most commonly filtered foreign keys; dimension tables cluster by primary key plus common filter columns
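The fact table above references a `gold.sales.dim_product` table that isn’t shown. A minimal sketch completing the star schema — columns beyond `product_key` and `product_id` are illustrative, following the same conventions as `dim_customer`:

```sql
-- Hypothetical product dimension referenced by fact_orders;
-- attribute columns (product_name, category, brand) are illustrative.
CREATE TABLE gold.sales.dim_product (
  product_key BIGINT GENERATED ALWAYS AS IDENTITY,
  product_id STRING NOT NULL COMMENT 'Natural key from source system',
  product_name STRING,
  category STRING,
  brand STRING,
  CONSTRAINT pk_product PRIMARY KEY (product_key)
) CLUSTER BY (product_key, category);
```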
More Patterns
Section titled “More Patterns”Choose between star schema and One Big Table
“Help me decide whether to use a star schema or a wide denormalized table for my Gold layer.”
In benchmarks, dimensional models outperformed One Big Tables (2.6s vs 3.5s) despite requiring joins, because fewer files needed scanning. With Liquid Clustering applied, OBT achieved 1.13s. Both approaches hit sub-500ms with automatic caching after the first run.
Use star schemas when you filter on 5+ dimensions or need SCD tracking. Use OBT when you filter on 1-3 dimensions and want simpler governance. The recommended approach: OBT or Data Vault in Silver for fast integration, star schemas in Gold as the curated presentation layer for BI.
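If you do want an OBT presentation table, it can be derived from the star schema rather than modeled from scratch. A hedged sketch using the example tables above (the `category` column assumes a `dim_product` attribute not defined in this section):

```sql
-- Illustrative: flatten the star schema into a clustered OBT.
CREATE OR REPLACE TABLE gold.sales.obt_orders
CLUSTER BY (date_key, segment) AS
SELECT
  f.date_key,
  f.order_amount,
  f.quantity,
  c.segment,
  c.state,
  p.category  -- assumes dim_product carries a category attribute
FROM gold.sales.fact_orders f
JOIN gold.sales.dim_customer c ON f.customer_key = c.customer_key
JOIN gold.sales.dim_product  p ON f.product_key  = p.product_key
WHERE c.is_current;  -- current dimension rows only (SCD2 filter)
```

This keeps the star schema as the source of truth while giving BI tools a join-free table.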
Collect statistics for the optimizer
“Write SQL to compute column statistics on a fact table so the query optimizer can make better join and filter decisions.”
```sql
ANALYZE TABLE gold.sales.fact_orders
COMPUTE STATISTICS FOR COLUMNS date_key, customer_key, product_key;
```

Run `ANALYZE TABLE` after bulk loads, table overwrites, or major data changes. Focus on columns used in WHERE clauses, JOINs, and GROUP BY. Missing statistics lead to suboptimal query plans — the optimizer guesses cardinality instead of knowing it.
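To confirm statistics were actually collected, Spark SQL’s `DESCRIBE` commands surface them — a sketch, assuming the fact table from the examples above:

```sql
-- Per-column stats (min, max, num_nulls, distinct_count) after ANALYZE:
DESCRIBE EXTENDED gold.sales.fact_orders date_key;

-- Table-level stats (row count, size) appear in the Statistics row of:
DESCRIBE TABLE EXTENDED gold.sales.fact_orders;
```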
Enable predictive optimization
“Write SQL to enable automatic OPTIMIZE and VACUUM on a managed table.”
```sql
ALTER TABLE gold.sales.fact_orders
SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');
```

On serverless SQL warehouses, optimizeWrite and autoCompact are managed automatically — you don’t need to set them. For classic compute, enable them explicitly. For Unity Catalog managed tables, predictive optimization handles OPTIMIZE and VACUUM scheduling automatically.
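Predictive optimization can also be switched on explicitly at the schema level, and on classic compute you can run the maintenance commands it replaces by hand. A hedged sketch — `gold.sales` is the example schema, and explicit enablement assumes Unity Catalog managed tables:

```sql
-- Opt a whole schema into predictive optimization (Unity Catalog):
ALTER SCHEMA gold.sales ENABLE PREDICTIVE OPTIMIZATION;

-- Manual equivalent for classic compute, scheduled as a job:
OPTIMIZE gold.sales.fact_orders;
VACUUM gold.sales.fact_orders;  -- default 7-day retention window
```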
Write cost-efficient queries
“Show me the SQL patterns that maximize query result caching and minimize scan volume.”
```sql
-- Use CREATE OR REPLACE instead of DELETE + CREATE
CREATE OR REPLACE TABLE gold.sales.daily_summary AS
SELECT
  date_key,
  SUM(order_amount) AS total_revenue,
  COUNT(DISTINCT customer_key) AS unique_customers
FROM gold.sales.fact_orders
GROUP BY date_key;
```
```sql
-- Use QUALIFY instead of a subquery for window filtering
SELECT *,
  ROW_NUMBER() OVER (PARTITION BY customer_key ORDER BY order_amount DESC) AS rn
FROM gold.sales.fact_orders
QUALIFY rn = 1;
```

`CREATE OR REPLACE TABLE` preserves time travel and avoids reader interruptions. `QUALIFY` eliminates the need for a wrapping subquery. Avoid `NOW()`, `RAND()`, and other non-deterministic functions in filters — they prevent query result caching.
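The caching point is easiest to see side by side. A sketch of the pattern — `order_ts` and the literal boundary are illustrative, not columns defined in this section:

```sql
-- Cache-hostile: NOW() makes every execution a distinct, uncacheable query.
-- SELECT * FROM gold.sales.fact_orders
-- WHERE order_ts > NOW() - INTERVAL 1 DAY;

-- Cache-friendly: resolve the boundary in the client or dashboard
-- parameter so repeated runs submit identical, deterministic SQL.
SELECT date_key, SUM(order_amount) AS total_revenue
FROM gold.sales.fact_orders
WHERE date_key >= 20240101  -- literal supplied by the caller
GROUP BY date_key;
```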
Watch Out For
- Skipping dimensional modeling in the Gold layer — OBTs are fine for Silver, but Gold should use star schemas when you need multi-dimensional analysis. The upfront modeling effort pays for itself in query performance and governance clarity.
- Over-partitioning tables — more than 5,000 partitions degrades performance. Use Liquid Clustering instead, which handles 1-4 keys and adjusts incrementally without reorganizing the entire table.
- Using Python UDFs when native SQL functions exist — UDFs require serialization between Python and Spark, adding significant overhead. Check for a native function first.
- Using FLOAT for financial data — `FLOAT` and `DOUBLE` introduce precision errors in aggregations. Always use `DECIMAL(18,2)` or similar for monetary values.
- Missing PK/FK constraints — even though they’re not enforced at write time, constraints give the optimizer critical information about table relationships and eliminate unnecessary joins.
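The FLOAT-vs-DECIMAL pitfall is easy to demonstrate in any Spark SQL session using the built-in `range` table function:

```sql
-- Summing 0.1 one hundred times: binary floating point drifts,
-- fixed-point DECIMAL stays exact (10.00).
SELECT
  SUM(CAST(0.1 AS DOUBLE))        AS double_sum,   -- may drift slightly from 10.0
  SUM(CAST(0.1 AS DECIMAL(18,2))) AS decimal_sum   -- exactly 10.00
FROM range(100);
```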