Expectations

Skill: databricks-spark-declarative-pipelines

Expectations are data quality constraints you declare directly on pipeline tables. Each row is evaluated against your rules, and you choose the enforcement level: log the violation and keep the row, silently drop it, or halt the pipeline entirely. No separate validation framework, no after-the-fact auditing — quality enforcement happens at write time.
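
The three enforcement levels map directly onto SQL syntax. A minimal sketch (the table, column, and source names here are illustrative, not from the examples below):

```sql
CREATE STREAMING TABLE quality_demo (
  -- No ON VIOLATION clause: warn only. Keep the row, record the violation.
  CONSTRAINT nonneg_quantity EXPECT (quantity >= 0),
  -- DROP ROW: silently discard the row, record the violation.
  CONSTRAINT positive_price EXPECT (price > 0) ON VIOLATION DROP ROW,
  -- FAIL UPDATE: halt the pipeline update on the first violation.
  CONSTRAINT has_sku EXPECT (sku IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(raw_source);
```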

“Add row-level validation to a SQL silver table that drops orders with negative amounts and null customer IDs”

CREATE STREAMING TABLE silver_orders (
  CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  CONSTRAINT valid_customer EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(bronze_orders);

Key decisions:

  • DROP ROW over FAIL UPDATE — for a silver table, you want to keep the pipeline running and route bad data out. Failing the entire pipeline on one malformed row is usually too aggressive unless the data is safety-critical.
  • Named constraints — valid_amount and valid_customer — show up in the pipeline UI monitoring, so you can track violation rates per rule over time.
  • Inline with the table definition — expectations are part of the DDL, not a separate validation step. This means they can’t be skipped or forgotten when someone modifies the query.

“Track how many events have null timestamps without filtering them out, using Python”

@dp.table()
@dp.expect("valid_timestamp", "timestamp IS NOT NULL")
def silver_events():
    return spark.readStream.table("bronze_events")

The expect decorator (without _or_drop or _or_fail) lets every row through but records the violation in pipeline monitoring. Use this when you’re still learning the shape of your data and want visibility before enforcing hard rules.
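The SQL equivalent of warn-only is an EXPECT with no ON VIOLATION clause. A sketch, assuming the same bronze_events source as the Python version:

```sql
CREATE STREAMING TABLE silver_events (
  -- No ON VIOLATION clause: rows are kept, violations are logged.
  CONSTRAINT valid_timestamp EXPECT (timestamp IS NOT NULL)
)
AS SELECT * FROM STREAM(bronze_events);
```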

“Fail the pipeline immediately if any record arrives with a null primary key, in SQL”

CREATE STREAMING TABLE critical_data (
  CONSTRAINT require_id EXPECT (id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(source);

FAIL UPDATE stops the pipeline the moment a violation is detected. Reserve this for invariants that, if broken, mean something is fundamentally wrong upstream — like a null primary key that would corrupt every downstream join.

“Drop rows that fail any of three validation rules on an orders table using Python”

@dp.table()
@dp.expect_all_or_drop({
    "valid_amount": "amount > 0",
    "valid_customer": "customer_id IS NOT NULL",
    "valid_date": "order_date >= '2020-01-01'"
})
def clean_orders():
    return spark.readStream.table("bronze_orders")

expect_all_or_drop is a convenience that applies the same enforcement level to every rule in the dictionary. A row is dropped if it fails any single constraint. Each rule still appears individually in the monitoring dashboard so you can see which one is triggering most often.
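Because the rules are a plain Python dictionary, a common pattern is to define them once and reuse them across tables, choosing a different enforcement level per table. A sketch, assuming the same dp module and bronze_orders source as above (the table names here are illustrative):

```python
# Shared quality rules, defined once at module level.
ORDER_RULES = {
    "valid_amount": "amount > 0",
    "valid_customer": "customer_id IS NOT NULL",
    "valid_date": "order_date >= '2020-01-01'",
}

@dp.table()
@dp.expect_all_or_drop(ORDER_RULES)  # enforce: drop violating rows
def clean_orders():
    return spark.readStream.table("bronze_orders")

@dp.table()
@dp.expect_all(ORDER_RULES)  # observe: keep every row, log violations
def orders_monitored():
    return spark.readStream.table("bronze_orders")
```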

“Warn on missing emails, drop rows with invalid amounts, and fail on null IDs — all on the same table, in Python”

@dp.table()
@dp.expect("has_email", "email IS NOT NULL")
@dp.expect_or_drop("valid_amount", "amount > 0")
@dp.expect_or_fail("has_id", "id IS NOT NULL")
def validated_orders():
    return spark.readStream.table("bronze_orders")

You can stack decorators with different enforcement levels. The pipeline evaluates all of them on each row — a row that fails the expect_or_fail check halts the pipeline regardless of whether it passes the other two.

  • Expectations work on both streaming tables and materialized views — they’re not streaming-only. Apply them anywhere you want row-level quality checks.
  • expect_all variants require a dictionary — expect_all_or_drop({"name": "expr", ...}) in Python. If you pass a single string instead of a dict, you’ll get a confusing type error.
  • Constraint expressions use SQL syntax even in Python — the second argument to @dp.expect is a SQL expression string ("amount > 0"), not a Python expression. Column references follow SQL rules.
  • Dropping rows is silent — DROP ROW doesn’t raise errors or warnings in logs. The only way to see how many rows were dropped is through the pipeline UI or the expectations metrics table. If your drop rate spikes, you won’t know unless you’re monitoring.
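
To watch drop rates outside the UI, you can aggregate per-expectation counts from the pipeline event log. A sketch following the documented flow_progress pattern; the event-log access function and the details schema can differ across releases, so treat the field names here as assumptions to verify against your workspace:

```sql
SELECT
  row_expectations.dataset,
  row_expectations.name,
  SUM(row_expectations.passed_records) AS passed,
  SUM(row_expectations.failed_records) AS failed
FROM (
  -- Each flow_progress event carries a JSON array of per-expectation counts.
  SELECT explode(
    from_json(
      details:flow_progress.data_quality.expectations,
      "array<struct<name string, dataset string, passed_records long, failed_records long>>"
    )
  ) AS row_expectations
  FROM event_log(TABLE(silver_orders))
  WHERE event_type = 'flow_progress'
)
GROUP BY row_expectations.dataset, row_expectations.name;
```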