Medallion Pipeline with CI/CD

Build a production-ready medallion architecture (Bronze, Silver, Gold) using Spark Declarative Pipelines, schedule it with a Databricks Job, and package everything as an Asset Bundle for CI/CD deployment through GitHub Actions.

Skills used: databricks-spark-declarative-pipelines, databricks-jobs, databricks-aibi-dashboards, databricks-bundles

MCP tools used: create_or_update_pipeline, start_update, manage_jobs, create_or_update_dashboard, execute_sql

Prerequisites:

  • A Databricks workspace with Unity Catalog enabled
  • Source data files in a Volume (CSV, JSON, or Parquet)
  • A GitHub repository for the Asset Bundle code
  • GitHub Actions configured with Databricks OIDC authentication
  1. Create the medallion pipeline

    Build a three-layer SDP pipeline with ingestion, cleaning, and aggregation.

    Build a medallion architecture pipeline with:
    - Bronze: Auto Loader ingestion from /Volumes/main/raw/csv_files
    - Silver: cleaned and deduplicated data with expectations (non-null ID,
    valid dates, positive amounts)
    - Gold: aggregated daily summary materialized view
    Name it "sales_medallion" targeting the main catalog.
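    The three layers above can be sketched as a declarative pipeline in Python. This is a minimal sketch, not the generated pipeline: the column names (`id`, `sale_date`, `amount`, `product_category`) are assumptions about the source CSVs, and the table names are illustrative.

    ```python
    import dlt
    from pyspark.sql import functions as F

    @dlt.table(name="bronze_sales", comment="Raw CSVs ingested with Auto Loader")
    def bronze_sales():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("cloudFiles.inferColumnTypes", "true")
            .load("/Volumes/main/raw/csv_files")
        )

    @dlt.table(name="silver_sales", comment="Cleaned and deduplicated sales")
    @dlt.expect_or_drop("non_null_id", "id IS NOT NULL")
    @dlt.expect_or_drop("valid_date", "sale_date IS NOT NULL")
    @dlt.expect_or_drop("positive_amount", "amount > 0")
    def silver_sales():
        # For an unbounded stream, pair dropDuplicates with a watermark in practice.
        return dlt.read_stream("bronze_sales").dropDuplicates(["id"])

    @dlt.table(name="gold_daily_summary", comment="Daily revenue summary")
    def gold_daily_summary():
        # Batch read of the Silver table gives materialized-view semantics.
        return (
            dlt.read("silver_sales")
            .groupBy("sale_date", "product_category")
            .agg(
                F.sum("amount").alias("total_revenue"),
                F.count("*").alias("order_count"),
            )
        )
    ```

    `expect_or_drop` silently drops failing rows at this stage; step 2 replaces that behavior with a quarantine table so bad records remain inspectable.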
  2. Add data quality expectations

    Strengthen the pipeline with quality gates that quarantine bad records.

    Update my "sales_medallion" pipeline to add expectations that quarantine
    bad records to a separate table instead of failing the pipeline. Track
    null rates and schema violations.
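    One common way to quarantine instead of drop is to flag each record against the rule set in a view, then split it into a clean table and a quarantine table. A sketch of that pattern, reusing the assumed columns from step 1:

    ```python
    import dlt
    from pyspark.sql import functions as F

    # Same rules as the expectations; any failure routes the row to quarantine.
    RULES = {
        "non_null_id": "id IS NOT NULL",
        "valid_date": "sale_date IS NOT NULL",
        "positive_amount": "amount > 0",
    }
    QUARANTINE_COND = " OR ".join(f"NOT ({rule})" for rule in RULES.values())

    @dlt.view
    def silver_flagged():
        return dlt.read_stream("bronze_sales").withColumn(
            "is_quarantined", F.expr(QUARANTINE_COND)
        )

    @dlt.table(name="silver_sales")
    def silver_sales():
        return (
            dlt.read_stream("silver_flagged")
            .filter("NOT is_quarantined")
            .drop("is_quarantined")
        )

    @dlt.table(name="silver_quarantine", comment="Bad records held for review")
    def silver_quarantine():
        return dlt.read_stream("silver_flagged").filter("is_quarantined")
    ```

    Null rates and schema violations can then be tracked by querying `silver_quarantine` (or the pipeline event log) rather than re-running the pipeline.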
  3. Run and validate the pipeline

    Trigger a full refresh and verify the data flows correctly.

    Start a full refresh on the pipeline "sales_medallion" and monitor it
    until completion. Then show me row counts for each layer
    (bronze, silver, gold).
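    The row-count check amounts to a query like the following, assuming the tables land in a `sales` schema of the `main` catalog:

    ```sql
    SELECT 'bronze' AS layer, COUNT(*) AS row_count FROM main.sales.bronze_sales
    UNION ALL
    SELECT 'silver', COUNT(*) FROM main.sales.silver_sales
    UNION ALL
    SELECT 'gold',   COUNT(*) FROM main.sales.gold_daily_summary;
    ```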
  4. Schedule with a Databricks Job

    Wrap the pipeline in a Job with monitoring and retry logic.

    Create a Databricks job that:
    1. Triggers the "sales_medallion" pipeline on an hourly schedule
    2. Runs a validation notebook after the pipeline completes
    3. Retries failed tasks up to 2 times
    4. Sends email notifications on failure to team@company.com
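    Expressed as an Asset Bundle job resource (used again in step 6), the schedule, retries, and notifications might look like this. The cron expression fires at the top of every hour; the notebook path and resource keys are assumptions.

    ```yaml
    resources:
      jobs:
        sales_medallion_job:
          name: sales_medallion_job
          schedule:
            quartz_cron_expression: "0 0 * * * ?"   # hourly, at minute 0
            timezone_id: UTC
          email_notifications:
            on_failure:
              - team@company.com
          tasks:
            - task_key: run_pipeline
              pipeline_task:
                pipeline_id: ${resources.pipelines.sales_medallion.id}
            - task_key: validate
              depends_on:
                - task_key: run_pipeline
              notebook_task:
                notebook_path: ../src/validate.py
              max_retries: 2
              min_retry_interval_millis: 60000
    ```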
  5. Build an analytics dashboard

    Visualize the Gold layer with an AI/BI dashboard.

    Create an AI/BI dashboard with a dataset querying the gold layer and add:
    - A counter widget showing total revenue
    - A bar chart showing revenue by product category
    - A line chart showing daily revenue trends over the last 90 days
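    All three widgets can share one dashboard dataset over the Gold table; the counter sums `total_revenue`, the bar chart groups it by `product_category`, and the line chart plots it by `sale_date`. A sketch of that dataset query, assuming the step 1 schema:

    ```sql
    SELECT sale_date,
           product_category,
           total_revenue,
           order_count
    FROM main.sales.gold_daily_summary
    WHERE sale_date >= date_sub(current_date(), 90);
    ```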
  6. Package as a Databricks Asset Bundle

    Define the entire stack as infrastructure-as-code for repeatable deployment.

    Create a Databricks Asset Bundle that packages:
    - The "sales_medallion" pipeline
    - The scheduling job with notifications
    - The analytics dashboard
    With separate dev and prod targets using different catalogs and warehouse IDs.
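    A minimal `databricks.yml` for this layout might look like the following. The hostnames and warehouse IDs are placeholders; per-target variables let the same resource definitions deploy against different catalogs and warehouses.

    ```yaml
    bundle:
      name: sales_medallion

    include:
      - resources/*.yml

    variables:
      catalog:
        description: Target catalog for the pipeline
      warehouse_id:
        description: SQL warehouse backing the dashboard

    targets:
      dev:
        mode: development
        default: true
        workspace:
          host: https://<dev-workspace>.cloud.databricks.com
        variables:
          catalog: main_dev
          warehouse_id: <dev-warehouse-id>
      prod:
        mode: production
        workspace:
          host: https://<prod-workspace>.cloud.databricks.com
        variables:
          catalog: main
          warehouse_id: <prod-warehouse-id>
    ```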
What you end up with:

  • SDP pipeline with Bronze (Auto Loader), Silver (cleaned + deduplicated), and Gold (aggregated) layers
  • Data quality expectations that quarantine bad records for review
  • Databricks Job running hourly with retry logic and failure alerts
  • AI/BI dashboard visualizing the Gold layer
  • Asset Bundle (databricks.yml) ready for CI/CD deployment to dev and prod
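The GitHub Actions side of the CI/CD deployment can be sketched as a workflow that deploys the bundle on pushes to `main`, authenticating via the OIDC federation set up in the prerequisites. The variable names and CLI auth type are assumptions based on Databricks' workload identity federation setup; adjust to your configuration.

```yaml
name: deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write   # required for OIDC token exchange
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - name: Deploy bundle to prod
        run: databricks bundle deploy -t prod
        env:
          DATABRICKS_HOST: ${{ vars.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ vars.DATABRICKS_CLIENT_ID }}
          DATABRICKS_AUTH_TYPE: github-oidc
```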