
Data Engineering

Skills: databricks-spark-declarative-pipelines, databricks-agent-skill-databricks-pipelines
MCP Tools: create_or_update_pipeline, get_pipeline, start_update, get_update, stop_pipeline, get_pipeline_events

- Create a new Spark Declarative Pipeline called "sales_ingestion" that reads
  JSON files from a cloud storage volume using Auto Loader and writes to a
  streaming table in main.bronze.
- Build a medallion architecture pipeline with:
  - Bronze: Auto Loader ingestion from /Volumes/main/raw/csv_files
  - Silver: cleaned and deduplicated data with expectations
  - Gold: aggregated daily summary materialized view
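The medallion prompt above can be sketched as a Python pipeline source file. This is a minimal sketch, not a definitive implementation: it runs only inside a Databricks pipeline (where `dlt` and `spark` are provided), and every table, column, and schema-location name other than /Volumes/main/raw/csv_files is an illustrative assumption.

```
import dlt
from pyspark.sql import functions as F

@dlt.table(name="bronze_orders", comment="Raw CSV files ingested with Auto Loader")
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        # hypothetical schema-tracking path
        .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/orders")
        .load("/Volumes/main/raw/csv_files")
    )

@dlt.table(name="silver_orders", comment="Cleaned, deduplicated orders")
@dlt.expect_or_drop("valid_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("valid_date", "order_date IS NOT NULL")
def silver_orders():
    # dropDuplicates on a streaming source keeps state per key
    return dlt.read_stream("bronze_orders").dropDuplicates(["order_id"])

@dlt.table(name="gold_daily_summary", comment="Daily summary (materialized view)")
def gold_daily_summary():
    # batch read of the silver table makes this a materialized view
    return (
        dlt.read("silver_orders")
        .groupBy(F.to_date("order_date").alias("day"))
        .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue"))
    )
```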
- Create an SDP pipeline in Python that implements CDC (Change Data Capture)
  from a source table to a target streaming table with SCD Type 2 tracking.
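A hedged sketch of the CDC prompt, using `dlt.apply_changes` (newer runtimes expose the same capability under the Auto CDC name). The source table `main.silver.customers_cdc` and its `customer_id`, `sequence_num`, and `operation` columns are assumptions for illustration; this only runs inside a pipeline.

```
import dlt

# target streaming table that will hold the Type 2 history
dlt.create_streaming_table("customers_scd2")

dlt.apply_changes(
    target="customers_scd2",
    source="main.silver.customers_cdc",          # assumed CDC feed table
    keys=["customer_id"],
    sequence_by="sequence_num",                  # ordering column for changes
    apply_as_deletes="operation = 'DELETE'",
    except_column_list=["operation", "sequence_num"],
    stored_as_scd_type=2,                        # keep full change history
)
```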
- Write a SQL-based Spark Declarative Pipeline that:
  1. Ingests from cloud storage using Auto Loader
  2. Applies data quality expectations (non-null ID, valid dates)
  3. Creates a materialized view with business aggregations
- List all pipelines in my workspace and show their current status.
- Start a full refresh on the pipeline named "sales_ingestion".
- Check the status of the latest update for my pipeline and show any errors.
- Show me recent events and errors for the pipeline "etl_daily" and help me
  debug why it failed.
- Stop the currently running pipeline "streaming_ingest".
- Run a validation (dry-run) on my pipeline to check for errors without
  materializing data.
- Create an SDP pipeline that uses Auto CDC to capture changes from a source
  table and maintain a Type 2 slowly-changing dimension.
- Build a pipeline with data quality expectations that quarantine bad records
  to a separate table instead of failing the pipeline.
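One common quarantine pattern is to flag rows against the rules instead of dropping them, then split the flagged stream into a clean table and a quarantine table. A sketch, assuming an upstream `events_raw` table with `event_id` and `event_date` columns (both names illustrative); runs only inside a pipeline.

```
import dlt
from pyspark.sql import functions as F

RULES = {
    "valid_id": "event_id IS NOT NULL",
    "valid_date": "event_date IS NOT NULL",
}
QUARANTINE_PREDICATE = " AND ".join(f"({r})" for r in RULES.values())

@dlt.table(name="events_validated")
def events_validated():
    # tag each row instead of enforcing the rules as expectations
    return dlt.read_stream("events_raw").withColumn(
        "is_valid", F.expr(QUARANTINE_PREDICATE)
    )

@dlt.table(name="events_clean")
def events_clean():
    return dlt.read_stream("events_validated").where("is_valid").drop("is_valid")

@dlt.table(name="events_quarantine", comment="Rows failing any rule")
def events_quarantine():
    return dlt.read_stream("events_validated").where("NOT is_valid").drop("is_valid")
```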
- Create a streaming table that reads from Kafka using Spark Structured
  Streaming inside an SDP pipeline.

Skills: databricks-spark-structured-streaming
MCP Tools: execute_databricks_command, run_python_file_on_databricks

- Write a Spark Structured Streaming job that reads from Kafka topic "events",
  deserializes JSON payloads, and writes to a Delta table with checkpointing.
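A minimal sketch of the Kafka-to-Delta prompt. The broker address, checkpoint path, payload schema, and target table are placeholder assumptions; the function takes a `SparkSession` so the query can be started from any entry point.

```python
# DDL schema for the assumed JSON payload on the "events" topic
EVENT_SCHEMA = "user_id STRING, action STRING, ts TIMESTAMP"

def start_events_stream(spark,
                        brokers="broker1:9092",                 # placeholder
                        checkpoint="/Volumes/main/chk/events"):  # placeholder
    from pyspark.sql import functions as F
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", brokers)
        .option("subscribe", "events")
        .option("startingOffsets", "earliest")
        .load()
    )
    # Kafka values arrive as bytes; cast to string and parse the JSON
    parsed = raw.select(
        F.from_json(F.col("value").cast("string"), EVENT_SCHEMA).alias("e")
    ).select("e.*")
    return (
        parsed.writeStream.format("delta")
        .option("checkpointLocation", checkpoint)  # exactly-once bookkeeping
        .toTable("main.bronze.events")             # assumed target table
    )
```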
- Build a streaming pipeline with a 10-minute watermark and windowed
  aggregation that computes event counts per 5-minute tumbling window.
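The watermark-plus-window prompt reduces to two calls; a sketch assuming an input stream with an event-time column `ts` and a grouping column `action` (both names illustrative):

```python
def windowed_counts(events):
    """Event counts per 5-minute tumbling window; events more than
    10 minutes late (by the ts column) are dropped from the state."""
    from pyspark.sql import functions as F
    return (
        events
        .withWatermark("ts", "10 minutes")
        .groupBy(F.window("ts", "5 minutes"), "action")
        .count()
    )
```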
- Create a stream-stream join between an "orders" stream and a "shipments"
  stream with a 1-hour time window for matching.
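For the stream-stream join, both sides need watermarks and the join condition needs a time-range constraint so state can be evicted. A sketch; the key and timestamp column names (`order_id`, `order_ts`, `ship_ts`) are assumptions:

```python
def join_orders_shipments(orders, shipments):
    """Inner-join orders to shipments that arrive within 1 hour of the order."""
    from pyspark.sql import functions as F
    o = orders.withWatermark("order_ts", "1 hour").alias("o")
    s = shipments.withWatermark("ship_ts", "2 hours").alias("s")
    return o.join(
        s,
        F.expr(
            "o.order_id = s.order_id AND "
            "s.ship_ts BETWEEN o.order_ts AND o.order_ts + INTERVAL 1 HOUR"
        ),
    )
```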
- Write a streaming job that reads from a Delta table using Change Data Feed
  and writes updates to two sink tables (a summary table and an audit log).
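Writing one stream to two sinks is usually done with `foreachBatch`. A sketch of the Change Data Feed fan-out, assuming CDF is enabled on `main.silver.orders` and that the two target tables exist (all three names are illustrative):

```python
def start_cdf_fanout(spark, checkpoint="/Volumes/main/chk/cdf_fanout"):
    changes = (
        spark.readStream.format("delta")
        .option("readChangeFeed", "true")      # emit CDF rows, not table rows
        .table("main.silver.orders")
    )

    def write_both(batch_df, batch_id):
        batch_df.persist()                      # reused for two writes
        # audit log keeps every change row, including CDF metadata columns
        batch_df.write.mode("append").saveAsTable("main.gold.orders_audit")
        # summary keeps only post-image rows, without the CDF metadata
        (batch_df.where("_change_type IN ('insert', 'update_postimage')")
         .drop("_change_type", "_commit_version", "_commit_timestamp")
         .write.mode("append").saveAsTable("main.gold.orders_summary"))
        batch_df.unpersist()

    return (
        changes.writeStream
        .foreachBatch(write_both)
        .option("checkpointLocation", checkpoint)
        .start()
    )
```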
- Build a streaming pipeline with Real-Time Mode (sub-second latency) for a
  live dashboard use case.
- Create a streaming job with `availableNow` trigger for cost-efficient
  micro-batch processing of accumulated data.
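The `availableNow` trigger processes everything pending in micro-batches and then stops, so the job can run on a schedule instead of continuously. A sketch with placeholder table and checkpoint names:

```python
def run_backfill(spark, checkpoint="/Volumes/main/chk/backfill"):
    return (
        spark.readStream.table("main.bronze.events")   # assumed source table
        .writeStream
        .trigger(availableNow=True)   # drain all available data, then stop
        .option("checkpointLocation", checkpoint)
        .toTable("main.silver.events")                 # assumed target table
    )
```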
- Write a Structured Streaming job with foreachBatch that performs an upsert
  (MERGE) into a Delta table on each micro-batch.
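Streaming writers append by default, so an upsert is done inside `foreachBatch` with a SQL MERGE against each micro-batch. A sketch; the source and target tables and the `user_id` key are assumptions:

```python
def start_upsert_stream(spark, checkpoint="/Volumes/main/chk/upsert"):
    def upsert(batch_df, batch_id):
        # expose the micro-batch to SQL, then MERGE it into the target
        batch_df.createOrReplaceTempView("updates")
        batch_df.sparkSession.sql("""
            MERGE INTO main.silver.users AS t
            USING updates AS s
            ON t.user_id = s.user_id
            WHEN MATCHED THEN UPDATE SET *
            WHEN NOT MATCHED THEN INSERT *
        """)

    return (
        spark.readStream.table("main.bronze.users_updates")  # assumed source
        .writeStream
        .foreachBatch(upsert)
        .option("checkpointLocation", checkpoint)
        .start()
    )
```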

Skills: databricks-zerobus-ingest
MCP Tools: execute_sql, get_table_details

- Build a Python Zerobus Ingest producer that streams events directly into the
  Delta table main.raw.click_events using gRPC.
- Generate a Protobuf schema from my Unity Catalog table
  main.events.user_actions for use with Zerobus Ingest.
- Create a Zerobus Ingest client in TypeScript that writes IoT sensor readings
  to main.iot.sensor_data with ACK handling and retry logic.
- Set up a production Zerobus Ingest pipeline with proper error handling,
  backpressure management, and dead-letter routing.

Skills: spark-python-data-source
MCP Tools: execute_databricks_command, run_python_file_on_databricks

- Build a custom PySpark DataSource that reads data from a REST API, handling
  pagination and authentication, and returns it as a DataFrame.
- Create a Spark DataSource writer that takes a DataFrame and pushes it to an
  external PostgreSQL database.
- Write a streaming DataSource reader that connects to a WebSocket endpoint
  and yields new records as they arrive.
- Build a PySpark DataSource for reading data from MongoDB with
  partition-aware reads and predicate pushdown.