# Protobuf Schema

Skill: `databricks-zerobus-ingest`
## What You Can Build

A type-safe ingestion pipeline where your Protobuf schema is generated directly from the Unity Catalog table definition. Instead of hand-writing `.proto` files and hoping they match your Delta table, you run the SDK's schema generator against the table metadata. The result is a `.proto` file with exact field-to-column mapping, compiled into language bindings for Python, Java, Go, or Rust.
## In Action

"Define a Protobuf schema for my event data and set up Zerobus to deserialize it automatically. I'm using Python."
```sh
# Generate .proto from the Unity Catalog table schema
python -m zerobus.tools.generate_proto \
  --uc-endpoint "https://dbc-a1b2c3d4-e5f6.cloud.databricks.com" \
  --client-id "$DATABRICKS_CLIENT_ID" \
  --client-secret "$DATABRICKS_CLIENT_SECRET" \
  --table "catalog.schema.my_events" \
  --output record.proto
```

Generated output — matches the table schema exactly:

```proto
syntax = "proto3";

message MyEvents {
  string event_id = 1;
  string device_name = 2;
  int32 temp = 3;
  int64 humidity = 4;
  int64 event_time = 5;
}
```

```sh
# Compile to Python bindings
pip install grpcio-tools
python -m grpc_tools.protoc -I. --python_out=. record.proto
```

```python
# Use the compiled module in your producer
import record_pb2
from zerobus.sdk.shared import TableProperties

table_props = TableProperties(
    "catalog.schema.my_events",
    record_pb2.MyEvents.DESCRIPTOR,
)
```

Key decisions:
- The generator reads column names and types from Unity Catalog metadata — no manual `.proto` authoring needed
- The `DESCRIPTOR` object from the compiled module tells Zerobus how to validate and deserialize each field
- `TIMESTAMP` columns become `int64` (epoch microseconds), not strings — this is the most common source of ingestion errors (see the sketch after this list)
- The `.proto` is a one-time generation per table version — regenerate only when the table schema changes
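To make the timestamp rule concrete, here is a minimal producer-side sketch using the compiled `record_pb2` module from above; the field values are illustrative:

```python
import time

import record_pb2

# TIMESTAMP columns expect epoch MICROseconds in an int64 field.
# time.time() returns seconds, so scale by 1_000_000.
record = record_pb2.MyEvents(
    event_id="evt-001",
    device_name="sensor-1",
    temp=22,
    humidity=55,
    event_time=int(time.time() * 1_000_000),
)
```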
## More Patterns

### Generate schema using the Java SDK tool

"Generate a Protobuf schema from my Unity Catalog table using the Java SDK and compile Java bindings."
```sh
# Generate .proto from UC table
java -jar zerobus-ingest-sdk-0.1.0-jar-with-dependencies.jar \
  --uc-endpoint "https://dbc-a1b2c3d4-e5f6.cloud.databricks.com" \
  --client-id "$DATABRICKS_CLIENT_ID" \
  --client-secret "$DATABRICKS_CLIENT_SECRET" \
  --table "catalog.schema.my_events" \
  --output record.proto

# Compile to Java classes
protoc --java_out=src/main/java record.proto
```

```java
import com.example.proto.Record.MyEvents;

MyEvents record = MyEvents.newBuilder()
    .setDeviceName("sensor-1")
    .setTemp(22)
    .setHumidity(55L)
    .build();
```

Both the Python and Java generators produce identical `.proto` output. The difference is only in how you invoke the tool — Python uses a module entry point, Java uses the SDK's fat JAR.
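For the `com.example.proto.Record.MyEvents` import above to resolve, the `.proto` needs Java options along these lines. This is an assumption for illustration — check what the generator actually emits in your `record.proto`:

```proto
// Assumed file-level options (not confirmed generator output).
// java_package "com.example.proto" plus outer classname "Record"
// nests the message as com.example.proto.Record.MyEvents.
option java_package = "com.example.proto";
option java_outer_classname = "Record";
```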
### Compile bindings for Go and Rust

"Compile my generated .proto file for Go and Rust projects."

```sh
# Go bindings
protoc --go_out=. record.proto
```

```rust
// Rust: use prost in build.rs
fn main() {
    prost_build::compile_protos(&["record.proto"], &["."]).unwrap();
}
```

Go and Rust don't have built-in `.proto` generators — you generate the `.proto` with the Python or Java tool, then compile bindings with `protoc` (Go) or `prost` (Rust). The output is identical regardless of which tool generated the `.proto`.
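One Go tooling caveat: `protoc --go_out` requires the `protoc-gen-go` plugin on your PATH, and the plugin won't generate code unless the `.proto` declares a `go_package` option or you supply an import path on the command line. If the generated file lacks the option, a mapping flag works — the import path below is a placeholder, not something the generator produces:

```sh
# Install the Go code generator plugin if it isn't already on PATH
go install google.golang.org/protobuf/cmd/protoc-gen-go@latest

# Supply the Go import path via --go_opt=M<file>=<path> when record.proto
# does not declare `option go_package = ...` itself
protoc --go_out=. \
  --go_opt=Mrecord.proto=example.com/events/recordpb \
  record.proto
```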
### Handle schema evolution after a table change

"My Unity Catalog table has a new column. Update my Protobuf schema and producer code to match."

```sh
# 1. Regenerate the .proto from the updated table
python -m zerobus.tools.generate_proto \
  --uc-endpoint "https://dbc-a1b2c3d4-e5f6.cloud.databricks.com" \
  --client-id "$DATABRICKS_CLIENT_ID" \
  --client-secret "$DATABRICKS_CLIENT_SECRET" \
  --table "catalog.schema.my_events" \
  --output record.proto

# 2. Recompile language bindings
python -m grpc_tools.protoc -I. --python_out=. record.proto

# 3. Update producer code to populate the new field, then redeploy
```

Zerobus does not support automatic schema evolution. When you add, remove, or change a column in Unity Catalog, you must regenerate the `.proto`, recompile, update your producer to populate new fields, and redeploy. Skipping any step leads to either a schema mismatch error or silently dropped fields.
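As a concrete illustration of step 3 — assuming the new column is a hypothetical `pressure` (an integer column, so `int32` in the regenerated `.proto`) — the producer change is just populating the new field:

```python
import time

import record_pb2  # recompiled from the regenerated record.proto

record = record_pb2.MyEvents(
    event_id="evt-002",
    device_name="sensor-1",
    temp=22,
    humidity=55,
    event_time=int(time.time() * 1_000_000),
    pressure=1013,  # hypothetical new field; unset proto3 scalars silently become zero values
)
```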
## Watch Out For

- TIMESTAMP is epoch microseconds, not a string — the generator maps `TIMESTAMP` to `int64`. Your producer must pass `int(time.time() * 1_000_000)` in Python or the equivalent in other languages. Passing an ISO string or epoch seconds causes ingestion failures or wrong values.
- The `.proto` must match the table schema exactly — if you add a column to the table but keep the old `.proto`, new records will be missing that field. If you add a field to the `.proto` that doesn't exist in the table, ingestion fails with a schema mismatch.
- DECIMAL maps to bytes or string, not a numeric type — check your generated `.proto` for the exact mapping. The Protobuf type depends on the precision and scale of the Delta column.
- MAP keys must be string or integer — Protobuf's `map<K, V>` only supports string and integer key types. If your Delta table has a `MAP` column with a non-string, non-integer key, the generator will fail or produce an unusable schema.
- Max 2,000 columns per schema, 10 MB per message — tables with wide schemas or large nested structs can hit these limits. If you're close, consider splitting into multiple tables or flattening nested structures.
- grpcio-tools version conflicts with the Databricks runtime — if generating schemas on a Databricks cluster, check the runtime's protobuf version first and install a compatible `grpcio-tools` (a quick pre-flight check is sketched below). Mismatched versions cause import errors that are hard to trace.
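A minimal pre-flight check for that last point — run it on the cluster before installing anything. The version pin below is illustrative, not a recommendation; match it to what the runtime reports:

```sh
# Print the protobuf version shipped with the Databricks runtime
python -c "import google.protobuf; print(google.protobuf.__version__)"

# Then pin grpcio-tools to a release built against a compatible protobuf
pip install "grpcio-tools==1.62.*"
```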