External Engine Interoperability
Skill: databricks-iceberg
What You Can Build
Any Iceberg-compatible engine can connect to Databricks through the IRC endpoint and work with your Unity Catalog tables directly: PyIceberg for lightweight Python reads and writes without Spark, OSS Spark for existing Spark clusters outside Databricks, and EMR, Flink, or Kafka Connect for streaming pipelines. The IRC handles auth and credential vending — you configure the catalog connection once and use standard Iceberg APIs from there.
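That one-time catalog configuration can be sketched as a small helper. This is illustrative only — `make_irc_config` is a hypothetical function, not part of any Databricks or PyIceberg SDK; the endpoint path and `warehouse`/`token` keys come from the PyIceberg example below.

```python
# Sketch: assemble the IRC endpoint and load_catalog keyword arguments for a
# workspace. `make_irc_config` is a hypothetical helper for illustration.

def make_irc_config(workspace_url: str, uc_catalog: str, token: str) -> dict:
    """Build the kwargs PyIceberg's load_catalog expects for Unity Catalog."""
    return {
        "uri": f"{workspace_url.rstrip('/')}/api/2.1/unity-catalog/iceberg-rest",
        "warehouse": uc_catalog,  # pins the UC catalog for the session
        "token": token,           # PAT; OAuth is also supported
    }

cfg = make_irc_config(
    "https://my-workspace.cloud.databricks.com/", "analytics", "dapi_your_pat_token"
)
print(cfg["uri"])
# https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest
```

The same dict can then be splatted into `load_catalog("uc", **cfg)`.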
In Action
“Write Python to connect PyIceberg to a Databricks workspace, read a table with a filter, and inspect its schema and current snapshot.”
```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "uc",
    uri="https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest",
    warehouse="analytics",
    token="dapi_your_pat_token",
)

# Read with pushdown filter
tbl = catalog.load_table("gold.order_events")
df = tbl.scan(
    row_filter="order_date >= '2025-01-01'",
    limit=1000,
).to_pandas()

# Inspect schema and snapshot
print(tbl)
print(tbl.current_snapshot())
```

Key decisions:
- `warehouse` pins the UC catalog — all subsequent table identifiers use `schema.table`, not `catalog.schema.table`
- PyIceberg reads go directly to cloud storage — the IRC vends temporary credentials, so PyIceberg reads Parquet files from S3/ADLS/GCS without going through Databricks compute
- Writes only work on managed Iceberg tables — UniForm and Compatibility Mode tables are read-only from external engines
- Upgrade pyarrow explicitly — the bundled pyarrow v15 on serverless causes write errors. Pin `pyarrow>=17,<20` and install `adlfs` for Azure
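The pyarrow pin above can be enforced with a quick pre-flight check. A minimal sketch, pure Python with no pyarrow import; `pyarrow_pin_ok` is a hypothetical helper and the `>=17,<20` bounds come from the pin stated above.

```python
# Sketch: verify an installed pyarrow version satisfies the >=17,<20 pin
# before attempting writes. `pyarrow_pin_ok` is a hypothetical helper.

def pyarrow_pin_ok(version: str) -> bool:
    """Return True if a pyarrow version string falls inside the pinned range."""
    major = int(version.split(".")[0])
    return 17 <= major < 20

assert pyarrow_pin_ok("17.0.0")
assert not pyarrow_pin_ok("15.0.2")  # the serverless-bundled version that causes write errors
```

In practice you would feed in `pyarrow.__version__` and fail fast with an upgrade hint instead of returning a bool.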
More Patterns
Write data from PyIceberg
“Write Python to append rows to a Databricks-managed Iceberg table using PyIceberg and PyArrow.”
```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "uc",
    uri="https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest",
    warehouse="analytics",
    token="dapi_your_pat_token",
)

tbl = catalog.load_table("gold.inventory")

# Schema must match exactly -- PyArrow defaults to int64,
# so cast explicitly if the table uses int32
arrow_schema = pa.schema([
    pa.field("id", pa.int32()),
    pa.field("name", pa.string()),
    pa.field("qty", pa.int32()),
])

rows = [
    {"id": 1, "name": "widget-a", "qty": 100},
    {"id": 2, "name": "widget-b", "qty": 250},
]
arrow_tbl = pa.Table.from_pylist(rows, schema=arrow_schema)
tbl.append(arrow_tbl)
```

PyArrow defaults to int64 for integers. If the Iceberg table schema uses int32, the append fails with a schema mismatch. Always define an explicit Arrow schema that matches the table definition.
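Because casting down to int32 only works when the values actually fit, a pre-flight range check can catch bad rows before the Arrow table is even built. This is a sketch, not part of PyIceberg: `fits_int32` and `INT32_COLS` are hypothetical names, with `INT32_COLS` assumed to match the int32 columns in the table definition above.

```python
# Sketch: verify row values fit the table's int32 columns before casting.
# `fits_int32` and `INT32_COLS` are hypothetical, for illustration only.

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT32_COLS = ["id", "qty"]  # assumed to match the table definition

def fits_int32(rows: list) -> bool:
    """Return True if every int32-column value is representable as int32."""
    return all(
        INT32_MIN <= row[col] <= INT32_MAX
        for row in rows
        for col in INT32_COLS
    )

rows = [
    {"id": 1, "name": "widget-a", "qty": 100},
    {"id": 2, "name": "widget-b", "qty": 250},
]
assert fits_int32(rows)
assert not fits_int32([{"id": 2**40, "name": "too-big", "qty": 1}])
```

Note this guards the value range only — the append still requires the explicit Arrow schema, since the type mismatch occurs regardless of the values.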
Connect OSS Spark with OAuth
“Write Python to configure an external Spark session to read and write Databricks-managed Iceberg tables using OAuth.”
```python
from pyspark.sql import SparkSession

WORKSPACE_URL = "https://my-workspace.cloud.databricks.com"
UC_CATALOG_NAME = "analytics"
OAUTH_CLIENT_ID = "your-client-id"
OAUTH_CLIENT_SECRET = "your-client-secret"
ICEBERG_VER = "1.7.1"

RUNTIME = f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VER}"
CLOUD_BUNDLE = f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VER}"

spark = (
    SparkSession.builder
    .appName("uc-iceberg")
    .config("spark.jars.packages", f"{RUNTIME},{CLOUD_BUNDLE}")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.uc", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.uc.type", "rest")
    .config("spark.sql.catalog.uc.rest.auth.type", "oauth2")
    .config("spark.sql.catalog.uc.uri", f"{WORKSPACE_URL}/api/2.1/unity-catalog/iceberg-rest")
    .config("spark.sql.catalog.uc.oauth2-server-uri", f"{WORKSPACE_URL}/oidc/v1/token")
    .config("spark.sql.catalog.uc.credential", f"{OAUTH_CLIENT_ID}:{OAUTH_CLIENT_SECRET}")
    .config("spark.sql.catalog.uc.scope", "all-apis")
    .config("spark.sql.catalog.uc.warehouse", UC_CATALOG_NAME)
    .getOrCreate()
)

# Query via Spark SQL
spark.sql("SELECT * FROM uc.gold.order_events").show()

# Write (managed Iceberg tables only); df is any DataFrame matching the table schema
df.writeTo("uc.gold.order_events").append()
```

Two JARs are required: the Spark runtime and a cloud-specific bundle. Use iceberg-aws-bundle for AWS, iceberg-azure-bundle for Azure, or iceberg-gcp-bundle for GCP. This configuration is for Spark outside Databricks only — inside DBR, use the built-in Iceberg support and never install the Iceberg library.
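Since the catalog settings differ only in the alias and credentials, they can be factored into a plain dict builder and reused across clusters. A sketch under that assumption — `uc_catalog_conf` is a hypothetical helper mirroring the `spark.sql.catalog.*` keys shown above, not a Databricks API.

```python
# Sketch: build the spark.sql.catalog.* settings as a dict so the same helper
# can configure sessions for any catalog alias. `uc_catalog_conf` is hypothetical.

def uc_catalog_conf(alias: str, workspace_url: str, uc_catalog: str,
                    client_id: str, client_secret: str) -> dict:
    base = f"spark.sql.catalog.{alias}"
    return {
        base: "org.apache.iceberg.spark.SparkCatalog",
        f"{base}.type": "rest",
        f"{base}.rest.auth.type": "oauth2",
        f"{base}.uri": f"{workspace_url}/api/2.1/unity-catalog/iceberg-rest",
        f"{base}.oauth2-server-uri": f"{workspace_url}/oidc/v1/token",
        f"{base}.credential": f"{client_id}:{client_secret}",
        f"{base}.scope": "all-apis",
        f"{base}.warehouse": uc_catalog,
    }

conf = uc_catalog_conf("uc", "https://my-workspace.cloud.databricks.com",
                       "analytics", "your-client-id", "your-client-secret")
```

Apply it to a builder with `for k, v in conf.items(): builder = builder.config(k, v)` before `getOrCreate()`.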
Query from Spark SQL after session setup
“Show the Spark SQL statements for listing schemas, querying, and inserting into a Databricks-managed Iceberg table.”
```sql
-- List schemas
SHOW NAMESPACES IN uc;

-- Query
SELECT * FROM uc.gold.order_events
WHERE order_date >= '2025-01-01';

-- Insert (managed Iceberg tables only)
INSERT INTO uc.gold.order_events VALUES (42, 'purchase', '2025-07-01', '{}');
```

Once the Spark session is configured with the IRC catalog, standard Spark SQL works as expected. The catalog alias (`uc` in this example) is arbitrary — pick any name that makes sense for your context.
Watch Out For
- Never install Iceberg JARs inside Databricks Runtime — DBR includes built-in support. Adding a library causes class conflicts. This configuration is strictly for external Spark clusters.
- 403 Forbidden usually means IP access lists, not credentials — if the workspace has IP access lists enabled, add the client’s egress CIDR to the allowlist before debugging auth. Check via Admin Console under Settings, Security, IP access list.
- PyArrow schema mismatches fail silently on reads, loudly on writes — reads may cast types automatically, but writes require exact type matching. Always define explicit Arrow schemas.
- Iceberg v3 tables require library 1.9.0+ — older Iceberg client libraries cannot read v3 tables. If you get unexpected errors on a table that recently upgraded, check the client library version.
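The v3 compatibility gate above lends itself to a fail-fast check. A minimal sketch — `require_v3_support` is a hypothetical helper, and the 1.9.0 floor is the one stated in the bullet.

```python
# Sketch: fail fast when the installed Iceberg client library is too old to
# read v3 tables. `require_v3_support` is a hypothetical helper.

def require_v3_support(client_version: str) -> None:
    """Raise if the client version is below the 1.9.0 floor for Iceberg v3 tables."""
    parts = tuple(int(p) for p in client_version.split(".")[:3])
    if parts < (1, 9, 0):
        raise RuntimeError(
            f"Iceberg client {client_version} cannot read v3 tables; upgrade to 1.9.0+"
        )

require_v3_support("1.9.1")  # ok, no exception
# require_v3_support("1.7.1") would raise RuntimeError
```

Running this once at pipeline startup turns a confusing mid-job read failure into an actionable error message.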