External Engine Interoperability

Skill: databricks-iceberg

Any Iceberg-compatible engine can connect to Databricks through the IRC endpoint and work with your Unity Catalog tables directly: PyIceberg for lightweight Python reads and writes without Spark, OSS Spark for existing Spark clusters outside Databricks, and EMR, Flink, or Kafka Connect for streaming pipelines. The IRC handles auth and credential vending — you configure the catalog connection once and use standard Iceberg APIs from there.

“Write Python to connect PyIceberg to a Databricks workspace, read a table with a filter, and append new rows.”

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "uc",
    uri="https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest",
    warehouse="analytics",
    token="dapi_your_pat_token",
)

# Read with pushdown filter
tbl = catalog.load_table("gold.order_events")
df = tbl.scan(
    row_filter="order_date >= '2025-01-01'",
    limit=1000,
).to_pandas()

# Inspect schema and snapshot
print(tbl)
print(tbl.current_snapshot())

Key decisions:

  • warehouse pins the UC catalog — all subsequent table identifiers use schema.table, not catalog.schema.table
  • PyIceberg reads go directly to cloud storage — the IRC vends temporary credentials, so PyIceberg reads Parquet files from S3/ADLS/GCS without going through Databricks compute
  • Writes only work on managed Iceberg tables — UniForm and Compatibility Mode tables are read-only from external engines
  • Upgrade pyarrow explicitly — the bundled pyarrow v15 on serverless causes write errors. Pin pyarrow>=17,<20 and install adlfs for Azure
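The pyarrow pin is easy to enforce at startup. A minimal sketch of such a guard; `pyarrow_pin_ok` is a hypothetical helper, not part of PyIceberg:

```python
def pyarrow_pin_ok(ver: str) -> bool:
    """True if a pyarrow version string satisfies the recommended pin >=17,<20."""
    major = int(ver.split(".")[0])
    return 17 <= major < 20

print(pyarrow_pin_ok("15.0.2"))  # False -- the version bundled on serverless
print(pyarrow_pin_ok("17.0.0"))  # True
```

Run it against `importlib.metadata.version("pyarrow")` before attempting writes and fail fast with the pin above if it returns False.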

“Write Python to append rows to a Databricks-managed Iceberg table using PyIceberg and PyArrow.”

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "uc",
    uri="https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest",
    warehouse="analytics",
    token="dapi_your_pat_token",
)
tbl = catalog.load_table("gold.inventory")

# Schema must match exactly -- PyArrow defaults to int64, cast explicitly if table uses int32
arrow_schema = pa.schema([
    pa.field("id", pa.int32()),
    pa.field("name", pa.string()),
    pa.field("qty", pa.int32()),
])
rows = [
    {"id": 1, "name": "widget-a", "qty": 100},
    {"id": 2, "name": "widget-b", "qty": 250},
]
arrow_tbl = pa.Table.from_pylist(rows, schema=arrow_schema)
tbl.append(arrow_tbl)

PyArrow defaults to int64 for integers. If the Iceberg table schema uses int32, the append fails with a schema mismatch. Always define an explicit Arrow schema that matches the table definition.

“Write Python to configure an external Spark session to read and write Databricks-managed Iceberg tables using OAuth.”

from pyspark.sql import SparkSession

WORKSPACE_URL = "https://my-workspace.cloud.databricks.com"
UC_CATALOG_NAME = "analytics"
OAUTH_CLIENT_ID = "your-client-id"
OAUTH_CLIENT_SECRET = "your-client-secret"

ICEBERG_VER = "1.7.1"
RUNTIME = f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VER}"
CLOUD_BUNDLE = f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VER}"

spark = (
    SparkSession.builder
    .appName("uc-iceberg")
    .config("spark.jars.packages", f"{RUNTIME},{CLOUD_BUNDLE}")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.uc",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.uc.type", "rest")
    .config("spark.sql.catalog.uc.rest.auth.type", "oauth2")
    .config("spark.sql.catalog.uc.uri",
            f"{WORKSPACE_URL}/api/2.1/unity-catalog/iceberg-rest")
    .config("spark.sql.catalog.uc.oauth2-server-uri",
            f"{WORKSPACE_URL}/oidc/v1/token")
    .config("spark.sql.catalog.uc.credential",
            f"{OAUTH_CLIENT_ID}:{OAUTH_CLIENT_SECRET}")
    .config("spark.sql.catalog.uc.scope", "all-apis")
    .config("spark.sql.catalog.uc.warehouse", UC_CATALOG_NAME)
    .getOrCreate()
)

# Query via Spark SQL
spark.sql("SELECT * FROM uc.gold.order_events").show()

# Write (managed Iceberg tables only); df is any DataFrame whose schema matches the table
df.writeTo("uc.gold.order_events").append()

Two JARs are required: the Spark runtime and a cloud-specific bundle. Use iceberg-aws-bundle for AWS, iceberg-azure-bundle for Azure, or iceberg-gcp-bundle for GCP. This configuration is for Spark outside Databricks only — inside DBR, use the built-in Iceberg support and never install the Iceberg library.

“Show the Spark SQL statements for listing schemas, querying, and inserting into a Databricks-managed Iceberg table.”

-- List schemas
SHOW NAMESPACES IN uc;
-- Query
SELECT * FROM uc.gold.order_events
WHERE order_date >= '2025-01-01';
-- Insert (managed Iceberg tables only)
INSERT INTO uc.gold.order_events VALUES (42, 'purchase', '2025-07-01', '{}');

Once the Spark session is configured with the IRC catalog, standard Spark SQL works as expected. The catalog alias (uc in this example) is arbitrary — pick any name that makes sense for your context.

  • Never install Iceberg JARs inside Databricks Runtime — DBR includes built-in support. Adding a library causes class conflicts. This configuration is strictly for external Spark clusters.
  • 403 Forbidden usually means IP access lists, not credentials — if the workspace has IP access lists enabled, add the client’s egress CIDR to the allowlist before debugging auth. Check via Admin Console under Settings, Security, IP access list.
  • PyArrow schema mismatches fail silently on reads, loudly on writes — reads may cast types automatically, but writes require exact type matching. Always define explicit Arrow schemas.
  • Iceberg v3 tables require library 1.9.0+ — older Iceberg client libraries cannot read v3 tables. If you get unexpected errors on a table that recently upgraded, check the client library version.
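The last point can be turned into a startup check against the client version you pin (for example the `ICEBERG_VER` used in the Spark example). A sketch; `supports_iceberg_v3` is a hypothetical helper:

```python
def supports_iceberg_v3(client_ver: str) -> bool:
    """True if an Iceberg client library version can read v3 tables (requires 1.9.0+)."""
    parts = tuple(int(p) for p in client_ver.split(".")[:3])
    return parts >= (1, 9, 0)

print(supports_iceberg_v3("1.7.1"))  # False -- the runtime pinned in the Spark example
print(supports_iceberg_v3("1.9.0"))  # True
```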