External Engine Interoperability

Skill: databricks-iceberg

Any Iceberg-compatible engine can connect to Databricks through the IRC endpoint and work with your Unity Catalog tables directly: PyIceberg for lightweight Python reads and writes without Spark, OSS Spark for existing Spark clusters outside Databricks, and EMR, Flink, or Kafka Connect for streaming pipelines. The IRC handles auth and credential vending — you configure the catalog connection once and use standard Iceberg APIs from there.

“Write Python to connect PyIceberg to a Databricks workspace, read a table with a filter, and append new rows.”

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "uc",
    uri="https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest",
    warehouse="analytics",
    token="dapi_your_pat_token",
)

# Read with pushdown filter
tbl = catalog.load_table("gold.order_events")
df = tbl.scan(
    row_filter="order_date >= '2025-01-01'",
    limit=1000,
).to_pandas()

# Inspect schema and snapshot
print(tbl)
print(tbl.current_snapshot())

Key decisions:

  • warehouse pins the UC catalog — all subsequent table identifiers use schema.table, not catalog.schema.table
  • PyIceberg reads go directly to cloud storage — the IRC vends temporary credentials, so PyIceberg reads Parquet files from S3/ADLS/GCS without going through Databricks compute
  • Writes only work on managed Iceberg tables — UniForm and Compatibility Mode tables are read-only from external engines
  • Upgrade pyarrow explicitly — the bundled pyarrow v15 on serverless causes write errors. Pin pyarrow>=17,<20 and install adlfs for Azure
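The pyarrow pin is easy to enforce at startup. A minimal sketch of such a guard; `pyarrow_pin_ok` is a hypothetical helper, not part of PyIceberg:

```python
def pyarrow_pin_ok(ver: str) -> bool:
    """True if a pyarrow version string satisfies the recommended pin >=17,<20."""
    major = int(ver.split(".")[0])
    return 17 <= major < 20

print(pyarrow_pin_ok("15.0.2"))  # False -- the version bundled on serverless
print(pyarrow_pin_ok("17.0.0"))  # True
```

Run it against `importlib.metadata.version("pyarrow")` before attempting writes and fail fast with the pin above if it returns False.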

“Write Python to append rows to a Databricks-managed Iceberg table using PyIceberg and PyArrow.”

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "uc",
    uri="https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest",
    warehouse="analytics",
    token="dapi_your_pat_token",
)
tbl = catalog.load_table("gold.inventory")

# Schema must match exactly -- PyArrow defaults to int64, cast explicitly if table uses int32
arrow_schema = pa.schema([
    pa.field("id", pa.int32()),
    pa.field("name", pa.string()),
    pa.field("qty", pa.int32()),
])
rows = [
    {"id": 1, "name": "widget-a", "qty": 100},
    {"id": 2, "name": "widget-b", "qty": 250},
]
arrow_tbl = pa.Table.from_pylist(rows, schema=arrow_schema)
tbl.append(arrow_tbl)

PyArrow defaults to int64 for integers. If the Iceberg table schema uses int32, the append fails with a schema mismatch. Always define an explicit Arrow schema that matches the table definition.

“Write Python to configure an external Spark session to read and write Databricks-managed Iceberg tables using OAuth.”

from pyspark.sql import SparkSession

WORKSPACE_URL = "https://my-workspace.cloud.databricks.com"
UC_CATALOG_NAME = "analytics"
OAUTH_CLIENT_ID = "your-client-id"
OAUTH_CLIENT_SECRET = "your-client-secret"

ICEBERG_VER = "1.7.1"
RUNTIME = f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VER}"
CLOUD_BUNDLE = f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VER}"

spark = (
    SparkSession.builder
    .appName("uc-iceberg")
    .config("spark.jars.packages", f"{RUNTIME},{CLOUD_BUNDLE}")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.uc",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.uc.type", "rest")
    .config("spark.sql.catalog.uc.rest.auth.type", "oauth2")
    .config("spark.sql.catalog.uc.uri",
            f"{WORKSPACE_URL}/api/2.1/unity-catalog/iceberg-rest")
    .config("spark.sql.catalog.uc.oauth2-server-uri",
            f"{WORKSPACE_URL}/oidc/v1/token")
    .config("spark.sql.catalog.uc.credential",
            f"{OAUTH_CLIENT_ID}:{OAUTH_CLIENT_SECRET}")
    .config("spark.sql.catalog.uc.scope", "all-apis")
    .config("spark.sql.catalog.uc.warehouse", UC_CATALOG_NAME)
    .getOrCreate()
)

# Query via Spark SQL
spark.sql("SELECT * FROM uc.gold.order_events").show()

# Write (managed Iceberg tables only); df is any DataFrame whose schema matches the table
df.writeTo("uc.gold.order_events").append()

Two JARs are required: the Spark runtime and a cloud-specific bundle. Use iceberg-aws-bundle for AWS, iceberg-azure-bundle for Azure, or iceberg-gcp-bundle for GCP. This configuration is for Spark outside Databricks only — inside DBR, use the built-in Iceberg support and never install the Iceberg library.

“Show the Spark SQL statements for listing schemas, querying, and inserting into a Databricks-managed Iceberg table.”

-- List schemas
SHOW NAMESPACES IN uc;
-- Query
SELECT * FROM uc.gold.order_events
WHERE order_date >= '2025-01-01';
-- Insert (managed Iceberg tables only)
INSERT INTO uc.gold.order_events VALUES (42, 'purchase', '2025-07-01', '{}');

Once the Spark session is configured with the IRC catalog, standard Spark SQL works as expected. The catalog alias (uc in this example) is arbitrary — pick any name that makes sense for your context.

  • Never install Iceberg JARs inside Databricks Runtime — DBR includes built-in support. Adding a library causes class conflicts. This configuration is strictly for external Spark clusters.
  • 403 Forbidden usually means IP access lists, not credentials — if the workspace has IP access lists enabled, add the client’s egress CIDR to the allowlist before debugging auth. Check via Admin Console under Settings, Security, IP access list.
  • PyArrow schema mismatches fail silently on reads, loudly on writes — reads may cast types automatically, but writes require exact type matching. Always define explicit Arrow schemas.
  • Iceberg v3 tables require library 1.9.0+ — older Iceberg client libraries cannot read v3 tables. If you get unexpected errors on a table that recently upgraded, check the client library version.
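The last point can be turned into a startup check against the client version you pin (for example the `ICEBERG_VER` used in the Spark example). A sketch; `supports_iceberg_v3` is a hypothetical helper:

```python
def supports_iceberg_v3(client_ver: str) -> bool:
    """True if an Iceberg client library version can read v3 tables (requires 1.9.0+)."""
    parts = tuple(int(p) for p in client_ver.split(".")[:3])
    return parts >= (1, 9, 0)

print(supports_iceberg_v3("1.7.1"))  # False -- the runtime pinned in the Spark example
print(supports_iceberg_v3("1.9.0"))  # True
```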