External Engine Interop

Skill: databricks-iceberg

Any engine that speaks the Iceberg REST Catalog protocol can read (and for managed Iceberg tables, write) your Databricks data. PyIceberg for lightweight Python scripts, OSS Spark for non-Databricks clusters, EMR for AWS-native workflows, Flink for stream processing — they all connect to the same IRC endpoint, authenticate, and get vended credentials for direct storage access. You configure it once per engine and every Unity Catalog table is available.

“Using Python, connect PyIceberg to a Databricks workspace and scan an Iceberg table with a filter pushdown.”

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "databricks",
    uri="https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest",
    warehouse="analytics",
    token="dapi_your_pat_token",
)

table = catalog.load_table("gold.orders")
df = table.scan(
    row_filter="order_date >= '2025-01-01' AND region = 'us-east-1'",
    limit=5000,
).to_pandas()

Key decisions:

  • warehouse pins the Unity Catalog catalog, so table identifiers use schema.table format
  • Filter pushdown (row_filter) prunes at the file level — only matching data files are read from storage
  • PAT authentication works for development; use OAuth service principals for production
  • The service principal needs EXTERNAL USE SCHEMA plus SELECT (and MODIFY for writes) on the target schema
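For the production path in the bullets above, the OAuth service-principal connection can be sketched like this. This is a sketch, not a definitive recipe: the property names follow PyIceberg's REST catalog options, the endpoint paths match the snippets in this doc, and the client ID/secret values are placeholders.

```python
# Sketch: OAuth service principal instead of a PAT (values are placeholders).
WORKSPACE = "https://my-workspace.cloud.databricks.com"

def oauth_catalog_props(client_id: str, client_secret: str) -> dict:
    """REST catalog properties for OAuth client-credentials auth."""
    return {
        "uri": f"{WORKSPACE}/api/2.1/unity-catalog/iceberg-rest",
        "warehouse": "analytics",
        "credential": f"{client_id}:{client_secret}",
        "oauth2-server-uri": f"{WORKSPACE}/oidc/v1/token",
        "scope": "all-apis",
    }

props = oauth_catalog_props("<client-id>", "<client-secret>")
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("databricks", **props)  # performs a network round-trip
```

The same `credential`, `oauth2-server-uri`, and `scope` trio reappears in the OSS Spark catalog configuration below.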

Write from PyIceberg with explicit Arrow types

“Using Python, append data to a managed Iceberg table from PyIceberg with correct type casting.”

import pyarrow as pa
from datetime import date
from decimal import Decimal
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "databricks",
    uri="https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest",
    warehouse="analytics",
    token="dapi_your_pat_token",
)
table = catalog.load_table("gold.orders")

# Match the Iceberg schema exactly -- PyArrow defaults to int64
arrow_schema = pa.schema([
    pa.field("order_id", pa.int64()),
    pa.field("customer_id", pa.int64()),
    pa.field("amount", pa.decimal128(10, 2)),
    pa.field("region", pa.string()),
    pa.field("order_date", pa.date32()),
])

# Use Decimal and date objects so values convert cleanly to decimal128/date32
new_rows = pa.Table.from_pylist([
    {"order_id": 5001, "customer_id": 77, "amount": Decimal("349.99"),
     "region": "eu-west-1", "order_date": date(2025, 7, 1)},
], schema=arrow_schema)
table.append(new_rows)

Only managed Iceberg tables accept external writes. UniForm and Compatibility Mode tables are read-only from external engines because the underlying format is Delta.

“Using Python, configure an open-source Spark cluster to read and write Databricks Iceberg tables.”

from pyspark.sql import SparkSession

WORKSPACE = "https://my-workspace.cloud.databricks.com"
CATALOG_ALIAS = "uc"

spark = (
    SparkSession.builder
    .appName("external-iceberg")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,"
            "org.apache.iceberg:iceberg-aws-bundle:1.7.1")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}",
            "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.type", "rest")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.uri",
            f"{WORKSPACE}/api/2.1/unity-catalog/iceberg-rest")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.rest.auth.type", "oauth2")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.oauth2-server-uri",
            f"{WORKSPACE}/oidc/v1/token")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.credential",
            "<client-id>:<client-secret>")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.scope", "all-apis")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.warehouse", "analytics")
    .getOrCreate()
)

# Read
spark.sql("SELECT * FROM uc.gold.orders LIMIT 10").show()

# Write (managed Iceberg only); df is any existing DataFrame
df.writeTo("uc.gold.orders").append()

Choose the cloud bundle matching your metastore: iceberg-aws-bundle for AWS, iceberg-azure-bundle for Azure, iceberg-gcp-bundle for GCP. This configuration is for Spark outside Databricks — inside DBR, use the built-in support.
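A small helper makes the bundle choice explicit. The artifact coordinates and versions come from the Spark snippet above; the helper name and signature are just illustrative.

```python
# Map metastore cloud -> Iceberg cloud bundle (versions match the snippet above).
ICEBERG_VERSION = "1.7.1"

CLOUD_BUNDLES = {
    "aws": "iceberg-aws-bundle",
    "azure": "iceberg-azure-bundle",
    "gcp": "iceberg-gcp-bundle",
}

def spark_packages(cloud: str, scala: str = "2.12", spark: str = "3.5") -> str:
    """Build the spark.jars.packages value for a non-Databricks Spark cluster."""
    bundle = CLOUD_BUNDLES[cloud]
    return (
        f"org.apache.iceberg:iceberg-spark-runtime-{spark}_{scala}:{ICEBERG_VERSION},"
        f"org.apache.iceberg:{bundle}:{ICEBERG_VERSION}"
    )
```

The Scala and Spark versions in the runtime artifact must match your cluster's build, not just the Iceberg release.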

“Set up a pyiceberg.yaml config file so scripts connect without inline credentials.”

catalog:
  databricks:
    type: rest
    uri: https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest
    warehouse: analytics
    token: dapi_your_pat_token

Place this at ~/.pyiceberg.yaml or set the PYICEBERG_CATALOG__DATABRICKS__URI (and related) environment variables. Scripts then call load_catalog("databricks") without passing connection details inline.
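The environment-variable route follows PyIceberg's `PYICEBERG_CATALOG__<NAME>__<PROPERTY>` naming. This is a minimal sketch of that convention, assuming flat property keys like the ones in the YAML above (nested or dotted keys map differently).

```python
# Sketch: derive PyIceberg env-var names from a catalog name and flat properties.
import os

def pyiceberg_env(catalog: str, props: dict) -> dict:
    """Translate flat catalog properties into PyIceberg environment variables."""
    return {
        f"PYICEBERG_CATALOG__{catalog.upper()}__{key.upper().replace('-', '_')}": value
        for key, value in props.items()
    }

env = pyiceberg_env("databricks", {
    "uri": "https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest",
    "warehouse": "analytics",
})
os.environ.update(env)  # load_catalog("databricks") then needs no inline details
```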

  • IP access lists are the first thing to check — if connections time out or return 403 even with valid credentials, the external engine’s egress IP is not on the workspace allowlist. This is the most common setup issue.
  • Do not install Iceberg JARs inside Databricks Runtime — DBR includes built-in Iceberg support. Adding a library causes class conflicts. Only configure external Iceberg JARs on non-Databricks engines.
  • PyArrow version matters — PyIceberg requires pyarrow>=17. The default pyarrow v15 bundled in some environments causes write errors. Upgrade explicitly: pip install "pyarrow>=17,<20".
  • Schema type mismatches cause silent failures — PyArrow defaults to int64 for all integers. If your Iceberg table uses int32 columns, the append succeeds but downstream reads may break. Always build an explicit Arrow schema matching the table.
  • v3 tables require Iceberg library 1.9.0+ — external engines on older library versions can’t read format-version 3 tables. Check client versions before upgrading your tables to v3.