# Data Exploration

Skill: databricks-config

## What You Can Build

Before you build pipelines or dashboards, you need to know what data exists and what it looks like. You can use the Databricks CLI to search `information_schema` by keyword, inspect column types and partitioning, and run ad-hoc SQL queries, all without leaving the terminal. This is the fastest path from “where’s the customer data?” to “here’s the schema and a sample.”
## In Action

“Find all tables with ‘customer’ in the name across every catalog I have access to, then show me the columns for the best match.”
```sh
# Search for tables by keyword
databricks sql execute --warehouse-id <WH_ID> \
  --statement "
    SELECT table_catalog, table_schema, table_name
    FROM system.information_schema.tables
    WHERE LOWER(table_name) LIKE '%customer%'
    ORDER BY table_catalog, table_schema, table_name
  " --profile <PROFILE>
```
```sh
# Inspect the schema of a specific table
databricks sql execute --warehouse-id <WH_ID> \
  --statement "DESCRIBE TABLE EXTENDED main.sales.customers" \
  --profile <PROFILE>
```

Key decisions:

- `system.information_schema.tables` searches across all catalogs you have access to, which makes it the right starting point when you don’t know which catalog contains your data
- `DESCRIBE TABLE EXTENDED` returns column names, types, nullability, partitioning, and table properties in a single call
- The `--warehouse-id` flag routes the query to a SQL warehouse (not a cluster), which is faster and cheaper for metadata lookups
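Once the search returns a match, the fully qualified name for the `DESCRIBE` call can be assembled straight from the result row. A minimal sketch, assuming a JSON row whose field names follow the query’s column aliases (the exact output shape depends on your CLI version, so verify against real output):

```sh
# Hypothetical result row from the table search (field names assumed
# to match the query's column aliases; check real CLI output first)
row='{"table_catalog":"main","table_schema":"sales","table_name":"customers"}'

# Assemble the fully qualified name, then the DESCRIBE statement
fq=$(echo "$row" | jq -r '"\(.table_catalog).\(.table_schema).\(.table_name)"')
echo "DESCRIBE TABLE EXTENDED $fq"
# prints: DESCRIBE TABLE EXTENDED main.sales.customers
```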
## More Patterns

### Find an Available Warehouse

“List SQL warehouses in my workspace to find one that’s running.”
```sh
# List all warehouses with their status
databricks warehouses list --profile <PROFILE>

# Or get the default warehouse for AI tools
databricks experimental aitools tools get-default-warehouse --profile <PROFILE>
```

If you don’t know your warehouse ID, `warehouses list` shows all available warehouses with their IDs and current state. Pick one that’s RUNNING or STARTING to avoid cold-start wait times.
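To pick a ready warehouse programmatically, the list output can be filtered with `jq`. A sketch against a hypothetical sample payload; the `id` and `state` field names are assumptions, so check the JSON your CLI version actually emits:

```sh
# Hypothetical sample of warehouse-list JSON (field names assumed)
sample='[{"id":"abc123","name":"dev-wh","state":"RUNNING"},
         {"id":"def456","name":"prod-wh","state":"STOPPED"}]'

# Keep only warehouses that are ready (or nearly ready) to accept queries
echo "$sample" | jq -r '.[] | select(.state == "RUNNING" or .state == "STARTING") | .id'
# prints: abc123
```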
### Run Ad-Hoc Queries

“Pull a 10-row sample from a table to check what the data looks like.”
```sh
databricks sql execute --warehouse-id <WH_ID> \
  --statement "SELECT * FROM main.sales.customers LIMIT 10" \
  --profile <PROFILE>
```

Ad-hoc queries are useful for sanity-checking data before writing transforms. The output format is JSON by default; pipe it through `jq` for prettier formatting if needed.
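As a sketch of what that post-processing can look like, here is `jq` pulling row arrays out of a statement result. The `.result.data_array` path mirrors the Statement Execution API response shape, but treat it as an assumption and confirm against your CLI version’s actual output:

```sh
# Hypothetical (truncated) statement result; real output carries more fields
result='{"result":{"data_array":[["1","Ada"],["2","Grace"]]}}'

# Emit one compact JSON array per row
echo "$result" | jq -c '.result.data_array[]'
# prints: ["1","Ada"] then ["2","Grace"], one per line
```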
### Check Date Ranges Before Filtering

“Find the actual date range in a table so I don’t accidentally filter to an empty result set.”
```sh
databricks sql execute --warehouse-id <WH_ID> \
  --statement "
    SELECT MIN(order_date) AS earliest,
           MAX(order_date) AS latest,
           COUNT(*) AS total_rows
    FROM main.sales.orders
  " --profile <PROFILE>
```

Sample datasets and dev environments often have historical data rather than current-day data. Checking the date range first prevents the common mistake of filtering with `CURRENT_DATE()` against a dataset that ends six months ago.
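Once you know the range, one way to stay safe is to anchor filters to the data’s own latest date instead of the wall clock. A hedged sketch reusing the example table; `date_sub` is standard Databricks SQL, but adjust table and column names for your schema:

```sh
# Filter to the last 30 days relative to the table's own MAX date,
# not CURRENT_DATE(), so stale demo datasets still return rows
stmt="SELECT * FROM main.sales.orders
WHERE order_date >= (SELECT date_sub(MAX(order_date), 30) FROM main.sales.orders)"

echo "$stmt"
# databricks sql execute --warehouse-id <WH_ID> --statement "$stmt" --profile <PROFILE>
```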
### Explore Column Statistics

“Get value distributions for key columns to understand data quality before building a pipeline.”
```sh
databricks sql execute --warehouse-id <WH_ID> \
  --statement "
    SELECT COUNT(*) AS total,
           COUNT(DISTINCT customer_id) AS unique_customers,
           COUNT(CASE WHEN email IS NULL THEN 1 END) AS null_emails,
           MIN(created_at) AS first_record,
           MAX(created_at) AS last_record
    FROM main.sales.customers
  " --profile <PROFILE>
```

Running basic profiling queries before you design a pipeline saves time. Knowing the null rate, cardinality, and date range up front means fewer surprises when your transform logic hits production data.
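Raw counts are easier to act on as rates. A small sketch that turns the profile result into a null-rate percentage with `jq`; the flat field layout here is an assumption based on the query’s column aliases:

```sh
# Hypothetical flattened profile result (field names match the aliases above)
profile='{"total":1000,"unique_customers":950,"null_emails":37}'

# Null rate as a percentage: (null_emails * 100) / total
echo "$profile" | jq '(.null_emails * 100) / .total'
# prints: 3.7
```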
## Watch Out For

- Querying `information_schema` without a warehouse — `information_schema` lives in the `system` catalog and requires a SQL warehouse for execution. Cluster-based queries can’t access it. Always pass `--warehouse-id`.
- Filtering with `CURRENT_DATE()` on sample data — Demo and dev tables often contain historical data. A `WHERE order_date = CURRENT_DATE()` filter returns zero rows. Check date ranges first.
- Forgetting `--profile` on multi-workspace setups — Without `--profile`, the CLI uses default credentials, which may point to the wrong workspace. Always specify the profile when working with multiple environments.
- Running expensive queries without `LIMIT` — Ad-hoc exploration without `LIMIT` can accidentally scan a multi-terabyte table. Start with `LIMIT 10` and remove it only when you’ve confirmed the table size.
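The last point can be enforced mechanically. A minimal sketch of a hypothetical wrapper (naive substring check only, not a SQL parser) that appends `LIMIT 10` to exploratory queries that don’t already carry one:

```sh
# Append LIMIT 10 unless the query already contains a LIMIT clause.
# Naive substring check; good enough for interactive exploration.
safe_query() {
  q="$1"
  case "$q" in
    *LIMIT*|*limit*) ;;        # already limited, leave as-is
    *) q="$q LIMIT 10" ;;
  esac
  echo "$q"
  # databricks sql execute --warehouse-id <WH_ID> --statement "$q" --profile <PROFILE>
}

safe_query "SELECT * FROM main.sales.customers"
# prints: SELECT * FROM main.sales.customers LIMIT 10
```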