databricks-common-errors

📁 jeremylongshore/claude-code-plugins-plus 📅 Feb 3, 2026
Total installs: 9 · Weekly installs: 9 · Site rank: #31831

Install command
npx skills add https://github.com/jeremylongshore/claude-code-plugins-plus --skill databricks-common-errors

Agent install distribution

claude-code 9
mcpjam 8
antigravity 8
junie 8
windsurf 8
zencoder 8

Skill Documentation

Databricks Common Errors

Overview

Quick reference for the top Databricks errors and their solutions.

Prerequisites

  • Databricks CLI/SDK installed
  • API credentials configured
  • Access to cluster/job logs

Instructions

Step 1: Identify the Error

Check error message in job run output, cluster logs, or notebook cells.

Step 2: Find Matching Error Below

Match your error to one of the documented cases.

Step 3: Apply Solution

Follow the solution steps for your specific error.

Output

  • Identified error cause
  • Applied fix
  • Verified resolution

Error Handling

CLUSTER_NOT_READY

Error Message:

ClusterNotReadyException: Cluster is not in a valid state

Cause: Cluster is starting, terminating, or in error state.

Solution:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import State

w = WorkspaceClient()
cluster = w.clusters.get("cluster-id")

if cluster.state in [State.PENDING, State.RESTARTING]:
    # Wait for cluster
    w.clusters.wait_get_cluster_running("cluster-id")
elif cluster.state == State.TERMINATED:
    # Start cluster
    w.clusters.start("cluster-id")
    w.clusters.wait_get_cluster_running("cluster-id")
elif cluster.state == State.ERROR:
    # Check termination reason
    print(f"Error: {cluster.termination_reason}")

SPARK_DRIVER_OOM (Out of Memory)

Error Message:

java.lang.OutOfMemoryError: Java heap space
SparkException: Job aborted due to stage failure

Cause: Driver or executor running out of memory.

Solution:

# Increase memory via the cluster's Spark config ("spark_conf" in the
# cluster JSON); on Databricks, choosing a larger driver node type is
# often the more reliable fix for driver OOM
{
    "spark.driver.memory": "8g",
    "spark.executor.memory": "8g",
    "spark.sql.shuffle.partitions": "200"
}

# Or use more efficient operations
# WRONG: collect() on large data
all_data = df.collect()  # DON'T DO THIS

# RIGHT: process in chunks or use distributed operations
df.write.format("delta").save("/path")  # Keep data distributed
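When raising `spark.sql.shuffle.partitions`, a common rule of thumb (an informal heuristic, not an official Databricks formula) is to target roughly 128 MB of shuffle data per partition:

```python
def suggest_shuffle_partitions(total_input_bytes: int,
                               target_partition_mb: int = 128) -> int:
    """Heuristic: aim for ~target_partition_mb of data per shuffle
    partition, never going below 1."""
    target_bytes = target_partition_mb * 1024 * 1024
    # Round up so the last partition is not oversized
    return max(1, -(-total_input_bytes // target_bytes))

# e.g. a 50 GB shuffle at the 128 MB default target
print(suggest_shuffle_partitions(50 * 1024**3))  # 400
```

Setting the value far above or below this estimate tends to produce many tiny tasks or a few memory-hungry ones, respectively.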

DELTA_CONCURRENT_WRITE

Error Message:

ConcurrentAppendException: Files were added by a concurrent update
ConcurrentDeleteReadException: A concurrent operation modified files

Cause: Multiple jobs writing to same Delta table simultaneously.

Solution:

# Option 1: Retry with isolation level
df.write \
    .format("delta") \
    .option("isolationLevel", "Serializable") \
    .mode("append") \
    .save("/path")

# Option 2: Use merge with retry logic
from delta.tables import DeltaTable
import time

def merge_with_retry(source_df, target_path, merge_condition, retries=3):
    for attempt in range(retries):
        try:
            delta_table = DeltaTable.forPath(spark, target_path)
            delta_table.alias("t").merge(
                source_df.alias("s"),
                merge_condition
            ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
            return
        except Exception as e:
            if "Concurrent" in str(e) and attempt < retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise
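The retry pattern above can be sanity-checked without a cluster. A minimal simulation with a stand-in exception (`FakeConcurrentAppendError` is a hypothetical class for illustration, not a Delta type):

```python
import time

class FakeConcurrentAppendError(Exception):
    """Stand-in for Delta's ConcurrentAppendException."""

def run_with_retry(operation, retries=3, base_delay=0.01):
    """Same shape as merge_with_retry: exponential backoff on errors
    whose message mentions 'Concurrent'; re-raise anything else or
    after the final attempt."""
    for attempt in range(retries):
        try:
            return operation()
        except Exception as e:
            if "Concurrent" in str(e) and attempt < retries - 1:
                time.sleep(base_delay * 2 ** attempt)
                continue
            raise

# Fails twice with a concurrent-write error, succeeds on the third try
attempts = {"n": 0}
def flaky_merge():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise FakeConcurrentAppendError(
            "ConcurrentAppendException: Files were added by a concurrent update")
    return "merged"

print(run_with_retry(flaky_merge))  # merged
```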

PERMISSION_DENIED

Error Message:

PermissionDeniedException: User does not have permission
PERMISSION_DENIED: User does not have READ on table

Cause: Missing Unity Catalog or workspace permissions.

Solution:

-- Grant table permissions (Unity Catalog)
GRANT SELECT ON TABLE catalog.schema.table TO `user@company.com`;
GRANT ALL PRIVILEGES ON SCHEMA catalog.schema TO `data-team`;

-- Check current permissions
SHOW GRANTS ON TABLE catalog.schema.table;

-- For workspace objects (Admin required)
databricks permissions update jobs 123 --json '{
  "access_control_list": [{
    "user_name": "user@company.com",
    "permission_level": "CAN_MANAGE_RUN"
  }]
}'

INVALID_PARAMETER_VALUE

Error Message:

InvalidParameterValue: Instance type not supported
Invalid Spark version

Cause: Wrong cluster configuration for workspace/cloud.

Solution:

# Get valid node types for your workspace
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
node_types = w.clusters.list_node_types().node_types
for nt in node_types[:5]:
    print(f"{nt.node_type_id}: {nt.memory_mb}MB, {nt.num_cores} cores")

# Get valid Spark versions
versions = w.clusters.spark_versions().versions
for v in versions[:5]:
    print(v.key)
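Once listed, a valid type can be picked programmatically. A sketch over plain dicts shaped like the fields printed above (`node_type_id`, `memory_mb`, `num_cores` follow the SDK field names; the catalog values here are illustrative):

```python
def smallest_fitting_node(node_types, min_memory_mb, min_cores):
    """Return the node_type_id of the smallest node type meeting both
    minimums, or None if nothing fits."""
    fitting = [nt for nt in node_types
               if nt["memory_mb"] >= min_memory_mb
               and nt["num_cores"] >= min_cores]
    if not fitting:
        return None
    # "Smallest" = least memory, then fewest cores
    return min(fitting,
               key=lambda nt: (nt["memory_mb"], nt["num_cores"]))["node_type_id"]

catalog = [
    {"node_type_id": "m5.large",   "memory_mb": 8192,  "num_cores": 2},
    {"node_type_id": "m5.xlarge",  "memory_mb": 16384, "num_cores": 4},
    {"node_type_id": "m5.2xlarge", "memory_mb": 32768, "num_cores": 8},
]
print(smallest_fitting_node(catalog, 16000, 4))  # m5.xlarge
```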

SCHEMA_MISMATCH

Error Message:

AnalysisException: Cannot merge incompatible data types
Delta table schema does not match

Cause: Source data schema doesn’t match target table.

Solution:

# Option 1: Enable schema evolution
df.write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("/path")

# Option 2: Explicit schema alignment
from pyspark.sql.functions import col

target_schema = spark.table("target_table").schema

# Cast source columns to the target table's types
for field in target_schema:
    if field.name in df.columns:
        df = df.withColumn(field.name, col(field.name).cast(field.dataType))

# Option 3: Check differences before write
source_cols = set(df.columns)
target_cols = set(spark.table("target").columns)
print(f"Missing in source: {target_cols - source_cols}")
print(f"Extra in source: {source_cols - target_cols}")
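The checks in Option 3 can be rolled into a single pre-write guard. A sketch over plain column lists (my own helper, not a Delta or Spark API):

```python
def check_columns(source_cols, target_cols, allow_extra=False):
    """Raise ValueError if the source is missing target columns, or
    (unless allow_extra) carries columns the target lacks; otherwise
    return the two difference sets."""
    source, target = set(source_cols), set(target_cols)
    missing, extra = target - source, source - target
    if missing:
        raise ValueError(f"Missing in source: {sorted(missing)}")
    if extra and not allow_extra:
        raise ValueError(f"Extra in source: {sorted(extra)}")
    return missing, extra

# Extra columns are only safe when schema evolution (mergeSchema) is on
check_columns(["id", "name", "ts"], ["id", "name"], allow_extra=True)
```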

RATE_LIMIT_EXCEEDED

Error Message:

RateLimitExceeded: Too many requests
HTTP 429: Rate limit exceeded

Cause: Too many API calls in short period.

Solution:

# See databricks-rate-limits skill for full implementation
from databricks.sdk.errors import TooManyRequests
import time

def api_call_with_backoff(operation, max_retries=5):
    for attempt in range(max_retries):
        try:
            return operation()
        except TooManyRequests:
            delay = 2 ** attempt
            print(f"Rate limited, waiting {delay}s...")
            time.sleep(delay)
    raise Exception("Max retries exceeded")
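Usage looks like `api_call_with_backoff(lambda: w.jobs.list())`. The backoff schedule itself can be checked offline; a small helper (illustrative, not part of the SDK) makes the waits explicit:

```python
def backoff_delays(failures, max_retries=5):
    """Seconds the loop above would sleep through if the call failed
    'failures' times: 1s, 2s, 4s, ... capped at max_retries attempts."""
    return [2 ** attempt for attempt in range(min(failures, max_retries))]

print(backoff_delays(3))  # [1, 2, 4]
```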

JOB_RUN_FAILED

Error Message:

RunState: FAILED
Run terminated with error: Task failed with error

Cause: Varies; check the run details for the specific error.

Solution:

# Get detailed run info
databricks runs get --run-id 12345

# Get run output (stdout/stderr)
databricks runs get-output --run-id 12345

# Common fixes by task type:
# - Notebook: Check cell output for exception
# - Python: Check stderr for traceback
# - JAR: Check cluster driver logs
# - SQL: Check query execution details

# Programmatic debugging
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
run = w.jobs.get_run(run_id=12345)

print(f"State: {run.state.life_cycle_state}")
print(f"Result: {run.state.result_state}")
print(f"Message: {run.state.state_message}")

from databricks.sdk.service.jobs import RunResultState

for task in run.tasks:
    print(f"Task {task.task_key}: {task.state.result_state}")
    if task.state.result_state == RunResultState.FAILED:
        output = w.jobs.get_run_output(task.run_id)
        print(f"Error: {output.error}")
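A small triage map (my own grouping for illustration; the state names follow the Jobs API's result-state values) can route a finished run to a next debugging step:

```python
# Hypothetical mapping of common result states to a next action
NEXT_STEP = {
    "FAILED":   "inspect task output / stderr for the exception",
    "TIMEDOUT": "raise timeout_seconds or split the task",
    "CANCELED": "check who or what canceled the run",
    "SUCCESS":  "nothing to do",
}

def triage(result_state: str) -> str:
    """Suggest a next step; fall back to the run's state_message."""
    return NEXT_STEP.get(result_state, "check run.state.state_message")

print(triage("TIMEDOUT"))  # raise timeout_seconds or split the task
```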

Examples

Quick Diagnostic Commands

# Check cluster status
databricks clusters get --cluster-id abc123

# Get recent job runs
databricks runs list --job-id 456 --limit 5

# Check workspace permissions
databricks permissions get jobs 456

# Validate cluster config by attempting creation; invalid configs
# are rejected immediately (there is no dry-run flag)
databricks clusters create --json '{"cluster_name":"test",...}'

Escalation Path

  1. Collect evidence with databricks-debug-bundle
  2. Check Databricks Status
  3. Search Databricks Community
  4. Contact support with workspace ID and request ID

Next Steps

For comprehensive debugging, see databricks-debug-bundle.