databricks-repl
npx skills add https://github.com/wedneyyuri/databricks-repl --skill databricks-repl
Skill Documentation
Databricks RLM REPL
Execute Python on a Databricks cluster through a stateful REPL that follows the Recursive Language Model (RLM) pattern. The root LM (Claude Code) never sees auth tokens, connection boilerplate, raw polling, or large dataframes; it interacts with Databricks through a clean CLI that returns JSON metadata and file paths.
Design Principles
| Principle | Meaning |
|---|---|
| Clean root context | Claude sees only JSON metadata and file paths, never tokens, boilerplate, or raw output |
| Append-only execution | Commands are cells in an ordered log that are never edited, only appended, which enables lineage tracking and eviction replay |
| Output via files | stdout/stderr land as files. Claude reads selectively to control what enters its context |
| `sub_llm()` for recursion | On-cluster LLM calls for map/reduce, classification, summarization. No tools, only pure text completions |
| Everything recorded | `.cmd.py` files + `session.json` steps = full lineage |
Constraints
- Classic all-purpose clusters only: the Command Execution API requires them
- Max 145 execution contexts per cluster: destroy unused sessions to free slots
- Idle contexts are evicted permanently: the wrapper detects eviction and Claude replays steps
- Use `%pip install` for packages: `%pip` works in execution contexts; `%%pip` (double percent) does not
- Sub-LM calls are blocking: `sub_llm_batch()` provides parallelism within a single command
When Invoked
Arguments: $0 = Databricks CLI profile from ~/.databrickscfg, $1 = cluster name (or substring).
1. Resolve the cluster ID from the cluster name:

       python3 .claude/skills/databricks-repl/find_cluster.py --profile $0 --name "$1"

   Returns JSON with `cluster_id`, `cluster_name`, `state`. If there are multiple matches, ask the user to pick. If the cluster is not running, inform the user.

2. Check for an active session: look for `session.json` in the project dir. If one exists and is not destroyed, reuse it. If it is destroyed or missing, create a new one:

       python3 .claude/skills/databricks-repl/dbx_repl.py create \
         --cluster-id <RESOLVED_ID> --session-name <task-name> --profile $0

3. Execute the user’s code via `exec`. Keep the session alive for follow-up commands.

4. Read output selectively: use `stdout_preview` for small output, read the file for larger output. Present results to the user.

5. Only destroy when the user says they’re done, or when switching to a different task.
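
The cluster-resolution step above can be sketched as a small decision function. This is an illustration only: `choose_cluster` is a hypothetical helper, not part of the skill, and it assumes `find_cluster.py` prints a JSON list of matches whose `state` uses Databricks cluster state names such as `RUNNING`. The actual output shape is not specified here.

```python
import json


def choose_cluster(find_cluster_output: str) -> dict:
    """Decide the next action from the JSON printed by find_cluster.py.

    Assumption: the output is a JSON list of objects with the keys
    cluster_id, cluster_name, and state. A single object is also accepted.
    """
    matches = json.loads(find_cluster_output)
    if isinstance(matches, dict):
        matches = [matches]
    if not matches:
        return {"action": "report_no_match"}
    if len(matches) > 1:
        # Several clusters match the name substring: let the user pick.
        names = [m["cluster_name"] for m in matches]
        return {"action": "ask_user", "candidates": names}
    cluster = matches[0]
    if cluster["state"] != "RUNNING":
        # Assumed state value; the skill only says "not running".
        return {
            "action": "inform_not_running",
            "cluster_id": cluster["cluster_id"],
            "state": cluster["state"],
        }
    return {"action": "create_or_reuse_session", "cluster_id": cluster["cluster_id"]}
```

Returning a plain action dictionary keeps the decision separate from any prompting or CLI calls, so the same logic can be checked without a cluster.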
Quick Start
# 1. Create a session (injects bootstrap helpers into the REPL)
python .claude/skills/databricks-repl/dbx_repl.py create --cluster-id $CLUSTER_ID --session-name my-task --profile $0
# 2. Execute code (submit + wait, returns JSON metadata)
python .claude/skills/databricks-repl/dbx_repl.py exec \
--command "df = spark.read.parquet('s3://bucket/data')" \
--tag "load_data"
# 3. Read output selectively (never dump full stdout)
head -20 repl_outputs/001_load_data.stdout
# 4. Iterate: execute more commands, read results, refine
python .claude/skills/databricks-repl/dbx_repl.py exec \
--command "print(df.describe().toPandas().to_string())" \
--tag "explore"
# 5. Destroy when done
python .claude/skills/databricks-repl/dbx_repl.py destroy
Core Workflow
- Create a session: one per task. The wrapper creates an execution context and injects bootstrap helpers.
- Exec commands sequentially: each returns JSON metadata with `stdout_file`, `stderr_file`, and `stdout_preview`. Raw output never appears in the response.
- Read output files selectively: use `head`, `grep`, or partial reads. Control what enters your context.
- Persist results to Volumes using standard Python (`shutil.copy2`, `df.write.parquet`, `open().write()`) when you need outputs to survive eviction.
- Handle eviction if it occurs: the wrapper detects it and returns a structured response. Replay steps using idempotent patterns.
- Destroy the session when finished.
Core Principles
- You are the root LM. Reason about the task, decide what code to run, read outputs selectively. The wrapper handles auth tokens, SDK internals, and polling so they stay out of your context; there is no need to inspect them.
- Project directory = session scope. All commands use `--project-dir` (defaults to cwd) to locate `session.json` and `repl_outputs/`. Each project directory is an independent session. Use `--project-dir` when managing multiple tasks from a single location.
- Append-only execution. Each `exec` is a new cell in an ordered log. Append corrections rather than editing previous commands; the ordered log enables lineage tracking and eviction replay.
- Output via files. The wrapper writes stdout/stderr to disk. Read only what you need.
- `sub_llm()` for heavy lifting. When you need to classify rows, summarize partitions, or process text at scale, use `sub_llm()` inside the REPL; it calls a Databricks serving endpoint.
- Use standard Python for persistence. Write files to Volumes with `shutil.copy2`, `df.write.parquet()`, `open().write()`, etc. Track paths yourself for eviction recovery.
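
The "read only what you need" principle can be illustrated with two small helpers that mirror `head` and `grep`. This is a minimal sketch, not part of the skill: `read_head` and `grep_lines` are hypothetical names, and the limits are arbitrary defaults.

```python
from itertools import islice
from pathlib import Path


def read_head(path, max_lines=20, max_chars=2000):
    """Return at most max_lines lines (capped at max_chars characters)."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as fh:
        text = "".join(islice(fh, max_lines))  # stops reading after max_lines
    return text[:max_chars]


def grep_lines(path, needle, max_hits=10):
    """Return up to max_hits (line_number, line) pairs that contain needle."""
    hits = []
    with Path(path).open("r", encoding="utf-8", errors="replace") as fh:
        for line_no, line in enumerate(fh, start=1):
            if needle in line:
                hits.append((line_no, line.rstrip("\n")))
                if len(hits) >= max_hits:
                    break  # stop early so large files are not scanned fully
    return hits
```

Both helpers stream the file line by line, so a multi-megabyte stdout file contributes only a bounded amount of text to the context.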
CLI Reference
All commands accept these common options:
| Option | Default | Description |
|---|---|---|
| `--project-dir` | `.` (cwd) | Project root directory. `session.json` and `repl_outputs/` live here. |
| `--profile` | SDK default | Databricks CLI profile name from `~/.databrickscfg` |
| `--debug` | - | Print SDK-level debug information to stderr |
create
python .claude/skills/databricks-repl/dbx_repl.py create \
--cluster-id <CLUSTER_ID> \
--session-name <NAME>
| Parameter | Required | Default | Description |
|---|---|---|---|
| `--cluster-id` | Yes | - | Databricks cluster ID (classic all-purpose only) |
| `--session-name` | Yes | - | Human-readable session name (used in manifest) |
exec
# Inline code
python .claude/skills/databricks-repl/dbx_repl.py exec \
--command "df = spark.read.table('catalog.schema.table')" \
--tag "load_data"
# Code from file (prefix with @)
python .claude/skills/databricks-repl/dbx_repl.py exec \
--command @step3_train.py \
--tag "train"
| Parameter | Required | Default | Description |
|---|---|---|---|
| `--command` | Yes | - | Python code (inline string or `@filepath`) |
| `--tag` | Yes | - | Descriptive label for this step |
| `--timeout` | No | 600 | Max seconds to wait. If exceeded, returns `Running` status (the command keeps running on the cluster). |
Side effects per exec:
- Command saved to `{project-dir}/repl_outputs/{step}_{tag}.cmd.py`
- stdout/stderr written to `.stdout`/`.stderr` files
- `session.json` steps array updated with the new step
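
Because every command is saved under a `{step}_{tag}.cmd.py` name, the execution order can be recovered from the filenames alone. The sketch below assumes the step prefix is a zero-padded number, as in `001_load_data` and `003_train`; `ordered_commands` is a hypothetical helper, and the `session.json` steps array remains the authoritative record.

```python
from pathlib import Path


def ordered_commands(project_dir="."):
    """Return (step, tag, path) tuples for saved commands in execution order.

    Assumption: files are named {step}_{tag}.cmd.py with a numeric step prefix.
    """
    out_dir = Path(project_dir) / "repl_outputs"
    entries = []
    for path in out_dir.glob("*.cmd.py"):
        stem = path.name[: -len(".cmd.py")]
        step, _, tag = stem.partition("_")  # split on the first underscore only
        if step.isdigit():
            entries.append((int(step), tag, path))
    # Sort numerically so step 10 comes after step 2.
    return sorted(entries, key=lambda entry: entry[0])
```

A listing like this is enough to review lineage or to decide which steps to replay after an eviction.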
cancel
# Auto-detect active command from session.json (preferred)
python .claude/skills/databricks-repl/dbx_repl.py cancel
# Or specify a command ID explicitly
python .claude/skills/databricks-repl/dbx_repl.py cancel --run-id <COMMAND_ID>
| Parameter | Required | Default | Description |
|---|---|---|---|
| `--run-id` | No | auto-detect | Command ID to cancel. If omitted, reads `active_command` from `session.json`. |
await
Re-poll a command that is still running after a previous exec or await timed out.
# Re-poll with default timeout (600s)
python .claude/skills/databricks-repl/dbx_repl.py await
# Re-poll with custom timeout
python .claude/skills/databricks-repl/dbx_repl.py await --timeout 1200
| Parameter | Required | Default | Description |
|---|---|---|---|
| `--timeout` | No | 600 | Max seconds to wait before returning `Running` status again. |
Reads active_command from session.json. If the command has finished, writes output files and records the step. If still running, returns a Running response with a tip for next actions.
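
The re-poll behaviour described above can be sketched as a bounded loop around `await`. This is a hedged illustration: `run_cli` and `wait_until_done` are hypothetical helpers, and `run_cli` assumes the CLI prints a single JSON document on stdout.

```python
import json
import subprocess

DBX_REPL = ".claude/skills/databricks-repl/dbx_repl.py"


def run_cli(args):
    """Run dbx_repl.py with args and parse its output (assumed JSON on stdout)."""
    proc = subprocess.run(
        ["python3", DBX_REPL, *args], capture_output=True, text=True
    )
    return json.loads(proc.stdout)


def wait_until_done(response, timeout=600, max_polls=3, run=run_cli):
    """Call `await` until the status is no longer Running or max_polls is hit.

    `response` is the JSON from the preceding exec or await call.
    """
    polls = 0
    while response.get("status") == "Running" and polls < max_polls:
        response = run(["await", "--timeout", str(timeout)])
        polls += 1
    return response
```

If the returned status is still `Running` after `max_polls` attempts, the caller can decide between waiting longer and issuing `cancel`. Passing `run` as a parameter allows the loop to be exercised without a cluster.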
destroy
python .claude/skills/databricks-repl/dbx_repl.py destroy
Response Schemas
Exec success:
{
"cmd_id": "003_train",
"status": "Finished",
"stdout_file": "repl_outputs/003_train.stdout",
"stderr_file": "repl_outputs/003_train.stderr",
"stdout_bytes": 1247,
"stderr_bytes": 342,
"stdout_preview": "Training complete. F1: 0.87...",
"elapsed_seconds": 47.3
}
Exec error:
{
"cmd_id": "003_train",
"status": "Error",
"stdout_file": "repl_outputs/003_train.stdout",
"stderr_file": "repl_outputs/003_train.stderr",
"stdout_bytes": 0,
"stderr_bytes": 891,
"stderr_preview": "Traceback (most recent call last):\n File...",
"elapsed_seconds": 2.1
}
Exec interrupted (Ctrl+C):
{
"cmd_id": "003_train",
"status": "Cancelled",
"stdout_file": "repl_outputs/003_train.stdout",
"stderr_file": "repl_outputs/003_train.stderr",
"stdout_bytes": 0,
"stderr_bytes": 0,
"elapsed_seconds": 12.4,
"message": "Command interrupted by user."
}
Exec timeout (command still running):
{
"cmd_id": "003_train",
"status": "Running",
"command_id": "abc123-def456",
"elapsed_seconds": 600.0,
"message": "Command still running after 600s client-side timeout.",
"tip": {
"await": "dbx_repl.py await --timeout 600 --project-dir .",
"cancel": "dbx_repl.py cancel --project-dir ."
}
}
Cancel success:
{
"status": "cancelled",
"command_id": "abc123-def456"
}
Exec eviction detected: See references/eviction-replay.md.
Decision guide after exec:
- `stdout_bytes` small (< 500): read the full file
- `stdout_bytes` large: use `stdout_preview` or `head`/`grep` on the file
- `status: "Error"`: read the stderr file, fix the code, exec again
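
The decision guide can be expressed as a small function over the response JSON. This is a sketch under the schemas shown above; `next_step` and its action labels are hypothetical, and the 500-byte threshold is taken from the guide.

```python
def next_step(response, small_bytes=500):
    """Map an exec/await response to a suggested (action, file) pair."""
    status = response.get("status")
    if status == "Error":
        # Read the traceback, fix the code, then exec again.
        return ("read_stderr", response["stderr_file"])
    if status == "Running":
        # The command is still executing on the cluster.
        return ("await_or_cancel", None)
    if status == "Finished":
        if response.get("stdout_bytes", 0) < small_bytes:
            return ("read_full_stdout", response["stdout_file"])
        return ("read_preview_or_head", response["stdout_file"])
    # Cancelled, eviction, or any other response: inspect it directly.
    return ("inspect_response", None)
```

For the exec success example above (`stdout_bytes` of 1247), this returns the `read_preview_or_head` action with `repl_outputs/003_train.stdout`.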
Error Handling
| Error | Cause | Resolution |
|---|---|---|
| `ClusterNotRunning` | Cluster is off or terminating | Start the cluster and retry |
| `ContextEvicted` | Idle context was garbage-collected | See references/eviction-replay.md |
| `ContextLimitReached` | 145 contexts on this cluster | Destroy unused sessions |
| `Running` | Exec exceeded `--timeout` | Use `await` to re-poll, `cancel` to abort, or increase `--timeout` |
| `NothingToAwait` | No active command to await | Command already finished or was cancelled |
| `AuthError` | Missing or invalid credentials | Check `~/.databrickscfg` profile or env vars |
| `NothingToCancel` | No active command running | Nothing to do; the command already finished |
| `NoSession` | No `session.json` in project directory | Run `create` first |
Best Practices
- Tag every command with a descriptive `--tag` (e.g., `load_data`, `train`, `evaluate`). Tags appear in the manifest and output filenames.
- One logical step per exec. Keep commands focused. Avoid multi-paragraph scripts in a single exec; split them into steps.
- Persist expensive results to Volumes. If it takes more than a few seconds to compute, write it to a Volume with standard Python. This is both a safety net and a replay optimization.
- Use `sub_llm()` instead of manual iteration. When processing rows, classifying text, or summarizing, use the on-cluster LM, not your own context.
- Handle timeouts gracefully. If an exec returns `status: "Running"`, the command is still executing on the cluster. Use `await` to re-poll with a fresh timeout, or `cancel` to abort. Do not run another `exec` while a command is running, because the execution context is single-threaded and the new command will queue behind the old one.
- Read before you act. After an exec, read the output file before deciding the next step. Don’t chain multiple execs blindly.
- Execute from file for long code. Write complex code to a `.py` file first, then use `--command @filename.py`.
- Use `--project-dir` for multi-task workflows. Run independent sessions in separate project directories:

      python .claude/skills/databricks-repl/dbx_repl.py exec --project-dir ./examples/categorization --command "..." --tag "classify"
      python .claude/skills/databricks-repl/dbx_repl.py exec --project-dir ./examples/reranking --command "..." --tag "rerank"
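
The advice to persist expensive results doubles as an idempotent replay pattern: compute only when the saved file is missing. The sketch below is an assumption-laden illustration, not the skill's own helper. `compute_once` is a hypothetical name, results are assumed to be JSON-serializable, and on a cluster the path would point into a Volume (for example under `/Volumes/<catalog>/<schema>/<volume>/`).

```python
import json
from pathlib import Path


def compute_once(path, compute):
    """Return the JSON result stored at path, computing and saving it if absent."""
    target = Path(path)
    if target.exists():
        # Replay after eviction: reuse the persisted result instead of recomputing.
        return json.loads(target.read_text(encoding="utf-8"))
    result = compute()
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Re-running a step written this way after an eviction costs one file read instead of a full recomputation. For Spark DataFrames the same idea applies with `df.write.parquet()` and `spark.read.parquet()` in place of the JSON calls.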
Reference Files
Read these as needed for detailed documentation:
- references/bootstrap-helpers.md: Read when using `sub_llm()` or `sub_llm_batch()` inside the REPL. Includes signatures and examples.
- references/eviction-replay.md: Read when a command fails due to context eviction. Covers detection, the eviction response format, and idempotent replay strategies.