databricks-repl-consolidate

📁 wedneyyuri/databricks-repl 📅 5 days ago
Install command
npx skills add https://github.com/wedneyyuri/databricks-repl --skill databricks-repl-consolidate


Skill Documentation

Session Consolidation

Produce a single, clean .py file from a Databricks REPL session by reading session.json and the .cmd.py files.

Workflow

  1. Read session.json — the steps array contains the ordered list of steps with status and command file paths.
  2. Read each .cmd.py file — in step order, skipping failed steps (only successful steps survive).
  3. Strip REPL boilerplate — remove or convert REPL-specific calls (see Boilerplate Rules).
  4. Deduplicate — if a step was retried after an error, only keep the final successful version.
  5. Resolve imports — collect all imports from across cells and deduplicate them at the top of the file.
  6. Write the output — a single .py file with a clear structure.
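
Steps 1–2 of the workflow can be sketched as follows. The field names (`status`, `tag`, `command_file`) are illustrative assumptions about the session.json schema, not the skill's documented format:

```python
import json
from pathlib import Path

def load_successful_steps(session_path="session.json"):
    """Read session.json and return (tag, code) for each successful step, in order.

    Assumes each entry in the 'steps' array carries 'status', 'tag', and a
    'command_file' path to its .cmd.py file -- field names are assumptions.
    """
    session = json.loads(Path(session_path).read_text())
    steps = []
    for step in session["steps"]:
        if step.get("status") != "Finished":
            continue  # failed steps are dropped entirely
        code = Path(step["command_file"]).read_text()
        steps.append((step.get("tag", ""), code))
    return steps
```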

Output Structure

"""
Consolidated from session: <session_name>
Source: <session_file_path>
Steps: <N> (of <total> attempted)
"""

# --- Dependencies ---
# Requires: scikit-learn, xgboost

# --- Imports ---
import os
import json
from sklearn.ensemble import RandomForestClassifier
# ...

# --- Step 1: load_data ---
df = spark.read.table("catalog.schema.table")
# ...

# --- Step 2: feature_engineering ---
# ...

# --- Step 3: train ---
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
joblib.dump(model, "/Volumes/catalog/schema/vol/model.pkl")
# ...

# --- Step 4: evaluate ---
# ...

Boilerplate Rules

Transform REPL-specific code into clean Python:

| REPL Code | Consolidated Form |
| --- | --- |
| `%pip install xgboost` | Move to `# Requires: xgboost` in header |
| `sub_llm(prompt, ...)` | Keep as-is (it’s business logic) |
| `sub_llm_batch(prompts, ...)` | Keep as-is (it’s business logic) |

Key distinctions:

  • %pip install → collect into a # Requires: header comment
  • sub_llm() / sub_llm_batch() → keep unchanged, these are meaningful business logic
  • print() statements used only for REPL feedback → remove
  • print() statements that display meaningful results → keep
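
A minimal sketch of the `%pip install` rule: magic lines become `# Requires:` entries and everything else passes through unchanged. Classifying `print()` calls as feedback vs. meaningful output needs judgement, so it is deliberately left out here:

```python
import re

def strip_boilerplate(code):
    """Split one cell into (requirements, kept_lines).

    '%pip install a b' lines are removed and their package names collected
    for the '# Requires:' header; all other lines (including sub_llm calls)
    are kept verbatim.
    """
    requires, kept = [], []
    for line in code.splitlines():
        m = re.match(r"\s*%pip\s+install\s+(.+)", line)
        if m:
            requires.extend(m.group(1).split())
        else:
            kept.append(line)
    return requires, kept
```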

Deduplication Rules

Sessions often contain retries after errors. When multiple steps share the same tag:

  1. Find all steps with the same tag in session.json
  2. Keep only the last one with status: "Finished"
  3. Discard earlier failed attempts

When adjacent steps do the same thing (e.g., loading the same table with slight variations), keep only the final version.
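
The tag-based rule above can be sketched as a last-writer-wins pass, assuming each step is a dict with `tag` and `status` keys (an assumed schema):

```python
def dedupe_by_tag(steps):
    """Keep only the last 'Finished' step per tag, preserving first-seen tag order."""
    last, order = {}, []
    for step in steps:
        if step.get("status") != "Finished":
            continue  # discard failed attempts
        tag = step["tag"]
        if tag not in last:
            order.append(tag)
        last[tag] = step  # a later success overwrites an earlier one
    return [last[tag] for tag in order]
```

Collapsing merely similar adjacent steps (same table, slight variations) is fuzzier and still needs case-by-case judgement.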

Import Resolution

  1. Scan all surviving steps for import and from ... import statements
  2. Deduplicate — same import appearing in multiple steps becomes one line
  3. Place all imports at the top of the file, after the docstring and dependencies comment
  4. Remove imports that are no longer used after boilerplate stripping
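
Steps 1–3 can be sketched with simple line matching. Multi-line or conditional imports would need real parsing (e.g. the `ast` module), and pruning unused imports (step 4) is out of scope here:

```python
def collect_imports(cells):
    """Hoist import lines out of each cell and deduplicate them.

    Returns (imports, cells_without_imports); imports keep first-seen order.
    """
    imports, body, seen = [], [], set()
    for cell in cells:
        kept = []
        for line in cell.splitlines():
            stripped = line.strip()
            if stripped.startswith(("import ", "from ")):
                if stripped not in seen:  # same import in two cells -> one line
                    seen.add(stripped)
                    imports.append(stripped)
            else:
                kept.append(line)
        body.append("\n".join(kept))
    return imports, body
```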

Before / After Example

Before (3 separate .cmd.py files)

001_install.cmd.py:

%pip install scikit-learn pandas

002_load.cmd.py:

import pandas as pd
df = spark.read.table("catalog.schema.customers").toPandas()
print(f"Loaded {len(df)} rows")

003_train.cmd.py:

from sklearn.ensemble import RandomForestClassifier
import joblib

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(df[features], df["label"])
joblib.dump(model, "/Volumes/catalog/schema/vol/model.pkl")
print("Training complete")

After (consolidated .py)

"""
Consolidated from session: customer-classifier
Source: ./session.json
Steps: 3 (of 3 attempted)
"""

# --- Dependencies ---
# Requires: scikit-learn, pandas

# --- Imports ---
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# --- Step 1: load ---
df = spark.read.table("catalog.schema.customers").toPandas()

# --- Step 2: train ---
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(df[features], df["label"])
joblib.dump(model, "/Volumes/catalog/schema/vol/model.pkl")

Usage

  1. Ensure session.json has a steps array with at least one successful step
  2. Read session.json to understand the session structure
  3. Read each .cmd.py file referenced in the steps
  4. Apply the boilerplate rules, deduplication, and import resolution
  5. Write the consolidated file (default: <session_name>.py in the repo root)
  6. Review the output for correctness — automated consolidation may miss nuances in variable dependencies across steps