databricks-repl-consolidate

📁 wedneyyuri/databricks-repl 📅 5 days ago
Install command
npx skills add https://github.com/wedneyyuri/databricks-repl --skill databricks-repl-consolidate


Skill Documentation

Session Consolidation

Produce a single, clean .py file from a Databricks REPL session by reading session.json and the .cmd.py files.

Workflow

  1. Read session.json — the steps array contains the ordered list of steps with status and command file paths.
  2. Read each .cmd.py file — in step order, skipping failed steps (only successful steps survive).
  3. Strip REPL boilerplate — remove or convert REPL-specific calls (see Boilerplate Rules).
  4. Deduplicate — if a step was retried after an error, only keep the final successful version.
  5. Resolve imports — collect all imports from across cells and deduplicate them at the top of the file.
  6. Write the output — a single .py file with a clear structure.
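
Steps 1–2 of the workflow can be sketched as follows. The field names (`status`, `tag`, `command_file`) are illustrative assumptions about the session.json schema, not the skill's documented format:

```python
import json
from pathlib import Path

def load_successful_steps(session_path="session.json"):
    """Read session.json and return (tag, code) for each successful step, in order.

    Assumes each entry in the 'steps' array carries 'status', 'tag', and a
    'command_file' path to its .cmd.py file -- field names are assumptions.
    """
    session = json.loads(Path(session_path).read_text())
    steps = []
    for step in session["steps"]:
        if step.get("status") != "Finished":
            continue  # failed steps are dropped entirely
        code = Path(step["command_file"]).read_text()
        steps.append((step.get("tag", ""), code))
    return steps
```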

Output Structure

"""
Consolidated from session: <session_name>
Source: <session_file_path>
Steps: <N> (of <total> attempted)
"""

# --- Dependencies ---
# Requires: scikit-learn, xgboost

# --- Imports ---
import os
import json
from sklearn.ensemble import RandomForestClassifier
# ...

# --- Step 1: load_data ---
df = spark.read.table("catalog.schema.table")
# ...

# --- Step 2: feature_engineering ---
# ...

# --- Step 3: train ---
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
joblib.dump(model, "/Volumes/catalog/schema/vol/model.pkl")
# ...

# --- Step 4: evaluate ---
# ...

Boilerplate Rules

Transform REPL-specific code into clean Python:

| REPL Code | Consolidated Form |
| --- | --- |
| `%pip install xgboost` | Move to `# Requires: xgboost` in header |
| `sub_llm(prompt, ...)` | Keep as-is (it’s business logic) |
| `sub_llm_batch(prompts, ...)` | Keep as-is (it’s business logic) |

Key distinctions:

  • %pip install → collect into a # Requires: header comment
  • sub_llm() / sub_llm_batch() → keep unchanged, these are meaningful business logic
  • print() statements used only for REPL feedback → remove
  • print() statements that display meaningful results → keep
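
A minimal sketch of the `%pip install` rule: magic lines become `# Requires:` entries and everything else passes through unchanged. Classifying `print()` calls as feedback vs. meaningful output needs judgement, so it is deliberately left out here:

```python
import re

def strip_boilerplate(code):
    """Split one cell into (requirements, kept_lines).

    '%pip install a b' lines are removed and their package names collected
    for the '# Requires:' header; all other lines (including sub_llm calls)
    are kept verbatim.
    """
    requires, kept = [], []
    for line in code.splitlines():
        m = re.match(r"\s*%pip\s+install\s+(.+)", line)
        if m:
            requires.extend(m.group(1).split())
        else:
            kept.append(line)
    return requires, kept
```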

Deduplication Rules

Sessions often contain retries after errors. When multiple steps share the same tag:

  1. Find all steps with the same tag in session.json
  2. Keep only the last one with status: "Finished"
  3. Discard earlier failed attempts

When adjacent steps do the same thing (e.g., loading the same table with slight variations), keep only the final version.
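
The tag-based rule above can be sketched as a last-writer-wins pass, assuming each step is a dict with `tag` and `status` keys (an assumed schema):

```python
def dedupe_by_tag(steps):
    """Keep only the last 'Finished' step per tag, preserving first-seen tag order."""
    last, order = {}, []
    for step in steps:
        if step.get("status") != "Finished":
            continue  # discard failed attempts
        tag = step["tag"]
        if tag not in last:
            order.append(tag)
        last[tag] = step  # a later success overwrites an earlier one
    return [last[tag] for tag in order]
```

Collapsing merely similar adjacent steps (same table, slight variations) is fuzzier and still needs case-by-case judgement.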

Import Resolution

  1. Scan all surviving steps for import and from ... import statements
  2. Deduplicate — same import appearing in multiple steps becomes one line
  3. Place all imports at the top of the file, after the docstring and dependencies comment
  4. Remove imports that are no longer used after boilerplate stripping
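
Steps 1–3 can be sketched with simple line matching. Multi-line or conditional imports would need real parsing (e.g. the `ast` module), and pruning unused imports (step 4) is out of scope here:

```python
def collect_imports(cells):
    """Hoist import lines out of each cell and deduplicate them.

    Returns (imports, cells_without_imports); imports keep first-seen order.
    """
    imports, body, seen = [], [], set()
    for cell in cells:
        kept = []
        for line in cell.splitlines():
            stripped = line.strip()
            if stripped.startswith(("import ", "from ")):
                if stripped not in seen:  # same import in two cells -> one line
                    seen.add(stripped)
                    imports.append(stripped)
            else:
                kept.append(line)
        body.append("\n".join(kept))
    return imports, body
```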

Before / After Example

Before (3 separate .cmd.py files)

001_install.cmd.py:

%pip install scikit-learn pandas

002_load.cmd.py:

import pandas as pd
df = spark.read.table("catalog.schema.customers").toPandas()
print(f"Loaded {len(df)} rows")

003_train.cmd.py:

from sklearn.ensemble import RandomForestClassifier
import joblib

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(df[features], df["label"])
joblib.dump(model, "/Volumes/catalog/schema/vol/model.pkl")
print("Training complete")

After (consolidated .py)

"""
Consolidated from session: customer-classifier
Source: ./session.json
Steps: 3 (of 3 attempted)
"""

# --- Dependencies ---
# Requires: scikit-learn, pandas

# --- Imports ---
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# --- Step 1: load ---
df = spark.read.table("catalog.schema.customers").toPandas()

# --- Step 2: train ---
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(df[features], df["label"])
joblib.dump(model, "/Volumes/catalog/schema/vol/model.pkl")

Usage

  1. Ensure session.json has a steps array with at least one successful step
  2. Read session.json to understand the session structure
  3. Read each .cmd.py file referenced in the steps
  4. Apply the boilerplate rules, deduplication, and import resolution
  5. Write the consolidated file (default: <session_name>.py in the repo root)
  6. Review the output for correctness — automated consolidation may miss nuances in variable dependencies across steps