pueue-job-orchestration

Source: terrylica/cc-skills
Install command
npx skills add https://github.com/terrylica/cc-skills --skill pueue-job-orchestration


Pueue Job Orchestration

Manage long-running tasks on BigBlack/LittleBlack GPU workstations using the Pueue job queue.

Overview

Pueue is a Rust CLI tool for managing shell command queues. It provides:

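  • Daemon persistence – Jobs run under the pueued daemon rather than the SSH session, so they survive SSH disconnects
  • Disk-backed queue – Queue state is saved to disk and restored when the daemon restarts (e.g. after a crash or reboot); jobs that were running at that moment are not resumed and need a restart
  • Group-based parallelism – Cap the number of concurrent jobs per group
  • Easy failure recovery – Restart failed jobs with one command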

When to Use This Skill

Use this skill when the user mentions:

| Trigger | Example |
|---|---|
| Running tasks on BigBlack/LittleBlack | “Run this on bigblack” |
| Long-running data processing | “Populate the cache for all symbols” |
| Batch/parallel operations | “Process these 70 jobs” |
| SSH remote execution | “Execute this overnight on the GPU server” |
| Cache population | “Fill the ClickHouse cache” |

Quick Reference

Check Status

# Local
pueue status

# Remote (BigBlack)
ssh bigblack "~/.local/bin/pueue status"

Queue a Job

# Local
pueue add -- python long_running_script.py

# Remote (BigBlack): quote the whole command so everything after && also runs inside Pueue
ssh bigblack "~/.local/bin/pueue add -- 'cd ~/project && uv run python script.py'"

# With group (for parallelism control)
pueue add --group p1 --label "BTCUSDT@1000" -- python populate.py --symbol BTCUSDT

Monitor Jobs

pueue follow <id>         # Watch job output in real-time
pueue log <id>            # View completed job output
pueue log <id> --full     # Full output (not truncated)

Manage Jobs

pueue restart <id>        # Restart failed job
pueue restart --all-failed # Restart ALL failed jobs
pueue kill <id>           # Kill running job
pueue clean               # Remove finished jobs (successful and failed) from the list
pueue reset               # Kill all jobs and clear the whole list (use with caution)

Host Configuration

| Host | Pueue binary | Parallelism groups (max concurrent jobs) |
|---|---|---|
| BigBlack | ~/.local/bin/pueue | p1 (4), p2 (2), p3 (3), p4 (1) |
| LittleBlack | ~/.local/bin/pueue | default (2) |
| Local (macOS) | /opt/homebrew/bin/pueue | default |

Workflows

1. Queue Single Remote Job

# Step 1: Verify daemon is running
ssh bigblack "~/.local/bin/pueue status"

# Step 2: Queue the job (quote the whole command so cd and && run inside the job)
ssh bigblack "~/.local/bin/pueue add --label 'my-job' -- 'cd ~/project && uv run python script.py'"

# Step 3: Monitor progress (use the task id printed in step 2)
ssh bigblack "~/.local/bin/pueue follow <id>"
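Step 3 needs the task id. `pueue add` accepts `--print-task-id` (`-p`), which prints only the new id, so a script can capture it directly. A minimal sketch, assuming bash on the client; `queue_remote` is a hypothetical wrapper (not part of Pueue), and `SSH` is overridable only so the function can be dry-run:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: queue JOB on HOST via Pueue and print only the task id.
# The job string is escaped with printf %q so the remote shell hands it to
# `pueue add` as ONE argument; otherwise everything after `&&` would run
# directly in the SSH session instead of inside Pueue.
queue_remote() {
  local host="$1" job="$2" label="${3:-}"
  # Single quotes keep "~" literal here; the remote shell expands it.
  local cmd='~/.local/bin/pueue add --print-task-id'
  if [[ -n "$label" ]]; then
    cmd+=" --label $(printf '%q' "$label")"
  fi
  # %q escapes spaces, &&, quotes, etc. so the remote shell sees one word.
  cmd+=" -- $(printf '%q' "$job")"
  "${SSH:-ssh}" "$host" "$cmd"
}
```

Usage: `id="$(queue_remote bigblack 'cd ~/project && uv run python script.py' my-job)"`, then `ssh bigblack "~/.local/bin/pueue follow $id"`.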

2. Batch Job Submission (Multiple Symbols)

For rangebar cache population or similar batch operations:

# Use the pueue-populate.sh script
ssh bigblack "cd ~/rangebar-py && ./scripts/pueue-populate.sh setup"   # One-time
ssh bigblack "cd ~/rangebar-py && ./scripts/pueue-populate.sh phase1"  # Queue Phase 1
ssh bigblack "cd ~/rangebar-py && ./scripts/pueue-populate.sh status"  # Check progress
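For ad-hoc batches that `pueue-populate.sh` does not cover, a plain loop over `pueue add` gives each job its own label, so `pueue status` stays readable. A minimal sketch, assuming the hypothetical `populate.py --symbol` entry point from the Quick Reference example; `queue_symbols` is a hypothetical name, and `PUEUE` defaults to `pueue` on PATH (set it to `~/.local/bin/pueue` on BigBlack):

```shell
#!/usr/bin/env bash
# Hypothetical batch helper: queue one Pueue job per symbol into GROUP.
# PUEUE may point at another binary (or a stub for a dry run).
queue_symbols() {
  local group="$1"; shift
  local sym
  for sym in "$@"; do
    # Pueue joins the words after -- into one command string for its shell.
    "${PUEUE:-pueue}" add --group "$group" --label "$sym" \
      -- python populate.py --symbol "$sym" || return 1
  done
}
```

Usage: `queue_symbols p1 BTCUSDT ETHUSDT SOLUSDT`.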

3. Configure Parallelism Groups

# Create groups with different parallelism limits
pueue group add fast      # Create 'fast' group
pueue parallel 4 --group fast  # Allow 4 parallel jobs

pueue group add slow
pueue parallel 1 --group slow  # Sequential execution

# Queue jobs to specific groups
pueue add --group fast -- echo "fast job"
pueue add --group slow -- echo "slow job"
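`pueue group add` reports an error when the group already exists, so a setup script that may be re-run can ignore that error and always reapply the limit. A minimal sketch; `ensure_groups` is a hypothetical name, the `NAME:LIMIT` argument format is an assumption, and `PUEUE` defaults to `pueue` on PATH:

```shell
#!/usr/bin/env bash
# Hypothetical, re-runnable group setup. Each argument is NAME:LIMIT.
# PUEUE may point at another binary (or a stub for a dry run).
ensure_groups() {
  local spec name limit
  for spec in "$@"; do
    name="${spec%%:*}"
    limit="${spec##*:}"
    # Ignore the error (and output) when the group already exists.
    "${PUEUE:-pueue}" group add "$name" >/dev/null 2>&1 || true
    # Always (re)apply the limit so reruns converge on the same config.
    "${PUEUE:-pueue}" parallel "$limit" --group "$name" || return 1
  done
}
```

For the BigBlack groups in the Host Configuration table this would be `PUEUE=~/.local/bin/pueue ensure_groups p1:4 p2:2 p3:3 p4:1`.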

4. Handle Failed Jobs

# Check what failed
pueue status | grep Failed

# View error output
pueue log <id>

# Restart specific job
pueue restart <id>

# Restart all failed jobs
pueue restart --all-failed

Installation

macOS (Local)

brew install pueue
pueued -d  # Start daemon

Linux (BigBlack/LittleBlack)

# Option 1: rangebar-py setup script
curl -sSL https://raw.githubusercontent.com/terrylica/rangebar-py/main/scripts/setup-pueue-linux.sh | bash

# Option 2: download the binaries manually (see https://github.com/Nukesor/pueue/releases for the latest version)
# SSoT-OK: Version from GitHub releases page
PUEUE_VERSION="v4.0.2"
mkdir -p ~/.local/bin
curl -fsSL "https://github.com/Nukesor/pueue/releases/download/${PUEUE_VERSION}/pueue-x86_64-unknown-linux-musl" -o ~/.local/bin/pueue
curl -fsSL "https://github.com/Nukesor/pueue/releases/download/${PUEUE_VERSION}/pueued-x86_64-unknown-linux-musl" -o ~/.local/bin/pueued
chmod +x ~/.local/bin/pueue ~/.local/bin/pueued

# Start daemon
~/.local/bin/pueued -d

Systemd Auto-Start (Linux)

mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/pueued.service << 'EOF'
[Unit]
Description=Pueue Daemon
After=network.target

[Service]
ExecStart=%h/.local/bin/pueued -v
Restart=on-failure

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable --now pueued

# Keep the user service running after logout and start it at boot
loginctl enable-linger "$USER"

Integration with rangebar-py

The rangebar-py project has Pueue integration scripts:

| Script | Purpose |
|---|---|
| scripts/pueue-populate.sh | Queue cache population jobs with group-based parallelism |
| scripts/setup-pueue-linux.sh | Install Pueue on Linux servers |
| scripts/populate_full_cache.py | Python script for individual symbol/threshold jobs |

Phase-Based Execution

# Phase 1: 1000 dbps (fast, 4 parallel)
./scripts/pueue-populate.sh phase1

# Phase 2: 250 dbps (moderate, 2 parallel)
./scripts/pueue-populate.sh phase2

# Phase 3: 500, 750 dbps (3 parallel)
./scripts/pueue-populate.sh phase3

# Phase 4: 100 dbps (resource intensive, 1 at a time)
./scripts/pueue-populate.sh phase4
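The phase limits above (4, 2, 3, 1) match the BigBlack groups p1 to p4, which suggests each phase queues into its own group; if so, queued phases can run side by side. To finish one phase before queueing the next, `pueue wait --group` blocks until a group's tasks are done. A minimal sketch, assuming phase N queues into group pN (inferred, not confirmed by the script) and that it runs from the rangebar-py checkout; `run_phases_in_order` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical sequential runner. ASSUMES phase N queues into group pN.
# PUEUE and POPULATE can be overridden (e.g. with stubs for a dry run).
run_phases_in_order() {
  local n
  for n in "$@"; do
    "${POPULATE:-./scripts/pueue-populate.sh}" "phase$n" || return 1
    # Block until the tasks in group pN have finished.
    "${PUEUE:-pueue}" wait --group "p$n" || return 1
  done
}
```

Usage: `run_phases_in_order 1 2 3 4`. Running it inside a Pueue job or a tmux session keeps the loop itself alive across SSH disconnects.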

Troubleshooting

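| Issue | Cause | Solution |
|---|---|---|
| pueue: command not found | ~/.local/bin not in PATH (common in non-interactive SSH sessions) | Use the full path: ~/.local/bin/pueue |
| Connection refused | Daemon not running | Start it with pueued -d (or systemctl --user start pueued) |
| Jobs stuck in Queued | Group paused or at its parallel limit | Check pueue status; resume with pueue start --group <name>, or raise the limit with pueue parallel |
| SSH disconnect kills jobs | Command run directly in the SSH session, not via Pueue | Queue it with pueue add instead |
| Job fails immediately | Wrong working directory | Quote the whole command, pueue add -- 'cd /path && command', or pass pueue add --working-directory /path |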
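Commands run as `ssh host "cmd"` use a non-interactive shell, and many default `~/.bashrc` files return early in that case, so a `PATH` entry added further down may never apply. A minimal sketch of an idempotent fix, assuming bash; `ensure_local_bin_on_path` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepend ~/.local/bin to PATH unless already present.
ensure_local_bin_on_path() {
  local dir="$HOME/.local/bin"
  case ":$PATH:" in
    *":$dir:"*) ;;                       # already present: leave PATH as is
    *) PATH="$dir:$PATH"; export PATH ;;
  esac
}
```

Call it near the top of `~/.bashrc`, before any early `return` for non-interactive shells, or at the start of a job script.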

Related

  • Hook: itp-hooks/posttooluse-reminder.ts – Reminds to use Pueue for detected long-running commands
  • Reference: Pueue GitHub
  • Issue: rangebar-py#77 – Original implementation