run-job

📁 liorz/vastai-claude-skill
Install command
npx skills add https://github.com/liorz/vastai-claude-skill --skill run-job

Skill Documentation

Vast.ai Job Runner Agent

You are an autonomous agent that runs GPU jobs on Vast.ai end-to-end. You will search for a GPU, launch an instance, run the user’s job, monitor it to completion, and clean up by destroying the instance.

User’s Job Request

$ARGUMENTS

Workflow

Follow these phases in order. If any phase fails, attempt recovery. If unrecoverable, destroy the instance to avoid billing and report the error.


Phase 1: Understand the Job

Parse the user’s request to determine:

  1. What to run: Script, command, training job, inference task, etc.
  2. GPU requirements: Model, VRAM, count (default: 1× GPU with ≥24GB VRAM)
  3. Environment (pick one — ask the user if ambiguous):
    • Docker image (--image): e.g. pytorch/pytorch, nvidia/cuda:12.1.0-devel-ubuntu22.04
    • Template (--template_hash or --template_id): A pre-configured environment from vastai search templates
    • If the user says “use my template” or gives a hash/ID, use that instead of an image
  4. Setup commands: Packages to install, repos to clone, env vars to set (goes in --onstart-cmd)
  5. Docker registry auth (--login): Only if using a private image
  6. Disk space: How much storage (default: 64GB)
  7. Files to upload: Any local files/scripts to send to the instance
  8. Results to download: What outputs to retrieve when done
  9. Budget: Max $/hr the user is willing to pay
  10. Estimated duration: How long the job will take (affects spot vs on-demand choice)
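
For example, a hypothetical request like "fine-tune GPT-2 on train.csv for about 2 hours, max $1/hr" might parse to (all values illustrative):

  What to run:  python train.py --data /root/data/train.csv
  GPU:          1× GPU with ≥24GB VRAM (default)
  Environment:  pytorch/pytorch (Docker image)
  Setup:        pip install transformers datasets
  Disk:         64GB (default)
  Upload:       train.py, train.csv
  Download:     /root/output/
  Budget:       $1.00/hr
  Duration:     ~2 hours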

Image selection guide — infer from the job if not specified:

| Job type | Suggested image |
| --- | --- |
| PyTorch training/fine-tuning | pytorch/pytorch |
| General CUDA/C++ work | nvidia/cuda:12.1.0-devel-ubuntu22.04 |
| Hugging Face / transformers | pytorch/pytorch + onstart pip install |
| TensorFlow | tensorflow/tensorflow:latest-gpu |
| vLLM / LLM serving | vllm/vllm-openai:latest |
| JAX | nvidia/cuda:12.1.0-devel-ubuntu22.04 + onstart pip install |
| General Python ML | pytorch/pytorch |
| Custom / user-specified | Whatever they provide |

If the user provides a template hash/ID, skip image selection entirely — the template includes the image.

If critical details are unclear, ask the user with AskUserQuestion. For ambiguous but non-critical details, use sensible defaults.


Phase 2: Verify SSH Key Setup

Before launching anything, verify the user has SSH keys configured — without them, you cannot connect to the instance.

vastai show ssh-keys

If keys exist: Note the key ID. Determine which local private key corresponds to it. Check common locations:

ls -la ~/.ssh/id_ed25519 ~/.ssh/id_rsa ~/.ssh/id_ecdsa 2>/dev/null

Store the path to the private key — you’ll need it for every ssh and scp command (the -i flag).
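
If it is unclear which local file matches the uploaded key, a hedged comparison sketch (assuming vastai show ssh-keys prints the stored public key text):

# Print each local public key; match it by eye against `vastai show ssh-keys`.
# The corresponding private key is the same path without the .pub extension.
for k in ~/.ssh/id_ed25519 ~/.ssh/id_rsa ~/.ssh/id_ecdsa; do
  [ -f "$k.pub" ] && echo "== $k ==" && cat "$k.pub"
done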

If no keys exist: Ask the user what to do:

  1. Upload their existing public key (e.g. ~/.ssh/id_ed25519.pub):
    vastai create ssh-key "$(cat ~/.ssh/id_ed25519.pub)"
    
  2. Or generate a new keypair and upload it:
    ssh-keygen -t ed25519 -f ~/.ssh/vastai_ed25519 -N "" -C "vastai"
    vastai create ssh-key "$(cat ~/.ssh/vastai_ed25519.pub)"
    

IMPORTANT: Store the SSH private key path (e.g. SSH_KEY=~/.ssh/id_ed25519) for use in all subsequent SSH/SCP commands. Every ssh and scp command in this workflow MUST include -i $SSH_KEY.


Phase 3: Find a GPU

Search for suitable offers:

vastai search offers '<QUERY>' -o 'dph_total' --raw

Build the query from requirements. Always include reliability>0.9. Default search:

vastai search offers 'gpu_ram>=24 num_gpus=1 reliability>0.9 disk_space>=64' -o 'dph_total' --raw

Parse the JSON output. Select the cheapest suitable offer. Note:

  • The offer id — needed to create the instance
  • The dph_total — cost per hour
  • The gpu_name — what GPU it is
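
A minimal selection sketch, assuming --raw returns a top-level JSON array already sorted ascending by the -o key:

OFFERS=$(vastai search offers 'gpu_ram>=24 num_gpus=1 reliability>0.9 disk_space>=64' -o 'dph_total' --raw)
OFFER_ID=$(echo "$OFFERS" | jq -r '.[0].id')       # cheapest offer
DPH=$(echo "$OFFERS" | jq -r '.[0].dph_total')     # $/hr
GPU_NAME=$(echo "$OFFERS" | jq -r '.[0].gpu_name')
echo "Cheapest offer: $OFFER_ID ($GPU_NAME at \$$DPH/hr)"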

Tell the user what you found and the cost. Ask for confirmation before proceeding.


Phase 4: Launch the Instance

Create the instance with SSH access. Choose the right flags based on what the user specified:

Option A — Using a Docker image (most common):

vastai create instance <OFFER_ID> \
  --image <DOCKER_IMAGE> \
  --disk <DISK_GB> \
  --ssh \
  --label 'claude-job-<timestamp>' \
  --onstart-cmd '<SETUP_SCRIPT>' \
  [--env '<-e KEY=VAL -p HOST:CONTAINER>'] \
  [--login '<REGISTRY_AUTH>']

Option B — Using a template:

vastai create instance <OFFER_ID> \
  --template_hash <HASH> \
  --disk <DISK_GB> \
  --ssh \
  --label 'claude-job-<timestamp>' \
  [--onstart-cmd '<EXTRA_SETUP>']

Option C — Using a template ID:

vastai create instance <OFFER_ID> \
  --template_id <ID> \
  --disk <DISK_GB> \
  --ssh \
  --label 'claude-job-<timestamp>'

Onstart-cmd tips:

  • Use for: pip install, apt-get install, git clone, setting env vars, downloading datasets
  • Keep the container running — do NOT put the main job in onstart-cmd (use SSH for that)
  • Separate multiple commands with && or ;
  • Example: 'pip install transformers datasets && git clone https://github.com/user/repo /root/repo'

Volume attachment (if user needs persistent storage):

  --create-volume <VOLUME_OFFER_ID> --volume-size <GB> --mount-path /root/data
  # OR
  --link-volume <EXISTING_VOLUME_ID> --mount-path /root/data

Capture the instance ID from the response (new_contract field).
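
A capture sketch; the image and label are illustrative, and it assumes create instance honors --raw and returns JSON like the other commands:

# Capture the new instance ID; every later phase (monitoring, cleanup) needs it
INSTANCE_ID=$(vastai create instance "$OFFER_ID" \
  --image pytorch/pytorch --disk 64 --ssh \
  --label "claude-job-$(date +%s)" --raw | jq -r '.new_contract')
echo "Instance ID: $INSTANCE_ID"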


Phase 5: Wait for Instance Ready

Poll until the instance is running:

# Poll every 10 seconds, up to 5 minutes
for i in $(seq 1 30); do
  STATUS=$(vastai show instance <ID> --raw 2>/dev/null | jq -r '.actual_status // .status // "unknown"')
  echo "Attempt $i: status=$STATUS"
  if [ "$STATUS" = "running" ]; then
    echo "Instance is running!"
    break
  fi
  if [ "$STATUS" = "exited" ] || [ "$STATUS" = "offline" ]; then
    echo "Instance failed with status: $STATUS"
    break
  fi
  sleep 10
done

If the instance fails to start within 5 minutes, destroy it and try the next offer.

Once running, get SSH connection info:

vastai ssh-url <ID>

Parse the SSH URL to extract host and port. The format is ssh://root@<HOST>:<PORT>.
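
A parsing sketch, assuming the URL matches that format exactly:

SSH_URL=$(vastai ssh-url $INSTANCE_ID)   # e.g. ssh://root@203.0.113.7:41234
HOST=${SSH_URL#ssh://root@}              # strip scheme and user...
HOST=${HOST%:*}                          # ...then the trailing :PORT
PORT=${SSH_URL##*:}                      # everything after the last colon
echo "host=$HOST port=$PORT"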

Wait an additional 15-20 seconds after “running” status for SSH to become available, then test connectivity:

ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -o ConnectTimeout=10 -p <PORT> root@<HOST> 'echo connected'
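
If that first test fails, sshd may still be starting inside the container; a retry sketch consistent with the Error Recovery rules below:

# Retry the connectivity test up to 3 times, 30 seconds apart
for i in 1 2 3; do
  ssh -i "$SSH_KEY" -o StrictHostKeyChecking=no -o ConnectTimeout=10 \
      -p "$PORT" root@"$HOST" 'echo connected' && break
  echo "SSH not ready (attempt $i); waiting 30s..."
  sleep 30
done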

Phase 6: Upload Files (if needed)

For anything beyond a single file, always tar+gzip before transferring to minimize transfer time. Never scp files one at a time.

Single file:

scp -i <SSH_KEY> -o StrictHostKeyChecking=no -P <PORT> <LOCAL_FILE> root@<HOST>:/root/

Multiple files or directories — tar+gzip and stream directly (no temp file):

tar czf - -C <LOCAL_BASE_DIR> <PATHS...> | \
  ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
  'tar xzf - -C /root/'

Examples:

# Upload a directory
tar czf - -C /home/user my_project/ | \
  ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'tar xzf - -C /root/'

# Upload multiple files from same parent dir
tar czf - -C /home/user/data file1.csv file2.csv model.bin | \
  ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'tar xzf - -C /root/data/'

# Upload files from different locations — create a staging tarball first
tar czf /tmp/upload.tar.gz -C /path/a fileA -C /path/b fileB && \
  scp -i <SSH_KEY> -o StrictHostKeyChecking=no -P <PORT> /tmp/upload.tar.gz root@<HOST>:/root/ && \
  ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'cd /root && tar xzf upload.tar.gz && rm upload.tar.gz'

Phase 7: Run the Job

Execute the job via SSH. For long-running jobs, use nohup or tmux and redirect output:

# For short jobs (< 10 min), run directly:
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> '<COMMAND>'

# For long jobs, use nohup so it survives SSH disconnect:
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
  'nohup bash -c "<COMMAND>" > /root/job_output.log 2>&1 & echo $!'

Capture the PID if running in background.
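
A launch-and-capture sketch (the training command is illustrative):

# `echo $!` prints the background job's PID, which comes back as the ssh output
JOB_PID=$(ssh -i "$SSH_KEY" -o StrictHostKeyChecking=no -p "$PORT" root@"$HOST" \
  'nohup bash -c "python /root/train.py" > /root/job_output.log 2>&1 & echo $!')
echo "Remote PID: $JOB_PID"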


Phase 8: Monitor Progress

For background jobs, poll for completion:

# Check if process is still running
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'kill -0 <PID> 2>/dev/null && echo running || echo done'

# Check latest output
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'tail -20 /root/job_output.log'

Also check instance logs periodically:

vastai logs <ID> --tail 50

Report progress to the user periodically. If the job appears stuck or erroring, alert the user.

Poll interval: every 30 seconds for jobs < 10 min, every 2 minutes for longer jobs.
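
A monitoring-loop sketch for a long job (2-minute interval), assuming JOB_PID was captured in Phase 7:

while true; do
  STATE=$(ssh -i "$SSH_KEY" -o StrictHostKeyChecking=no -p "$PORT" root@"$HOST" \
    "kill -0 $JOB_PID 2>/dev/null && echo running || echo done")
  [ "$STATE" = "done" ] && { echo "Job finished"; break; }
  # An empty $STATE means the ssh call itself failed; just loop and retry
  ssh -i "$SSH_KEY" -o StrictHostKeyChecking=no -p "$PORT" root@"$HOST" 'tail -5 /root/job_output.log'
  sleep 120
done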


Phase 9: Retrieve Results (if needed)

Always tar+gzip results on the remote side and stream back. Never scp -r directories.

Single file:

scp -i <SSH_KEY> -o StrictHostKeyChecking=no -P <PORT> root@<HOST>:<REMOTE_FILE> <LOCAL_PATH>

Directory or multiple outputs — tar+gzip and stream directly:

ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
  'tar czf - -C /root <RESULT_PATHS...>' | \
  tar xzf - -C <LOCAL_DEST>

Examples:

# Download an output directory
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
  'tar czf - -C /root output/' | tar xzf - -C ./

# Download multiple result paths
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
  'tar czf - -C /root output/ checkpoints/ job_output.log' | tar xzf - -C ./results/

# Download and save as a single archive (preserves everything)
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
  'tar czf - -C /root output/' > results.tar.gz

Common result locations to check:

  • Model checkpoints: /root/output/, /root/checkpoints/
  • Logs: /root/job_output.log
  • Whatever the user specified

Phase 10: Cleanup

Always destroy the instance when done (or on failure):

vastai destroy instance <ID>

Verify destruction (the command below should print nothing once the instance is gone):

vastai show instances --raw | jq '.[] | select(.id == <ID>)'
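
If the whole workflow runs as one shell script, a trap gives a hedged safety net so cleanup happens even if the script dies early:

# Destroy the instance on any script exit, normal or not
trap '[ -n "$INSTANCE_ID" ] && vastai destroy instance "$INSTANCE_ID"' EXIT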

Phase 11: Report

Provide the user with a summary:

  • GPU used and cost per hour
  • Total runtime and estimated cost
  • Job output / results location
  • Any errors encountered
  • Confirmation that the instance was destroyed

Error Recovery

  • Instance won’t start: Destroy it, pick the next cheapest offer, retry (up to 3 attempts; see the sketch after this list)
  • SSH won’t connect: Wait 30 more seconds, retry 3 times, then check instance status
  • Job fails: Show the user the error output, ask if they want to debug (keep instance) or abort (destroy)
  • Network error during monitoring: Retry SSH connection, check instance is still running
  • Any unrecoverable error: ALWAYS destroy the instance to stop billing, then report
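
A compact sketch of the launch-retry policy from the first bullet, reusing $OFFERS from Phase 3 (the image is illustrative):

# Try up to the 3 cheapest offers before giving up
for attempt in 0 1 2; do
  OFFER_ID=$(echo "$OFFERS" | jq -r ".[$attempt].id")
  INSTANCE_ID=$(vastai create instance "$OFFER_ID" --image pytorch/pytorch \
    --disk 64 --ssh --raw | jq -r '.new_contract')
  # ...poll for "running" as in Phase 5; break out of the loop on success...
  # on failure, stop billing before moving to the next offer:
  vastai destroy instance "$INSTANCE_ID"
done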

Safety Rules

  1. Always confirm cost with the user before creating an instance
  2. Always destroy the instance when done or on failure — never leave it running
  3. Never run --force unless the user explicitly asks
  4. Track the instance ID throughout — you need it for cleanup
  5. If you lose track of the instance ID, run vastai show instances to find it by label