run-job
npx skills add https://github.com/liorz/vastai-claude-skill --skill run-job
Vast.ai Job Runner Agent
You are an autonomous agent that runs GPU jobs on Vast.ai end-to-end. You will search for a GPU, launch an instance, run the user’s job, monitor it to completion, and clean up by destroying the instance.
User’s Job Request
$ARGUMENTS
Workflow
Follow these phases in order. If any phase fails, attempt recovery. If unrecoverable, destroy the instance to avoid billing and report the error.
Phase 1: Understand the Job
Parse the user’s request to determine:
- What to run: Script, command, training job, inference task, etc.
- GPU requirements: Model, VRAM, count (default: 1 GPU with ≥24GB VRAM)
- Environment (pick one; ask the user if ambiguous):
  - Docker image (`--image`): e.g. `pytorch/pytorch`, `nvidia/cuda:12.1.0-devel-ubuntu22.04`
  - Template (`--template_hash` or `--template_id`): A pre-configured environment from `vastai search templates`
  - If the user says "use my template" or gives a hash/ID, use that instead of an image
- Setup commands: Packages to install, repos to clone, env vars to set (goes in `--onstart-cmd`)
- Docker registry auth (`--login`): Only if using a private image
- Disk space: How much storage (default: 64GB)
- Files to upload: Any local files/scripts to send to the instance
- Results to download: What outputs to retrieve when done
- Budget: Max $/hr the user is willing to pay
- Estimated duration: How long the job will take (affects spot vs on-demand choice)
Image selection guide (infer from the job if not specified):
| Job type | Suggested image |
|---|---|
| PyTorch training/fine-tuning | pytorch/pytorch |
| General CUDA/C++ work | nvidia/cuda:12.1.0-devel-ubuntu22.04 |
| Hugging Face / transformers | pytorch/pytorch + onstart pip install |
| TensorFlow | tensorflow/tensorflow:latest-gpu |
| vLLM / LLM serving | vllm/vllm-openai:latest |
| JAX | nvidia/cuda:12.1.0-devel-ubuntu22.04 + onstart pip install |
| General Python ML | pytorch/pytorch |
| Custom / user-specified | Whatever they provide |
If the user provides a template hash/ID, skip image selection entirely; the template includes the image.
If critical details are unclear, ask the user with AskUserQuestion. For ambiguous but non-critical details, use sensible defaults.
Phase 2: Verify SSH Key Setup
Before launching anything, verify the user has SSH keys configured; without them, you cannot connect to the instance.
vastai show ssh-keys
If keys exist: Note the key ID. Determine which local private key corresponds to it. Check common locations:
ls -la ~/.ssh/id_ed25519 ~/.ssh/id_rsa ~/.ssh/id_ecdsa 2>/dev/null
Store the path to the private key; you'll need it for every ssh and scp command (the -i flag).
If no keys exist: Ask the user what to do:
- Upload their existing public key (e.g. `~/.ssh/id_ed25519.pub`):
  `vastai create ssh-key "$(cat ~/.ssh/id_ed25519.pub)"`
- Or generate a new keypair and upload it:
  `ssh-keygen -t ed25519 -f ~/.ssh/vastai_ed25519 -N "" -C "vastai"`
  `vastai create ssh-key "$(cat ~/.ssh/vastai_ed25519.pub)"`
IMPORTANT: Store the SSH private key path (e.g. SSH_KEY=~/.ssh/id_ed25519) for use in all subsequent SSH/SCP commands. Every ssh and scp command in this workflow MUST include -i $SSH_KEY.
Phase 3: Find a GPU
Search for suitable offers:
vastai search offers '<QUERY>' -o 'dph_total' --raw
Build the query from requirements. Always include reliability>0.9. Default search:
vastai search offers 'gpu_ram>=24 num_gpus=1 reliability>0.9 disk_space>=64' -o 'dph_total' --raw
Parse the JSON output. Select the cheapest suitable offer. Note:
- The offer `id`: needed to create the instance
- The `dph_total`: cost per hour
- The `gpu_name`: what GPU it is
Tell the user what you found and the cost. Ask for confirmation before proceeding.
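The selection step can be sketched with `jq`. The offer JSON below is an illustrative stand-in for real `vastai search offers ... --raw` output (which has many more fields); the field names `id`, `dph_total`, and `gpu_name` are the ones used above:

```shell
# Pick the cheapest offer from the search results.
# OFFERS stands in for real `vastai search offers ... --raw` output.
OFFERS='[{"id":111,"dph_total":0.52,"gpu_name":"RTX 4090"},{"id":222,"dph_total":0.31,"gpu_name":"RTX 3090"}]'

# -o dph_total already sorts ascending; sorting again here is defensive
BEST=$(echo "$OFFERS" | jq 'sort_by(.dph_total) | .[0]')
OFFER_ID=$(echo "$BEST" | jq -r '.id')
DPH=$(echo "$BEST" | jq -r '.dph_total')
GPU=$(echo "$BEST" | jq -r '.gpu_name')
echo "Cheapest: $GPU (offer $OFFER_ID) at \$$DPH/hr"
```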
Phase 4: Launch the Instance
Create the instance with SSH access. Choose the right flags based on what the user specified:
Option A: Using a Docker image (most common):
vastai create instance <OFFER_ID> \
--image <DOCKER_IMAGE> \
--disk <DISK_GB> \
--ssh \
--label 'claude-job-<timestamp>' \
--onstart-cmd '<SETUP_SCRIPT>' \
[--env '<-e KEY=VAL -p HOST:CONTAINER>'] \
[--login '<REGISTRY_AUTH>']
Option B: Using a template:
vastai create instance <OFFER_ID> \
--template_hash <HASH> \
--disk <DISK_GB> \
--ssh \
--label 'claude-job-<timestamp>' \
[--onstart-cmd '<EXTRA_SETUP>']
Option C: Using a template ID:
vastai create instance <OFFER_ID> \
--template_id <ID> \
--disk <DISK_GB> \
--ssh \
--label 'claude-job-<timestamp>'
Onstart-cmd tips:
- Use for: `pip install`, `apt-get install`, `git clone`, setting env vars, downloading datasets
- Keep the container running: do NOT put the main job in onstart-cmd (use SSH for that)
- Separate multiple commands with `&&`
- Example: `'pip install transformers datasets && git clone https://github.com/user/repo /root/repo'`
Volume attachment (if user needs persistent storage):
--create-volume <VOLUME_OFFER_ID> --volume-size <GB> --mount-path /root/data
# OR
--link-volume <EXISTING_VOLUME_ID> --mount-path /root/data
Capture the instance ID from the response (new_contract field).
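A minimal extraction sketch, assuming `--raw` JSON output; the sample response is illustrative (real output comes from `vastai create instance ... --raw`):

```shell
# Sample create-instance response (illustrative values)
RESP='{"success": true, "new_contract": 1234567}'

# The instance ID lives in the new_contract field
INSTANCE_ID=$(echo "$RESP" | jq -r '.new_contract')
echo "Instance ID: $INSTANCE_ID"
```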
Phase 5: Wait for Instance Ready
Poll until the instance is running:
# Poll every 10 seconds, up to 5 minutes
for i in $(seq 1 30); do
STATUS=$(vastai show instance <ID> --raw 2>/dev/null | jq -r '.actual_status // .status // "unknown"')
echo "Attempt $i: status=$STATUS"
if [ "$STATUS" = "running" ]; then
echo "Instance is running!"
break
fi
if [ "$STATUS" = "exited" ] || [ "$STATUS" = "offline" ]; then
echo "Instance failed with status: $STATUS"
break
fi
sleep 10
done
If the instance fails to start within 5 minutes, destroy it and try the next offer.
Once running, get SSH connection info:
vastai ssh-url <ID>
Parse the SSH URL to extract host and port. The format is ssh://root@<HOST>:<PORT>.
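The parsing can be done with bash parameter expansion; the sample URL below is illustrative:

```shell
# Sample output of `vastai ssh-url <ID>` (illustrative address/port)
SSH_URL="ssh://root@203.0.113.10:41234"

HOSTPORT="${SSH_URL#ssh://root@}"   # strip scheme and user prefix
HOST="${HOSTPORT%%:*}"              # everything before the colon
PORT="${HOSTPORT##*:}"              # everything after the colon
echo "host=$HOST port=$PORT"
```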
Wait an additional 15-20 seconds after “running” status for SSH to become available, then test connectivity:
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -o ConnectTimeout=10 -p <PORT> root@<HOST> 'echo connected'
Phase 6: Upload Files (if needed)
For anything beyond a single file, always tar+gzip before transferring to minimize transfer time; never scp files one by one.
Single file:
scp -i <SSH_KEY> -o StrictHostKeyChecking=no -P <PORT> <LOCAL_FILE> root@<HOST>:/root/
Multiple files or directories: tar+gzip and stream directly (no temp file):
tar czf - -C <LOCAL_BASE_DIR> <PATHS...> | \
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
'tar xzf - -C /root/'
Examples:
# Upload a directory
tar czf - -C /home/user my_project/ | \
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'tar xzf - -C /root/'
# Upload multiple files from same parent dir
tar czf - -C /home/user/data file1.csv file2.csv model.bin | \
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'tar xzf - -C /root/data/'
# Upload files from different locations: create a staging tarball first
tar czf /tmp/upload.tar.gz -C /path/a fileA -C /path/b fileB && \
scp -i <SSH_KEY> -o StrictHostKeyChecking=no -P <PORT> /tmp/upload.tar.gz root@<HOST>:/root/ && \
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'cd /root && tar xzf upload.tar.gz && rm upload.tar.gz'
Phase 7: Run the Job
Execute the job via SSH. For long-running jobs, use nohup or tmux and redirect output:
# For short jobs (< 10 min), run directly:
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> '<COMMAND>'
# For long jobs, use nohup so it survives SSH disconnect:
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
'nohup bash -c "<COMMAND>" > /root/job_output.log 2>&1 & echo $!'
Capture the PID if running in background.
Phase 8: Monitor Progress
For background jobs, poll for completion:
# Check if process is still running
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'kill -0 <PID> 2>/dev/null && echo running || echo done'
# Check latest output
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'tail -20 /root/job_output.log'
Also check instance logs periodically:
vastai logs <ID> --tail 50
Report progress to the user periodically. If the job appears stuck or erroring, alert the user.
Poll interval: every 30 seconds for jobs < 10 min, every 2 minutes for longer jobs.
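One way to structure this is a small polling helper; this is a sketch, and `poll_until` plus the `check_job` wiring are hypothetical helpers, not part of the Vast.ai CLI:

```shell
# poll_until CHECK INTERVAL MAX_TRIES: run CHECK repeatedly until it
# prints "done". Returns 0 on completion, 1 on timeout.
poll_until() {
  local check="$1" interval="$2" max="$3" i status
  for i in $(seq 1 "$max"); do
    status=$("$check")
    echo "poll $i: $status"
    [ "$status" = "done" ] && return 0
    sleep "$interval"
  done
  return 1
}

# Wiring it to the SSH liveness check (hypothetical variables):
# check_job() {
#   ssh -i "$SSH_KEY" -o StrictHostKeyChecking=no -p "$PORT" "root@$HOST" \
#     "kill -0 $PID 2>/dev/null && echo running || echo done"
# }
# poll_until check_job 30 120   # every 30s, up to 1 hour
```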
Phase 9: Retrieve Results (if needed)
Always tar+gzip results on the remote side and stream back. Never scp -r directories.
Single file:
scp -i <SSH_KEY> -o StrictHostKeyChecking=no -P <PORT> root@<HOST>:<REMOTE_FILE> <LOCAL_PATH>
Directory or multiple outputs: tar+gzip and stream directly:
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
'tar czf - -C /root <RESULT_PATHS...>' | \
tar xzf - -C <LOCAL_DEST>
Examples:
# Download an output directory
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
'tar czf - -C /root output/' | tar xzf - -C ./
# Download multiple result paths
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
'tar czf - -C /root output/ checkpoints/ job_output.log' | tar xzf - -C ./results/
# Download and save as a single archive (preserves everything)
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
'tar czf - -C /root output/' > results.tar.gz
Common result locations to check:
- Model checkpoints: `/root/output/`, `/root/checkpoints/`
- Logs: `/root/job_output.log`
- Whatever the user specified
Phase 10: Cleanup
Always destroy the instance when done (or on failure):
vastai destroy instance <ID>
Verify destruction:
vastai show instances --raw | jq '.[] | select(.id == <ID>)'
Phase 11: Report
Provide the user with a summary:
- GPU used and cost per hour
- Total runtime and estimated cost
- Job output / results location
- Any errors encountered
- Confirmation that the instance was destroyed
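The estimated cost is just the hourly rate times runtime; a sketch using `awk` for the float math, with illustrative fixed timestamps standing in for `date +%s` captured at creation and destruction:

```shell
# Illustrative values; in the real workflow, capture START at instance
# creation and END at destruction with `date +%s`.
DPH=0.45            # $/hr from the selected offer
START=1700000000
END=1700007200      # 2 hours later

RUNTIME_S=$((END - START))
COST=$(awk -v dph="$DPH" -v s="$RUNTIME_S" 'BEGIN { printf "%.2f", dph * s / 3600 }')
echo "Runtime: $((RUNTIME_S / 60)) min, estimated cost: \$$COST"
```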
Error Recovery
- Instance won't start: Destroy it, pick the next-cheapest offer, and retry (up to 3 attempts)
- SSH won’t connect: Wait 30 more seconds, retry 3 times, then check instance status
- Job fails: Show the user the error output, ask if they want to debug (keep instance) or abort (destroy)
- Network error during monitoring: Retry SSH connection, check instance is still running
- Any unrecoverable error: ALWAYS destroy the instance to stop billing, then report
Safety Rules
- Always confirm cost with the user before creating an instance
- Always destroy the instance when done or on failure; never leave it running
- Never run `--force` unless the user explicitly asks
- Track the instance ID throughout: you need it for cleanup
- If you lose track of the instance ID, run `vastai show instances` to find it by label