desktop control

📁 patrickporto/desktop-agent 📅 Jan 1, 1970

Install command

npx skills add https://github.com/patrickporto/desktop-agent --skill "Desktop Control"

Skill documentation

Desktop Control Skill

This skill provides comprehensive desktop automation capabilities through PyAutoGUI, allowing AI agents to control the mouse, keyboard, take screenshots, and interact with the desktop environment.

How to Use This Skill

As an AI agent, you can invoke desktop automation commands using the uvx desktop-agent CLI.

Command Structure

All commands follow this pattern:

uvx desktop-agent <category> <command> [arguments] [options]

Categories:

mouse – Mouse control
keyboard – Keyboard input
screen – Screenshots and screen analysis
message – User dialogs
app – Application control (open, focus, list windows)
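The pattern above maps directly onto an argument list. The following is a minimal Python sketch, not part of desktop-agent: `build_command` is a hypothetical helper name, and it assumes that keyword options correspond one-to-one to `--name value` flags (with boolean `True` producing a bare flag such as `--active`).

```python
def build_command(category, command, *args, **options):
    """Build an argv list for `uvx desktop-agent <category> <command> ...`.

    Hypothetical helper: positional arguments are appended in order and
    keyword options become `--name value` flags (assumed mapping).
    """
    argv = ["uvx", "desktop-agent", category, command]
    argv.extend(str(arg) for arg in args)
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            # Boolean switches such as --active take no value
            argv.append(flag)
        elif value is not None and value is not False:
            argv.extend([flag, str(value)])
    return argv
```

Building a list instead of a single string avoids shell-quoting problems when text or window titles contain spaces.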

Available Commands

Mouse Control (`mouse`)

Control cursor movement and clicks.

# Move cursor to coordinates
uvx desktop-agent mouse move <x> <y> [--duration SECONDS]

# Click at current position or specific coordinates
uvx desktop-agent mouse click [x] [y] [--button left|right|middle] [--clicks N]

# Specialized clicks
uvx desktop-agent mouse double-click [x] [y]
uvx desktop-agent mouse right-click [x] [y]
uvx desktop-agent mouse middle-click [x] [y]

# Drag to coordinates
uvx desktop-agent mouse drag <x> <y> [--duration SECONDS] [--button BUTTON]

# Scroll (positive=up, negative=down)
uvx desktop-agent mouse scroll <clicks> [x] [y]

# Get current mouse position
uvx desktop-agent mouse position

Examples:

# Move to center of 1920x1080 screen
uvx desktop-agent mouse move 960 540 --duration 0.5

# Right-click at specific location
uvx desktop-agent mouse right-click 500 300

# Scroll down 5 clicks
uvx desktop-agent mouse scroll -5

Keyboard Control (`keyboard`)

Type text and execute keyboard shortcuts.

# Type text
uvx desktop-agent keyboard write "<text>" [--interval SECONDS]

# Press keys
uvx desktop-agent keyboard press <key> [--presses N] [--interval SECONDS]

# Execute hotkey combination (comma-separated)
uvx desktop-agent keyboard hotkey "<key1>,<key2>,..."

# Hold/release keys
uvx desktop-agent keyboard keydown <key>
uvx desktop-agent keyboard keyup <key>

Examples:

# Type text with natural delay
uvx desktop-agent keyboard write "Hello World" --interval 0.05

# Copy selected text
uvx desktop-agent keyboard hotkey "ctrl,c"

# Open Task Manager
uvx desktop-agent keyboard hotkey "ctrl,shift,esc"

# Press Enter 3 times
uvx desktop-agent keyboard press enter --presses 3

Common Key Names:

Modifiers: ctrl, shift, alt, win (Windows), command (macOS)
Special: enter, tab, esc, space, backspace, delete
Function: f1 through f12
Arrows: up, down, left, right

Screen & Screenshots (`screen`)

Capture screenshots and analyze screen content. Supports targeting specific windows.

# Take screenshot
uvx desktop-agent screen screenshot <filename> [--region "x,y,width,height"] [--window <title>] [--active]

# Locate image on screen or within window
uvx desktop-agent screen locate <image_path> [--confidence 0.0-1.0] [--window <title>] [--active]
uvx desktop-agent screen locate-center <image_path> [--confidence 0.0-1.0] [--window <title>] [--active]

# Locate text using OCR within window
uvx desktop-agent screen locate-text-coordinates <text> [--window <title>] [--active]
uvx desktop-agent screen read-all-text [--window <title>] [--active]

# Utility commands
uvx desktop-agent screen pixel <x> <y>
uvx desktop-agent screen size
uvx desktop-agent screen on-screen <x> <y>

Examples:

# Screenshot of active window
uvx desktop-agent screen screenshot active.png --active

# Screenshot of a specific application
uvx desktop-agent screen screenshot chrome.png --window "Google Chrome"

# Locate image within Notepad
uvx desktop-agent screen locate-center button.png --window "Notepad"

Message Dialogs (`message`)

Display user interaction dialogs.

# Show alert
uvx desktop-agent message alert "<text>" [--title TITLE] [--button BUTTON]

# Show confirmation dialog
uvx desktop-agent message confirm "<text>" [--title TITLE] [--buttons "OK,Cancel"]

# Prompt for input
uvx desktop-agent message prompt "<text>" [--title TITLE] [--default TEXT]

# Password input
uvx desktop-agent message password "<text>" [--title TITLE] [--mask CHAR]

Examples:

# Simple alert
uvx desktop-agent message alert "Task completed!"

# Get user confirmation
uvx desktop-agent message confirm "Continue with operation?"

# Ask for user input
uvx desktop-agent message prompt "Enter your name:"

Application Control (`app`)

Control applications across Windows, macOS, and Linux.

# Open an application by name
uvx desktop-agent app open <name> [--arg ARGS...]

# Focus on a window by title/name
uvx desktop-agent app focus <name>

# List all visible windows
uvx desktop-agent app list

Examples:

# Windows: Open Notepad
uvx desktop-agent app open notepad

# Windows: Open Chrome with a URL
uvx desktop-agent app open "chrome" --arg "https://google.com"

# macOS: Open Safari
uvx desktop-agent app open "Safari"

# Focus on a specific window
uvx desktop-agent app focus "Untitled - Notepad"

# List all open windows
uvx desktop-agent app list

Common Automation Workflows

Workflow 1: Open Application and Type

# Open Notepad (Windows; use an equivalent editor such as TextEdit or gedit elsewhere)
uvx desktop-agent app open notepad

# Wait for app to open, then focus it
uvx desktop-agent app focus notepad

# Type some text
uvx desktop-agent keyboard write "Hello from Desktop Skill!"

Workflow 2: Screenshot + Analysis

# Get screen size first
uvx desktop-agent screen size

# Take full screenshot
uvx desktop-agent screen screenshot current_screen.png

# Check if specific UI element is visible
uvx desktop-agent screen locate save_button.png

Workflow 3: Form Filling

# Click first field
uvx desktop-agent mouse click 300 200

# Fill field
uvx desktop-agent keyboard write "John Doe"

# Tab to next field
uvx desktop-agent keyboard press tab

# Fill second field
uvx desktop-agent keyboard write "john@example.com"

# Submit form (Enter)
uvx desktop-agent keyboard press enter

Workflow 4: Copy/Paste Operations

# Select all text
uvx desktop-agent keyboard hotkey "ctrl,a"

# Copy
uvx desktop-agent keyboard hotkey "ctrl,c"

# Click destination
uvx desktop-agent mouse click 500 600

# Paste
uvx desktop-agent keyboard hotkey "ctrl,v"

Safety Considerations

When using this skill, AI agents should:

Verify coordinates: Use screen size and on-screen before clicking
Add delays: Insert appropriate delays between commands for UI responsiveness
Validate images: Ensure image files exist before using locate commands
Handle failures: Commands may fail if windows change or elements move
User safety: Always confirm destructive actions with user via message confirm
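The "verify coordinates" rule can also be applied locally once the screen size is known. A minimal Python sketch, assuming `size` is the `size` object returned by `screen size` (a dict with `width` and `height`); `in_bounds` is a hypothetical helper name, and `screen on-screen` remains the tool's own check.

```python
def in_bounds(x, y, size):
    """Return True if (x, y) lies inside a screen of the given size.

    `size` is assumed to look like {"width": 1920, "height": 1080}.
    Valid pixel coordinates run from 0 to width-1 and 0 to height-1.
    """
    return 0 <= x < size["width"] and 0 <= y < size["height"]
```

An agent could skip a click and report an error when this returns False, rather than sending coordinates that would fail with `coordinates_out_of_bounds`.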

Troubleshooting

PyAutoGUI Fail-Safe

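PyAutoGUI has a built-in fail-safe: moving the mouse to a corner of the screen aborts the running operation. This is a deliberate safety feature, not a bug, and it gives the user a way to stop a runaway automation.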

Image not found

When screen locate cannot find an image, check the following:

The image file exists and the path is correct
The --confidence value is not too strict (try 0.7-0.9)
The image matches the on-screen appearance exactly (resolution, colors)

Getting Help

# Show all available commands
uvx desktop-agent --help

# Show commands for specific category
uvx desktop-agent mouse --help
uvx desktop-agent keyboard --help
uvx desktop-agent screen --help
uvx desktop-agent message --help

# Show help for specific command
uvx desktop-agent mouse move --help

Integration Tips for AI Agents

Always check screen size first when working with absolute coordinates
Use relative positioning when possible (e.g., get current position, calculate offset)
Combine commands for complex workflows
Validate before executing (e.g., check if image exists on screen)
Provide user feedback using message dialogs for important operations
Handle errors gracefully – commands may fail if UI state changes
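Relative positioning, as suggested in the tips above, can be sketched as a small calculation. This assumes `position` is the `position` object from `mouse position` and `size` is the `size` object from `screen size`; `offset_target` is a hypothetical helper name, and clamping to the screen edge is a design choice of this sketch rather than documented tool behavior.

```python
def offset_target(position, dx, dy, size):
    """Return (x, y) offset from the current position, clamped to the screen.

    position: {"x": ..., "y": ...} as reported by `mouse position`
    size:     {"width": ..., "height": ...} as reported by `screen size`
    """
    x = min(max(position["x"] + dx, 0), size["width"] - 1)
    y = min(max(position["y"] + dy, 0), size["height"] - 1)
    return x, y
```

The resulting pair can then be passed to `mouse move` or `mouse click`.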

Performance Notes

Mouse movements with --duration are animated and take time
Image location (locate) can be slow on large screens – use regions when possible
Keyboard commands are generally fast (< 100ms)
Screenshots depend on screen resolution and region size

Output Format

All commands output structured JSON by default, ideal for programmatic use by AI agents:

uvx desktop-agent mouse position
# Output: {"success": true, "command": "mouse.position", "timestamp": "2026-01-31T10:00:00Z", "duration_ms": 5, "data": {"position": {"x": 960, "y": 540}}}
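Because the output is JSON, a caller can run a command and decode the result in one step. The sketch below assumes the JSON document is the only content written to standard output; `run_desktop_agent` is a hypothetical wrapper name, and the `runner` parameter exists only so the function can be exercised without the CLI installed.

```python
import json
import subprocess


def run_desktop_agent(argv, runner=subprocess.run):
    """Run a desktop-agent command and return its decoded JSON response.

    argv:   full argument list, e.g. ["uvx", "desktop-agent", "mouse", "position"]
    runner: callable with the same interface as subprocess.run (injectable for tests)
    """
    completed = runner(argv, capture_output=True, text=True)
    # Assumption: stdout holds exactly one JSON document
    return json.loads(completed.stdout)
```

A typical call would be `run_desktop_agent(["uvx", "desktop-agent", "mouse", "position"])`, after which `response["data"]["position"]` holds the coordinates.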

Response Schema

All JSON responses follow this schema:

{
  "success": true,
  "command": "category.command",
  "timestamp": "2026-01-31T10:00:00Z",
  "duration_ms": 150,
  "data": { ... },
  "error": null
}

Error Response Schema

{
  "success": false,
  "command": "category.command",
  "timestamp": "2026-01-31T10:00:00Z",
  "duration_ms": 50,
  "data": null,
  "error": {
    "code": "image_not_found",
    "message": "Image file 'button.png' not found",
    "details": {},
    "recoverable": true
  }
}
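The two schemas above suggest a single unwrapping step: return `data` on success, raise on failure. A minimal Python sketch under that reading; `DesktopAgentError` and `unwrap` are hypothetical names introduced for illustration, not part of the tool.

```python
import json


class DesktopAgentError(Exception):
    """Raised when a response has success=false (hypothetical wrapper type)."""

    def __init__(self, code, message, recoverable):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.recoverable = recoverable


def unwrap(raw):
    """Decode a raw JSON response and return its `data` payload.

    Raises DesktopAgentError carrying the `code` and `recoverable`
    fields from the error object when the command failed.
    """
    response = json.loads(raw)
    if response.get("success"):
        return response.get("data")
    error = response.get("error") or {}
    raise DesktopAgentError(
        error.get("code", "unknown_error"),
        error.get("message", ""),
        bool(error.get("recoverable", False)),
    )
```

Callers can then branch on `exc.recoverable` to decide whether a retry is worthwhile.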

Error Codes

Code	Description
`success`	Command succeeded
`invalid_argument`	Invalid command arguments
`coordinates_out_of_bounds`	Coordinates outside screen
`image_not_found`	Image file not found or not on screen
`window_not_found`	Target window not found
`ocr_failed`	OCR operation failed
`application_not_found`	Application not found
`permission_denied`	Permission denied
`platform_not_supported`	Platform not supported
`timeout`	Operation timed out
`unknown_error`	Unknown error

Mouse move:

uvx desktop-agent mouse move 960 540

{"success": true, "command": "mouse.move", "timestamp": "...", "duration_ms": 150, "data": {"x": 960, "y": 540, "duration": 0}, "error": null}

Screen size:

uvx desktop-agent screen size

{"success": true, "command": "screen.size", "timestamp": "...", "duration_ms": 5, "data": {"size": {"width": 1920, "height": 1080}}, "error": null}

Locate image:

uvx desktop-agent screen locate button.png

{"success": true, "command": "screen.locate", "timestamp": "...", "duration_ms": 250, "data": {"image_found": true, "bounding_box": {"left": 100, "top": 200, "width": 50, "height": 30, "center_x": 125, "center_y": 215}}, "error": null}
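The click target can be read from this response. A short Python sketch, assuming the `data.bounding_box` layout shown in the example above; `center_of` is a hypothetical helper name.

```python
import json


def center_of(raw):
    """Return (center_x, center_y) from a `screen locate` response, or None.

    None is returned when the command failed or the image was not found.
    """
    response = json.loads(raw)
    if not response.get("success"):
        return None
    data = response.get("data") or {}
    if not data.get("image_found"):
        return None
    box = data["bounding_box"]
    return box["center_x"], box["center_y"]
```

For the response above, this yields the pair 125, 215, which can be passed to `mouse click`.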

List windows:

uvx desktop-agent app list

{"success": true, "command": "app.list", "timestamp": "...", "duration_ms": 100, "data": {"windows": ["Untitled - Notepad", "Google Chrome", "Visual Studio Code"]}, "error": null}

Error example:

uvx desktop-agent screen locate missing.png

{"success": false, "command": "screen.locate", "timestamp": "...", "duration_ms": 50, "data": null, "error": {"code": "image_not_found", "message": "Image file 'missing.png' not found", "details": {}, "recoverable": true}}

Effective Usage Guide for AI Agents

This section teaches AI agents how to use this skill effectively with optimal command sequences and best practices.

Core Strategy: Observe First, Then Act

Always understand the current state before performing actions. This avoids clicking wrong coordinates or typing in the wrong window.

Recommended Initial Sequence:

# 1. Get screen dimensions to understand your workspace
uvx desktop-agent screen size
# 2. List open windows to see what is running
uvx desktop-agent app list
# 3. Check the current mouse position
uvx desktop-agent mouse position

Recommended Command Sequences by Task

Open and Interact with Application

# CORRECT: Open, wait, verify, then interact
uvx desktop-agent app open notepad              # Step 1: Open app
uvx desktop-agent app list                      # Step 2: Verify the window appeared
uvx desktop-agent app focus "Notepad"           # Step 3: Focus the window
uvx desktop-agent keyboard write "Hello World"  # Step 4: Now safe to type

# WRONG: Type immediately without verification
uvx desktop-agent app open notepad
uvx desktop-agent keyboard write "Hello World"  # May type in wrong window!
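The "wait, verify" step in the correct sequence can be sketched as a polling loop. In this Python sketch, `wait_for_window` is a hypothetical helper, and `list_windows` is any callable returning the window titles from `app list`. The timeout and poll interval are illustrative defaults, and the case-insensitive substring match is an assumption of the sketch.

```python
import time


def wait_for_window(title_part, list_windows, timeout=10.0, poll=0.5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until a window whose title contains `title_part` appears.

    Returns the matching title, or None if `timeout` seconds pass first.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    needle = title_part.lower()
    while True:
        for title in list_windows():
            if needle in title.lower():
                return title
        if clock() >= deadline:
            return None
        sleep(poll)
```

Only after this returns a title would the agent call `app focus` and start typing.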

Find and Click UI Element (Image-Based)

# CORRECT: Locate first, click if found
uvx desktop-agent screen locate-center button.png --confidence 0.8
# Check if success=true and coordinates are valid
uvx desktop-agent mouse click 125 215  # Use returned coordinates

# WRONG: Click without verifying element exists
uvx desktop-agent mouse click 125 215  # Might click wrong area!

Find and Click UI Element (Text-Based with OCR)

# CORRECT: Read screen text, then locate specific text
uvx desktop-agent screen read-all-text --active
uvx desktop-agent screen locate-text-coordinates "Save" --active
# Use returned coordinates to click

# For window-specific OCR:
uvx desktop-agent screen locate-text-coordinates "OK" --window "Dialog Title"

Fill a Form with Multiple Fields

# CORRECT: Click each field explicitly before typing
uvx desktop-agent mouse click 300 200           # Click first field
uvx desktop-agent keyboard write "John Doe"
uvx desktop-agent mouse click 300 250           # Click second field (more reliable)
uvx desktop-agent keyboard write "john@example.com"
uvx desktop-agent mouse click 300 300           # Click third field
uvx desktop-agent keyboard write "555-1234"

# OR use Tab navigation (less reliable if field order changes)
uvx desktop-agent mouse click 300 200
uvx desktop-agent keyboard write "John Doe"
uvx desktop-agent keyboard press tab
uvx desktop-agent keyboard write "john@example.com"
uvx desktop-agent keyboard press tab
uvx desktop-agent keyboard write "555-1234"
uvx desktop-agent keyboard press enter          # Submit

Take Targeted Screenshots for Analysis

# CORRECT: Screenshot specific windows for faster processing
uvx desktop-agent app list                                  # Find exact window title (JSON is the default output)
uvx desktop-agent screen screenshot app.png --window "Google Chrome"

# For active window only
uvx desktop-agent screen screenshot active.png --active

# Full screen only when necessary (slower, larger file)
uvx desktop-agent screen size
uvx desktop-agent screen screenshot full.png

Safe Drag and Drop

# CORRECT: Move to start, verify position, then drag
uvx desktop-agent mouse move 100 200                 # Move to source
uvx desktop-agent mouse position              # Verify position
uvx desktop-agent mouse drag 500 400 --duration 0.5  # Drag to destination

# For precision, use slower duration
uvx desktop-agent mouse drag 500 400 --duration 1.0

Error Recovery Patterns

When Window Not Found

# Pattern: List windows, find closest match, retry
uvx desktop-agent app focus "Chrome"             # Fails with window_not_found
uvx desktop-agent app list                # See actual window titles
# Output shows: "Google Chrome - My Page"
uvx desktop-agent app focus "Google Chrome"      # Use correct title
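The "find closest match" step can be sketched in Python. This assumes `titles` is the `windows` list from `app list`; `best_window_match` is a hypothetical helper, and the strategy (case-insensitive substring first, then fuzzy matching via the standard library's `difflib`) is one reasonable choice rather than the tool's own matching logic.

```python
import difflib


def best_window_match(query, titles):
    """Return the window title that best matches `query`, or None.

    Tries a case-insensitive substring match first, then falls back to
    difflib's similarity ranking for near-miss spellings.
    """
    needle = query.lower()
    for title in titles:
        if needle in title.lower():
            return title
    close = difflib.get_close_matches(query, titles, n=1, cutoff=0.6)
    return close[0] if close else None
```

The returned title can be passed to `app focus` on the retry.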

When Image Not Found

# Pattern: Lower the confidence threshold, then capture the current state
uvx desktop-agent screen locate button.png --confidence 0.9   # Strict match fails
uvx desktop-agent screen locate button.png --confidence 0.7   # Retry with a looser match
# If still failing, capture current state for analysis
uvx desktop-agent screen screenshot current.png --active
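The retry pattern above generalizes to a descending list of thresholds. In this Python sketch, `locate_with_fallback` is a hypothetical helper and `locate` is any callable that runs `screen locate` with a given confidence and returns the decoded response. The specific thresholds are illustrative.

```python
def locate_with_fallback(locate, image, confidences=(0.9, 0.8, 0.7)):
    """Try progressively looser confidence values until one succeeds.

    locate: callable (image, confidence) -> decoded response dict
    Returns (confidence, response) for the first success, else (None, None).
    """
    for confidence in confidences:
        response = locate(image, confidence)
        if response.get("success"):
            return confidence, response
    return None, None
```

Starting strict and loosening gradually keeps the risk of false matches low while still recovering from small rendering differences.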

When Click Seems to Miss

# Pattern: Verify coordinates are on screen
uvx desktop-agent screen size             # Get screen bounds
uvx desktop-agent screen on-screen 1500 900      # Check if coords are valid
uvx desktop-agent mouse move 1500 900            # Move first to visualize
uvx desktop-agent mouse click                    # Then click at current position

Performance Optimization

Minimize Screenshots

# GOOD: Screenshot only the region you need
uvx desktop-agent screen screenshot button_area.png --region "100,200,200,100"

# GOOD: Screenshot specific window instead of full screen
uvx desktop-agent screen screenshot chrome.png --window "Google Chrome"

# SLOW: Full screen capture when you only need a small area
uvx desktop-agent screen screenshot full.png
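A `--region` value can be derived from a previously located bounding box. This Python sketch assumes the `left`, `top`, `width`, `height` keys shown in the `screen locate` output and the `"x,y,width,height"` format shown for `--region`; `region_arg` and its `pad` parameter are hypothetical.

```python
def region_arg(box, pad=0):
    """Format a bounding box as an "x,y,width,height" region string.

    `pad` extends the region by that many pixels on every side; the
    top-left corner is clamped so it never goes below zero.
    """
    left = max(box["left"] - pad, 0)
    top = max(box["top"] - pad, 0)
    # Width and height grow by the padding actually applied on each side
    width = box["width"] + (box["left"] - left) + pad
    height = box["height"] + (box["top"] - top) + pad
    return f"{left},{top},{width},{height}"
```

A little padding around the element usually gives enough context for later analysis without capturing the whole screen.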

Batch Keyboard Input

# FASTER: Write entire text at once
uvx desktop-agent keyboard write "This is a complete sentence with all the text."

# SLOWER: Multiple write commands
uvx desktop-agent keyboard write "This is "
uvx desktop-agent keyboard write "a complete "
uvx desktop-agent keyboard write "sentence."

Use Hotkeys Over Mouse When Possible

# FASTER: Use keyboard shortcuts
uvx desktop-agent keyboard hotkey "ctrl,s"       # Save
uvx desktop-agent keyboard hotkey "ctrl,a"       # Select all
uvx desktop-agent keyboard hotkey "ctrl,shift,s" # Save as

# SLOWER: Navigate menu with mouse
uvx desktop-agent mouse click 50 30              # Click File menu
uvx desktop-agent mouse click 60 80              # Click Save option

Defensive Programming Patterns

Always Verify Critical Actions

# Before destructive action, confirm with user
uvx desktop-agent message confirm "This will delete all files. Continue?" --title "Warning"
# Check output: if "Cancel" was clicked, abort operation

Use JSON Mode for Reliable Parsing

# RELIABLE: Parse structured JSON output
uvx desktop-agent screen locate button.png
# Parse: {"success": true, "data": {"bounding_box": {"center_x": 125, "center_y": 215}}}

# FRAGILE: Parse text output
uvx desktop-agent screen locate button.png
# Parse: "Found at: Box(left=100, top=200, width=50, height=30)"

Validate Before Multi-Step Operations

# Multi-step file operation with validation
uvx desktop-agent app list
uvx desktop-agent screen locate-text-coordinates "File" --active
uvx desktop-agent mouse click <returned_x> <returned_y>
uvx desktop-agent screen locate-text-coordinates "Save As" --active
uvx desktop-agent mouse click <returned_x> <returned_y>

Platform-Specific Considerations

Windows

# Common Windows shortcuts
uvx desktop-agent keyboard hotkey "win,d"        # Show desktop
uvx desktop-agent keyboard hotkey "win,e"        # Open Explorer
uvx desktop-agent keyboard hotkey "alt,tab"      # Switch windows
uvx desktop-agent keyboard hotkey "win,r"        # Run dialog

# Open apps by name
uvx desktop-agent app open notepad
uvx desktop-agent app open calc
uvx desktop-agent app open mspaint

macOS

# Common macOS shortcuts (use 'command' for Cmd key)
uvx desktop-agent keyboard hotkey "command,space"   # Spotlight
uvx desktop-agent keyboard hotkey "command,tab"     # App switcher
uvx desktop-agent keyboard hotkey "command,q"       # Quit app
uvx desktop-agent keyboard hotkey "command,shift,3" # Screenshot

# Open apps
uvx desktop-agent app open "Safari"
uvx desktop-agent app open "TextEdit"

Linux

# Open apps (uses xdg-open or direct command)
uvx desktop-agent app open firefox
uvx desktop-agent app open gedit

# Common shortcuts may vary by DE
uvx desktop-agent keyboard hotkey "alt,f2"       # Run dialog (many DEs)
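Since the primary shortcut modifier differs by platform (`ctrl` on Windows and Linux, `command` on macOS), an agent can select it at run time. A minimal Python sketch; `primary_modifier` and `copy_hotkey` are hypothetical helpers, and the key names follow the ones used in this document.

```python
import platform


def primary_modifier(system=None):
    """Return the main shortcut modifier for the given OS name.

    `system` uses the values of platform.system(): "Windows", "Linux",
    or "Darwin" (macOS). Defaults to the current platform.
    """
    system = system or platform.system()
    return "command" if system == "Darwin" else "ctrl"


def copy_hotkey(system=None):
    """Build the comma-separated hotkey string for "copy"."""
    return f"{primary_modifier(system)},c"
```

The result can be passed straight to `keyboard hotkey`, which keeps workflows such as copy/paste portable across platforms.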

Decision Tree: Choosing the Right Command

Want to interact with an app?
├── App not running → `app open <name>`
├── App running but not focused → `app focus <name>`
└── Need to verify windows → `app list`

Want to find a UI element?
├── Have reference image → `screen locate-center <image>`
├── Know the text label → `screen locate-text-coordinates "<text>"`
└── Need to see all text → `screen read-all-text --active`

Want to click something?
├── Know exact coordinates → `mouse click <x> <y>`
├── Need to find first → Use locate commands above, then click returned coords
└── Not sure if on screen → `screen on-screen <x> <y>` first

Want to type something?
├── Regular text → `keyboard write "<text>"`
├── Keyboard shortcut → `keyboard hotkey "<key1>,<key2>"`
├── Single key press → `keyboard press <key>`
└── Multiple of same key → `keyboard press <key> --presses N`
