browser automation
npx skills add https://github.com/web-infra-dev/midscene-skills --skill Browser Automation
Skill 文档
Browser Automation
CRITICAL RULES â VIOLATIONS WILL BREAK THE WORKFLOW:
- NEVER set
run_in_background: trueon any Bash tool call for midscene commands. Everynpx @midscene/webcommand MUST userun_in_background: false(or omit the parameter entirely). Background execution causes notification spam after the task ends and breaks the screenshot-analyze-act loop.- Send only ONE midscene CLI command per Bash tool call. Wait for its result, read the screenshot, then decide the next action. Do NOT chain commands with
&&,;, orsleep.- Set
timeout: 60000(60 seconds) on each Bash tool call to allow sufficient time for midscene commands to complete synchronously.
Automate web browsing using npx @midscene/web. Launches a headless Chrome via Puppeteer that persists across CLI calls â no session loss between commands. Each CLI command maps directly to an MCP tool â you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
When to Use
Use this skill when:
- The user wants to browse or navigate to a specific URL
- You need to scrape, extract, or collect data from websites
- You want to verify or test frontend UI behavior
- The user wants screenshots of web pages
If you need to preserve login sessions or work with the user’s existing browser tabs, use the Chrome Bridge Automation skill instead.
Prerequisites
The CLI automatically loads .env from the current working directory. Before first use, verify the .env file exists and contains the API key:
cat .env | grep MIDSCENE_MODEL_API_KEY | head -c 30
If no .env file or no API key, ask the user to create one. See Model Configuration for supported providers.
Do NOT run echo $MIDSCENE_MODEL_API_KEY â the key is loaded from .env at runtime, not from shell environment.
Commands
Connect to a Web Page
npx @midscene/web connect --url https://example.com
Take Screenshot
npx @midscene/web take_screenshot
After taking a screenshot, read the saved image file to understand the current page state before deciding the next action.
Perform Actions
Use actionSpace tools to interact with the page:
npx @midscene/web Tap --locate '{"prompt":"the Login button"}'
npx @midscene/web Input --locate '{"prompt":"the email field"}' --value 'user@example.com'
npx @midscene/web Scroll --direction down
npx @midscene/web Hover --locate '{"prompt":"the navigation menu"}'
npx @midscene/web KeyboardPress --value Enter
npx @midscene/web DragAndDrop --locate '{"prompt":"the draggable item"}' --target '{"prompt":"the drop zone"}'
Natural Language Action
Use act to execute multi-step operations in a single command â useful for transient UI interactions:
npx @midscene/web act --prompt "click the country dropdown and select Japan"
Disconnect
Disconnect from the page but keep the browser running:
npx @midscene/web disconnect
Close Browser
Close the browser completely when finished:
npx @midscene/web close
Workflow Pattern
The browser persists across CLI calls via a background Chrome process. Follow this pattern:
- Connect to a URL to open a new tab
- Take screenshot to see the current state
- Analyze the screenshot to decide the next action
- Execute action (Tap, Input, Scroll, etc.)
- Take screenshot again to verify the result
- Repeat steps 3-5 until the task is complete
- Close the browser when done (or disconnect to keep it for later)
Best Practices
- Always connect first: Navigate to the target URL with
connect --urlbefore any interaction. - Take screenshots frequently: Before and after each action to verify state changes.
- Be specific in locate prompts: Instead of
"the button", say"the blue Submit button in the contact form". - Use natural language: Describe what you see on the page, not CSS selectors. Say
"the red Buy Now button"instead of"#buy-btn". - Handle loading states: After navigation or actions that trigger page loads, take a screenshot to verify the page has loaded.
- Close when done: Use
closeto shut down the browser and free resources. - Never run in background: On every Bash tool call, either omit
run_in_backgroundor explicitly set it tofalse. Never setrun_in_background: true.
Handle Transient UI
Dropdowns, autocomplete popups, tooltips, and confirm dialogs disappear between commands. When interacting with transient UI:
- Use
actfor multi-step transient interactions â it executes everything in a single process - Or execute commands rapidly in sequence â do NOT take screenshots between steps
- Do NOT pause to analyze â run all commands for the transient interaction back-to-back
- Persistent UI (page content, navigation bars, sidebars) is fine to interact with across separate commands
Example â Dropdown selection using act (recommended for transient UI):
npx @midscene/web act --prompt "click the country dropdown and select Japan"
npx @midscene/web take_screenshot
Example â Dropdown selection using individual commands (alternative):
# These commands must be run back-to-back WITHOUT screenshots in between
npx @midscene/web Tap --locate '{"prompt":"the country dropdown"}'
npx @midscene/web Tap --locate '{"prompt":"Japan option in the dropdown list"}'
# NOW take a screenshot to verify the result
npx @midscene/web take_screenshot
Common Patterns
Simple Browsing
npx @midscene/web connect --url 'https://news.ycombinator.com'
npx @midscene/web take_screenshot
# Read the screenshot, then decide next action
npx @midscene/web close
Multi-Step Interaction
npx @midscene/web connect --url 'https://example.com'
npx @midscene/web Tap --locate '{"prompt":"the Sign In link"}'
npx @midscene/web take_screenshot
npx @midscene/web Input --locate '{"prompt":"the email field"}' --value 'user@example.com'
npx @midscene/web Input --locate '{"prompt":"the password field"}' --value 'password123'
npx @midscene/web Tap --locate '{"prompt":"the Log In button"}'
npx @midscene/web take_screenshot
npx @midscene/web close
Frontend Verification
npx @midscene/web connect --url 'http://localhost:3000'
npx @midscene/web take_screenshot
# Analyze: verify login form is visible
npx @midscene/web Input --locate '{"prompt":"the email field"}' --value 'test@example.com'
npx @midscene/web Input --locate '{"prompt":"the password field"}' --value 'password'
npx @midscene/web Tap --locate '{"prompt":"the Submit button"}'
npx @midscene/web take_screenshot
# Analyze: verify the welcome message is displayed
npx @midscene/web close
Data Extraction
npx @midscene/web connect --url 'https://example.com/products'
npx @midscene/web take_screenshot
# Read the screenshot to extract product names, prices, and ratings
npx @midscene/web close
Frontend Verification Workflow
When asked to verify or test a frontend application:
- Start the dev server if not already running (e.g.,
npm run dev). - Connect to the local URL (e.g.,
http://localhost:3000). - Take a screenshot to see the initial state.
- Analyze the screenshot to verify expected UI elements are present.
- Perform interactions (Tap, Input, Scroll) to test user flows.
- Take screenshots after each step to verify outcomes.
- Close the browser when finished.
Troubleshooting
Connection Failures
- Ensure Chrome/Chromium is installed on the system (Puppeteer downloads its own by default).
- Check that no firewall blocks local Chrome debugging ports.
API Key Errors
- Check
.envfile containsMIDSCENE_MODEL_API_KEY=<your-key>. - Verify the key is valid for the configured model provider.
Timeouts
- Web pages may take time to load. After connecting, take a screenshot to verify readiness before interacting.
- For slow pages, wait briefly between steps.
Screenshots Not Displaying
- The screenshot path is an absolute path to a local file. Use the Read tool to view it.