web-research-workflow

📁 yonatangross/orchestkit 📅 7 days ago

总安装量

周安装量

#11345

全站排名

安装命令

npx skills add https://github.com/yonatangross/orchestkit --skill web-research-workflow

Agent 安装分布

claude-code 4

opencode 3

github-copilot 3

gemini-cli 3

replit 2

Skill 文档

Web Research Workflow

Unified approach for web content research that automatically selects the right tool for each situation.

Quick Decision Tree

URL to research
     â
     â¼
âââââââââââââââââââ
â 1. Try WebFetch â â Fast, free, no overhead
â    (always try) â
âââââââââââââââââââ
     â
Content OK? ââYesâââº Parse and return
     â
     No (empty/partial/<500 chars)
     â
     â¼
âââââââââââââââââââââââââ
â 2. TAVILY_API_KEY set?â
âââââââââââââââââââââââââ
     â          â
    Yes         No âââº Skip to step 3
     â
     â¼
âââââââââââââââââââââââââââââ
â Tavily search/extract/    â â Raw markdown, batch URLs
â crawl/research            â
âââââââââââââââââââââââââââââ
     â
Content OK? ââYesâââº Parse and return
     â
     No (JS-rendered/auth-required)
     â
     â¼
âââââââââââââââââââââââ
â 3. Use agent-browser â â Full browser, last resort
âââââââââââââââââââââââ
     â
ââ SPA (react/vue/angular) âââº wait --load networkidle
ââ Login required âââº auth flow + state save
ââ Dynamic content âââº wait --text "Expected"
ââ Multi-page âââº crawl pattern

Tavily Enhanced Research

When TAVILY_API_KEY is set, Tavily provides a powerful middle tier between WebFetch and agent-browser. It returns raw markdown content, supports batch URL extraction, and offers semantic search with relevance scoring.

When to use Tavily over WebFetch:

WebFetch returned <500 chars (likely incomplete)
You need raw markdown content (not Haiku-summarized)
Batch extracting content from multiple URLs
Semantic search with relevance scoring
Site discovery/crawling (map API)

When to skip Tavily and go to agent-browser:

Content requires JavaScript rendering (SPAs)
Authentication/login is required
Interactive elements need clicking
Content is behind CAPTCHAs

Tavily Search (Semantic Web Search)

Returns relevance-scored results with raw markdown content:

curl -s -X POST 'https://api.tavily.com/search' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TAVILY_API_KEY" \
  -d '{
    "query": "your search query",
    "search_depth": "advanced",
    "max_results": 5,
    "include_raw_content": "markdown"
  }' | python3 -m json.tool

Options:

search_depth: "basic" (fast) or "advanced" (thorough, 2x cost)
topic: "general" (default) or "news" or "finance"
include_domains: ["example.com"] â restrict to specific sites
exclude_domains: ["reddit.com"] â filter out sites
days: 3 â limit to recent results (news/finance)
include_raw_content: "markdown" â get full page content

Tavily Extract (Batch URL Content)

Extract raw content from up to 20 URLs at once:

curl -s -X POST 'https://api.tavily.com/extract' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TAVILY_API_KEY" \
  -d '{
    "urls": [
      "https://docs.example.com/guide",
      "https://competitor.com/pricing"
    ]
  }' | python3 -m json.tool

Returns markdown content for each URL. Use when you have specific URLs and need full content.

Tavily Map (Site Discovery)

Discover all URLs on a site before extracting:

curl -s -X POST 'https://api.tavily.com/map' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TAVILY_API_KEY" \
  -d '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "limit": 50
  }' | python3 -m json.tool

Useful for documentation sites and competitor sitemaps. Combine with extract for full crawl.

Tavily Crawl (Multi-Page Content Extraction)

Crawl an entire site and extract content from all discovered pages in one call. Replaces the manual mapâextract two-step workflow:

curl -s -X POST 'https://api.tavily.com/crawl' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TAVILY_API_KEY" \
  -d '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "limit": 50,
    "include_raw_content": "markdown"
  }' | python3 -m json.tool

Options:

max_depth: 2 â how many link-hops from the seed URL
limit: 50 â max pages to crawl
include_raw_content: "markdown" â get full page content (not just snippets)
exclude_paths: ["/blog/*"] â skip certain URL patterns
include_paths: ["/docs/*"] â restrict to certain URL patterns

When to use Crawl vs Map+Extract:

Use Crawl when you want content from an entire site section (docs, changelog, pricing)
Use Map when you only need URL discovery without content
Use Extract when you already have specific URLs

Tavily Research (Deep Research â Beta)

Multi-step research agent that searches, reads, and synthesizes across multiple sources. Returns a comprehensive report with citations:

curl -s -X POST 'https://api.tavily.com/research' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TAVILY_API_KEY" \
  -d '{
    "query": "comparison of vector databases for RAG in 2026",
    "max_results": 10,
    "report_type": "research_report"
  }' | python3 -m json.tool

Options:

report_type: "research_report" (default) | "outline_report" | "detailed_report"
max_results: number of sources to analyze (more = deeper but slower)

Note: The /research endpoint is in beta. Falls back gracefully to /search + /extract if unavailable. Best for deep competitive analysis, market research, and technical comparisons.

Escalation Heuristic

# Auto-escalate from WebFetch to Tavily
CONTENT=$(WebFetch url="$URL" prompt="Extract main content")
CHAR_COUNT=${#CONTENT}

if [ "$CHAR_COUNT" -lt 500 ] && [ -n "$TAVILY_API_KEY" ]; then
  # WebFetch returned thin content â try Tavily extract
  RESULT=$(curl -s -X POST 'https://api.tavily.com/extract' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $TAVILY_API_KEY" \
    -d "{\"urls\":[\"$URL\"]}")
  # Parse result...
fi

Cost Awareness

API	Cost	Notes
Tavily Search	1 credit/search	`advanced` depth = 2 credits
Tavily Extract	1 credit/5 URLs	Batch up to 20 URLs
Tavily Map	1 credit/10 pages	Good for site discovery
Tavily Crawl	1 credit/5 pages	Full site extraction in one call
Tavily Research	5 credits/query	Deep multi-source synthesis (beta)
WebFetch	Free	Always try first
agent-browser	Free (compute)	Slowest, most capable

Graceful Degradation

If TAVILY_API_KEY is not set, the 3-tier tree collapses to the original 2-tier (WebFetch â agent-browser). No configuration needed â agents check for the env var before attempting Tavily calls.

When to Use What

Scenario	Tool	Why
Static HTML page	WebFetch	Fast, no browser needed
Public API docs	WebFetch	Usually server-rendered
GitHub README	WebFetch	Static content
Batch URL extraction (2-20 URLs)	Tavily Extract	Parallel, raw markdown
Semantic search with content	Tavily Search	Relevance-scored results
Site crawl/discovery (URLs only)	Tavily Map	Finds all URLs on a domain
Full site content extraction	Tavily Crawl	Crawl + extract in one call
Competitor page deep content	Tavily Extract	Full markdown, not summarized
Deep multi-source research	Tavily Research	Synthesized report with citations (beta)
React/Vue/Angular app	agent-browser	Needs JS execution
Interactive pricing page	agent-browser	Dynamic content
Login-protected content	agent-browser	Needs session state
Swagger UI	agent-browser	Client-rendered
Documentation with sidebar nav	agent-browser	Client-side routing

Pattern 1: Auto-Fallback

Try WebFetch first, fall back to browser if needed:

# Step 1: Try WebFetch
WebFetch(url="https://example.com", prompt="Extract main content")

# If result is empty, partial, or contains "Loading..." indicators:
# Step 2: Fall back to browser
agent-browser open https://example.com
agent-browser wait --load networkidle
agent-browser get text body

Detection Heuristics

Content likely needs browser if WebFetch returns:

Empty or very short content (< 500 chars)
Contains <noscript> tags
Contains “Loading…”, “Please wait”, “JavaScript required”
Contains only <div id="root"></div> or <div id="app"></div>
Returns 403/401 (may need auth)

Pattern 2: SPA Detection

Known patterns that always need browser:

# URL patterns suggesting SPA
app.* | dashboard.* | portal.* | console.*

# Framework indicators in initial HTML
"__NEXT_DATA__"     â Next.js (may work with WebFetch)
"window.__NUXT__"   â Nuxt.js (may work with WebFetch)
"ng-app"            â Angular (needs browser)
"data-reactroot"    â React (needs browser)
"data-v-"           â Vue (needs browser)

Pattern 3: Authentication Flow

For login-protected content:

# 1. Navigate to login
agent-browser open https://app.example.com/login
agent-browser snapshot -i

# 2. Fill credentials (use refs from snapshot)
agent-browser fill @e1 "$EMAIL"
agent-browser fill @e2 "$PASSWORD"
agent-browser click @e3  # Submit button

# 3. Wait for redirect
agent-browser wait --url "**/dashboard"

# 4. Save session for reuse
agent-browser state save /tmp/session-example.json

# 5. Later: restore session
agent-browser state load /tmp/session-example.json
agent-browser open https://app.example.com/protected-page

Pattern 4: Multi-Page Research

For documentation sites or multi-page content:

# 1. Get navigation links
agent-browser open https://docs.example.com
agent-browser snapshot -i

# 2. Extract all doc links
LINKS=$(agent-browser eval "JSON.stringify(
  Array.from(document.querySelectorAll('nav a'))
    .map(a => a.href)
    .filter(h => h.includes('/docs/'))
)")

# 3. Iterate with rate limiting
for link in $(echo "$LINKS" | jq -r '.[]' | head -20); do
  agent-browser open "$link"
  agent-browser wait --load networkidle
  agent-browser get text article > "/tmp/doc-$(basename $link).txt"
  sleep 2  # Rate limit
done

Best Practices

1. Always Try WebFetch First

# WebFetch is 10x faster and uses no browser resources
# Only fall back to browser when necessary

2. Use Appropriate Waits

# For SPAs with API calls
agent-browser wait --load networkidle

# For specific content
agent-browser wait --text "Expected content"

# For elements
agent-browser wait @e5

3. Respect Rate Limits

# Add delays between requests
sleep 2

# Use session isolation for parallel work
agent-browser --session site1 open https://site1.com
agent-browser --session site2 open https://site2.com

4. Cache Results

# Save extracted content to avoid re-scraping
agent-browser get text body > /tmp/cache/example-com.txt

Troubleshooting

Issue	Solution
Empty content from WebFetch	Try Tavily extract, then agent-browser
WebFetch returns <500 chars	Escalate to Tavily extract if API key set
Partial content	Use `wait --text "Expected"` for specific content
Need batch URL extraction	Tavily extract (up to 20 URLs at once)
403 Forbidden	May need authentication flow (agent-browser)
CAPTCHA	Manual intervention required
Rate limited	Add delays, reduce request frequency
Content in iframe	Use `agent-browser frame @e1` then extract
No TAVILY_API_KEY	Skip Tavily tier, use WebFetch â agent-browser

BrightData PRO_MODE (Optional Cloud Scraping)

When BRIGHTDATA_API_TOKEN is set and the BrightData MCP server is configured, PRO_MODE provides specialized tool groups for cloud-based scraping without running a local browser.

Tool Groups

Group	Use Case
`ecommerce`	Product pages, pricing, reviews (Amazon, Shopify, etc.)
`social`	Social media profiles, posts, metrics
`finance`	Financial data, stock quotes, SEC filings
`business`	Company info, job listings, reviews
`research`	Academic papers, patents, datasets
`browser`	General-purpose cloud browser (JS rendering)

Configuration

Set GROUPS or TOOLS env vars to control which tools are available:

# Enable specific groups
GROUPS=ecommerce,business npx @anthropic-ai/mcp brightdata

# Or enable specific tools
TOOLS=scraping_browser,web_data_amazon npx @anthropic-ai/mcp brightdata

Alternative: @brightdata/browserai-mcp

For serverless/cloud environments where agent-browser is unavailable:

npx @brightdata/browserai-mcp

Provides the same browsing capabilities but runs in BrightData’s cloud infrastructure. Use when:

No local browser available (CI/CD, serverless)
Need residential proxies for geo-restricted content
High-volume scraping that would trigger rate limits locally

Decision Tree: BrightData vs agent-browser

Scenario	Recommended
Local dev, occasional scraping	agent-browser
CI/CD pipeline, no browser	BrightData browserai-mcp
E-commerce price monitoring	BrightData PRO_MODE (ecommerce)
Geo-restricted content	BrightData (residential proxies)
High-volume parallel scraping	BrightData (cloud infrastructure)
Login-protected, custom flows	agent-browser (full control)

Integration with Agents

This skill is used by:

web-research-analyst – Primary user
market-intelligence – Competitor research
product-strategist – Deep competitive analysis
ux-researcher – Design system capture
documentation-specialist – API doc extraction

Related Skills

browser-content-capture – Detailed browser patterns
agent-browser – CLI reference
competitive-monitoring – Change tracking

Version: 1.2.0 (February 2026)

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台

web-research-workflow

Agent 安装分布

Skill 文档

Web Research Workflow

Quick Decision Tree

Tavily Enhanced Research

Tavily Search (Semantic Web Search)

Tavily Extract (Batch URL Content)

Tavily Map (Site Discovery)

Tavily Crawl (Multi-Page Content Extraction)

Tavily Research (Deep Research â Beta)

Escalation Heuristic

Cost Awareness

Graceful Degradation

When to Use What

Pattern 1: Auto-Fallback

Detection Heuristics

Pattern 2: SPA Detection

Pattern 3: Authentication Flow

Pattern 4: Multi-Page Research

Best Practices

1. Always Try WebFetch First

2. Use Appropriate Waits

3. Respect Rate Limits

4. Cache Results

Troubleshooting

BrightData PRO_MODE (Optional Cloud Scraping)

Tool Groups

Configuration

Alternative: @brightdata/browserai-mcp

Decision Tree: BrightData vs agent-browser

Integration with Agents

Related Skills

Tavily Research (Deep Research â Beta)