niche-signal-discovery
npx skills add https://github.com/getaero-io/gtm-eng-skills --skill niche-signal-discovery
Agent 安装分布
Skill 文档
Niche Signal Discovery
Discover differential signals between Closed Won and Closed Lost accounts by extracting multi-page website content and job listings, then computing Laplace-smoothed lift scores to identify what distinguishes buyers from non-buyers.
Prerequisites
Required:
- Deepline CLI â All enrichment runs through
deepline enrich(no separate API keys for exa, crustdata, etc.) - Python 3 â Analysis script uses standard library only (csv, json, re, argparse â no pip packages)
NOT required:
- â No exa API key (accessed via Deepline)
- â No crustdata API key (accessed via Deepline)
- â No external Python packages (no pip install needed)
- â No OpenAI/Anthropic API keys (deterministic analysis, no LLM calls)
Credits: Enrichment consumes Deepline credits (~6 credits/company for exa + crustdata). Always get user approval before running paid enrichment.
Deepline-First Principle
Always use Deepline CLI (deepline enrich, deepline tools, deepline playground) for enrichment, data extraction, and batch operations. Deepline provides:
- Batch enrichment with automatic CSV column management
- Built-in tool integrations (exa_search, crustdata, apollo, etc.)
- Playground UI for inspecting and iterating on enrichment results
- Idempotent reruns â re-running updates existing columns instead of duplicating
Use deepline enrich for all enrichment steps. Use deepline tools execute for one-off tool calls. Use deepline playground for inspecting results. Refer to the gtm-meta-skill for Deepline command patterns and provider playbooks.
Input Requirements
- Won and lost customer domain lists (minimum 20 won + 10 lost for statistical significance)
- Target company context â What they sell, who they sell to, key personas (discovered in Step 0)
Pipeline
0. Discover target company (what they sell, who they sell to, differentiation)
0.5. Discover ecosystem (competitors, tech stack category, buyer personas)
1. Prepare input CSV (domain, status)
1.5. Generate vertical-specific configs (keywords, tools, job roles)
2. Multi-page website extraction + job listings (Deepline enrichment)
3. Quality gate â verify coverage (>80% with content)
3.5. Review generated configs (validate against enriched data)
4. Run differential analysis (scripts/analyze_signals.py)
5. Generate report (using references/report-template.md)
6. Review with signal interpretation rules (references/signal-interpretation.md)
Step 0: Target Company Discovery
CRITICAL: Do this FIRST before any enrichment or config generation.
Use web search to understand what the target company sells and who they sell to:
# Example for Air Inc
WebSearch: "air.inc product what do they sell"
WebSearch: "air.inc customers use cases who uses"
WebFetch: https://air.inc/ "What does Air.inc sell? Who are their target customers?"
Document the following:
- Product category â e.g., “Creative Operations & DAM platform”, “AR automation software”, “Sales engagement platform”
- Target buyer persona â e.g., “Marketing and creative teams”, “Finance/AR teams”, “Sales Development Reps”
- Key differentiation â What makes them unique vs competitors
- Example customers (if available) â Who currently uses the product
Why this matters: The entire pipeline (exa query, keywords, tech stack, job roles) adapts based on this discovery. Skipping this step results in generic/irrelevant signals.
Step 0.5: Ecosystem Discovery
Use web search to discover the competitive landscape and buyer ecosystem:
Competitor Discovery:
WebSearch: "{product category} software companies"
WebSearch: "{product category} alternatives competitors"
Example for Air Inc (creative ops/DAM):
WebSearch: "creative operations software digital asset management platforms"
WebSearch: "DAM software alternatives Bynder Widen competitors"
Tech Stack Discovery:
WebSearch: "{buyer persona} common tools tech stack"
WebSearch: "{industry} companies use what software"
Example for Air Inc (marketing/creative teams):
WebSearch: "marketing teams creative teams common tools software stack"
WebSearch: "creative operations teams use Figma Adobe tools integrations"
Job Role Discovery:
WebSearch: "{buyer persona} job titles roles responsibilities"
WebSearch: "companies hiring {buyer persona} job listings"
Example for Air Inc:
WebSearch: "creative operations job titles creative director content manager roles"
WebSearch: "companies hiring creative operations manager brand manager"
Document findings:
- Competitor products (3-5 names) â Used to identify migration opportunities (companies currently using these tools)
- Common tech stack tools (10-15 by category) â Used for infrastructure signals
- Buyer job titles (10-15 variations) â Used for hiring intent signals
Step 1: Prepare Input CSV
Create output/{company}-icp-input.csv:
domain,status
customer1.com,won
customer2.com,won
non-customer1.com,lost
non-customer2.com,lost
Step 1.5: Generate Vertical-Specific Configs
Using the discovery from Steps 0 and 0.5, create three JSON config files.
See references/keyword-catalog.md for JSON format and generation guidance.
# Create config files in output/{company}/
output/{company}-keywords.json # keyword categories
output/{company}-tools.json # tech stack tools by category
output/{company}-job-roles.json # job role categories
Generation approach:
-
Keywords â Mix of:
- Product category terms (e.g., “creative ops”, “asset management” for Air Inc)
- Buyer pain points (e.g., “fragmented tools”, “content discovery”)
- Competitor names for migration segment (e.g., “bynder”, “widen”) â companies using these are migration opportunities, not excluded
- Generic business maturity (e.g., “security”, “compliance”, “integrations”)
-
Tech Stack â Tools from ecosystem discovery:
- Infrastructure tools buyer persona uses (e.g., “figma”, “adobe creative cloud”)
- Adjacent tools in buyer workflow (e.g., “contentful”, “monday.com”)
- Competitor tools for migration segment â companies using these represent displacement opportunities
-
Job Roles â Titles from role discovery:
- Buyer persona variations (e.g., “creative director”, “content manager”, “brand manager”)
- Adjacent roles (e.g., “marketing operations”, “brand designer”)
- Generic growth roles (e.g., “product”, “engineering”)
Example generation for Air Inc (creative ops/DAM):
// keywords.json
{
"creative_operations": [
"creative ops", "creative operations", "asset management", "DAM",
"content library", "brand guidelines", "creative workflow", "creative review"
],
"buyer_pain_points": [
"fragmented tools", "content discovery", "version control",
"creative approval", "asset organization", "brand consistency"
],
"business_model": [
"contact sales", "plan", "enterprise", "pricing", "demo",
"subscription", "free trial"
],
"company_maturity": [
"compliance", "security", "integration", "api", "sso", "soc 2"
],
"anti_fit": [
"bynder", "widen", "brandfolder", "canto", "dam platform"
]
}
// tools.json
{
"creative_design": [
"figma", "sketch", "adobe creative cloud", "canva", "invision"
],
"marketing_ops": [
"hubspot", "marketo", "contentful", "wordpress", "webflow"
],
"project_management": [
"monday.com", "asana", "jira", "clickup", "notion"
],
"video_production": [
"frame.io", "vimeo", "wistia", "loom"
]
}
// job-roles.json
{
"creative_leadership": ["creative director", "head of creative", "vp creative"],
"content_management": ["content manager", "content director", "brand manager"],
"creative_ops": ["creative operations", "creative ops manager", "brand operations"],
"marketing_ops": ["marketing operations", "marops", "marketing ops manager"],
"design": ["product designer", "brand designer", "visual designer"],
"marketing_general": ["marketing manager", "demand gen", "growth marketing"]
}
Validation: Do the generated configs match the target’s vertical and buyer persona? If not, refine based on Step 0/0.5 findings.
Step 2: Deepline Enrichment
CRITICAL: Never scrape just the homepage. Use exa_search with contents.text to discover AND scrape ~8 pages per domain in a single API call.
Generate exa query dynamically based on target’s product category:
# Generic query (works for most B2B SaaS)
QUERY="company product features integrations customers security pricing careers about case-studies"
# Vertical-specific adjustments:
# - Creative/marketing tools: add "portfolio", "use cases", "creative workflow"
# - Finance tools: add "compliance", "reporting", "accounting integrations"
# - Sales tools: add "playbooks", "outbound", "pipeline"
# - Developer tools: add "documentation", "api reference", "github"
Example for Air Inc (creative ops):
deepline enrich \
--input output/air-inc-icp-input.csv \
--output output/air-inc-enriched.csv \
--with 'website=exa_search:{"query":"company product features portfolio use cases creative workflow customers integrations security pricing careers about","numResults":8,"type":"auto","includeDomains":["{{domain}}"],"contents":{"text":{"maxCharacters":3000,"verbosity":"compact","includeSections":["body"]}}}' \
--with 'jobs=crustdata_job_listings:{"companyDomains":"{{domain}}","limit":50}' \
--json
Why exa_search with contents (not parallel_extract)?
- Discovers pages AND returns content in one call â no separate page discovery step
- ~8 pages per domain, ~3K chars each = ~24K chars per company
- Cost: ~5 credits/company (exa) + ~1 credit (crustdata jobs) = ~6 credits/company
Get user credit approval before running. Example: “60 companies x 6 credits = ~360 credits.”
Step 3: Quality Gate
After enrichment, verify before proceeding:
- Coverage: >80% of companies should have website content. If <80%, check domain spelling and retry failed rows.
- Content depth: Average should be 6-8 pages per company, 12-20K chars.
- Job listings: Won companies should have more job data than lost (expected â larger/scaling companies win more).
If coverage is poor, re-run failed domains with --rows targeting specific rows.
Step 3.5: Review Generated Configs
Before running analysis, spot-check the generated configs against enriched data:
# Sample a few enriched companies
deepline playground output/{company}-enriched.csv
# In playground UI, check:
# - Do website pages mention the keywords in keywords.json?
# - Do job listings mention the roles in job-roles.json?
# - Do integrations/tech stack pages mention the tools in tools.json?
Red flags:
- Keywords appear in <10% of enriched companies â too niche, broaden
- Keywords appear in >90% of companies â too generic, refine
- Product category keywords (what the target SELLS) appear frequently in Won companies â wrong product category, these companies are competitors not buyers
- Job roles missing from actual job listings â wrong buyer persona, revise
Fix and re-generate configs if needed.
Step 4: Differential Analysis
Run the analysis script with the config files:
python3 scripts/analyze_signals.py \
--input output/{company}-enriched.csv \
--keywords output/{company}-keywords.json \
--tools output/{company}-tools.json \
--job-roles output/{company}-job-roles.json \
--output output/{company}-analysis.json
The script auto-detects __dl_full_result__ columns for website and jobs data. Override with --website-col N --jobs-col N if needed.
What the script computes:
- Substring-match presence per company per keyword (NOT exact-token TF-IDF)
- Laplace-smoothed lift:
((won + 0.5) / (won_total + 1)) / ((lost + 0.5) / (lost_total + 1)) - Source breakdown per keyword (website only / jobs only / both)
- Tech stack tool mentions across categories
- Job role prevalence in won vs lost
- Anti-fit signals (lift < 0.5x)
- Source evidence: exact quotes (±40 chars) with page URLs and job listing URLs per keyword
Step 5: Report Generation
Read references/report-template.md for the full report structure and quality rules.
Key quality rules:
- Raw counts always:
15% (6)not just15% - Sample sizes in headers:
Won (n=37),Lost (n=18) - Bold only lift > 2x
- Source evidence required: After each keyword table, add exact quotes with linked sources for top 3 keywords. The analysis script outputs
evidenceper keyword withcompany,source_type,quote,url, andpage_title/job_title. Format:> **Evidence â "keyword":** > - [company.com](url) (page title): "...exact quote with keyword..." > - [company.com](url) (job: "Job Title"): "...exact quote from listing..." - Source breakdown for hiring signals: Add a Source column showing
3w / 20j / 2bothformat - Niche tech stack tools: Report specific SaaS tools by category, not generic keywords
- Anti-fit signals in separate section
- Interpretation column required: Explains WHY each signal matters for the target company
Step 6: Signal Interpretation Review
Read references/signal-interpretation.md before writing interpretation columns. Key rules:
- Website content mentioning what the target sells = competitor signal (NOT buyer)
- Job listings = highest-intent buyer signal
- Same keyword means different things on product page vs careers page vs blog
- Tech stack tools need context â do they create or solve the target’s problem?
Enrichment Data Structure
Website data: __dl_full_result__ column containing exa_search response.
data.results[].text â page content
data.results[].url â page URL
data.results[].title â page title
Job listings: __dl_full_result__ column containing crustdata response.
data.listings[].title â job title (NOT "job_title")
data.listings[].description â job description (NOT "job_description")
data.listings[].category â job category
data.listings[].url â listing URL
Credit Estimation
| Step | Credits per row | Total (60 companies) |
|---|---|---|
| exa_search with contents | ~5 | ~300 |
| crustdata_job_listings | ~1 | ~60 |
| Total | ~6 | ~360 |
Always get user approval before running paid enrichment steps.
Common Pitfalls
- Skipping target discovery (Step 0) â Without understanding what the target sells, you’ll generate generic/irrelevant configs
- Homepage-only scraping: Always use multi-page discovery. Homepage alone misses pricing, integrations, security.
- Using hardcoded examples â Don’t copy sales-focused keywords for a creative-ops tool. Generate configs per vertical.
- Generic tech stack: Search for niche SaaS tools specific to buyer persona, not generic ones (AWS, Slack).
- Ignoring source context: “prospect” on a product page = seller signal. “prospect” in a job listing = buyer signal.
- Missing lost data: Verify lost companies have content before analysis. Empty lost = meaningless lift scores.
- Substring false positives: “sequenc” matches “consequences”. Spot-check high-lift keywords for false matches.
- Skipping config review (Step 3.5) â Always validate generated configs against enriched data before analysis.
References
- keyword-catalog.md: Read when generating configs (Step 1.5). Contains JSON format, generation patterns, and multi-vertical examples.
- report-template.md: Read when generating the report (Step 5). Full section structure, table formats, inline example format, quality rules.
- signal-interpretation.md: Read when writing interpretation columns or reviewing signal quality. Buyer vs seller vs competitor rules.
- scripts/analyze_signals.py: Deterministic analysis script. Run in Step 4. Auto-detects columns, accepts custom keywords/tools via JSON.