pseo-scale

Install command:
npx skills add https://github.com/lisbeth718/pseo-skills --skill pseo-scale


Skill Documentation

pSEO Scale Architecture

Architect pSEO systems that work at 10K-100K+ pages. The patterns in the other pseo-* skills are correct at 1K-10K. Beyond 10K, in-memory data layers, full-corpus validation, and single-deploy rollouts break down. This skill provides the architecture changes needed at scale.

Scale Tiers

| Tier | Pages | Data Layer | Validation | Rollout | Sitemap |
| --- | --- | --- | --- | --- | --- |
| Small | < 1K | JSON/files, in-memory | Full pairwise | Single deploy | Single file |
| Medium | 1K-10K | Files or DB, two-tier memory | Fingerprint-based | 2-4 week batches | Index + children |
| Large | 10K-50K | Database required | Incremental + sampling | Category-by-category | Index + chunked |
| Very Large | 50K-100K+ | Database + cache layer | Delta-only + periodic full | ISR + sitemap waves | Index + streaming |

This skill focuses on the Large and Very Large tiers. If the project is under 10K pages, the standard pseo-* skills are sufficient.

1. Database-Backed Data Layer

At 10K+ pages, JSON files and in-memory arrays stop working. The data layer must move to a database with proper indexing.

Why In-Memory Breaks

Pages    PageIndex in memory    Full content (if loaded)
1K       ~1MB                   ~100-500MB
10K      ~10MB                  ~1-5GB (OOM)
50K      ~50MB (borderline)     ~5-25GB (impossible)
100K     ~100MB                 ~10-50GB (impossible)

At 50K+, even holding all PageIndex records in memory is borderline. At 100K, getAllSlugs() returning an array of 100K objects takes ~100MB and seconds to deserialize. You need cursor-based iteration.

Database Requirements

Minimum schema:

CREATE TABLE pages (
  id            SERIAL PRIMARY KEY,
  slug          TEXT UNIQUE NOT NULL,
  canonical_path TEXT UNIQUE NOT NULL,
  title         TEXT NOT NULL,
  h1            TEXT NOT NULL,
  meta_description TEXT NOT NULL,
  category      TEXT NOT NULL,
  subcategory   TEXT,
  status        TEXT DEFAULT 'published',
  last_modified TIMESTAMPTZ NOT NULL,
  published_at  TIMESTAMPTZ,
  -- Heavy fields (only loaded per-page)
  intro_text    TEXT,
  body_content  TEXT,
  faqs          JSONB,
  related_slugs TEXT[],
  featured_image JSONB,
  -- Scale fields
  data_sufficiency_score REAL,  -- see section 2
  content_hash  TEXT,           -- for incremental validation
  last_validated TIMESTAMPTZ
);

-- Required indexes for pSEO queries
CREATE INDEX idx_pages_category ON pages(category);
CREATE INDEX idx_pages_status ON pages(status) WHERE status = 'published';
CREATE INDEX idx_pages_slug ON pages(slug);
CREATE INDEX idx_pages_last_modified ON pages(last_modified DESC);
CREATE INDEX idx_pages_sufficiency ON pages(data_sufficiency_score);
CREATE INDEX idx_pages_category_status ON pages(category, status);

Categories table:

CREATE TABLE categories (
  slug          TEXT PRIMARY KEY,
  name          TEXT NOT NULL,
  description   TEXT,
  parent_slug   TEXT REFERENCES categories(slug),
  page_count    INT DEFAULT 0,
  last_modified TIMESTAMPTZ
);

Redirects table:

CREATE TABLE redirects (
  source      TEXT PRIMARY KEY,
  destination TEXT NOT NULL,
  created_at  TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_redirects_destination ON redirects(destination);

Data Layer API at Scale

The pseo-data API contract changes at scale:

// getAllSlugs() must support cursor-based iteration at 50K+
async function* getAllSlugsCursor(): AsyncGenerator<{ slug: string; category: string }> {
  let cursor: string | null = null;
  while (true) {
    const batch = await db.query(
      `SELECT slug, category FROM pages
       WHERE status = 'published'
       ${cursor ? `AND slug > $1` : ''}
       ORDER BY slug LIMIT 1000`,
      cursor ? [cursor] : []
    );
    if (batch.length === 0) break;
    for (const row of batch) yield row;
    cursor = batch[batch.length - 1].slug;
  }
}

// getPagesByCategory() must use DB pagination, not in-memory slicing
async function getPagesByCategory(
  category: string,
  opts?: { limit?: number; offset?: number }
): Promise<PageIndex[]> {
  return db.query(
    `SELECT slug, title, h1, meta_description, canonical_path, category, last_modified
     FROM pages
     WHERE category = $1 AND status = 'published'
     ORDER BY title
     LIMIT $2 OFFSET $3`,
    [category, opts?.limit ?? 50, opts?.offset ?? 0]
  );
}

// getRelatedPages() needs a precomputed index or efficient query
async function getRelatedPages(slug: string, limit = 5): Promise<PageIndex[]> {
  // Option A: Use related_slugs array from the page record
  // Option B: Same category + shared tags query
  // Option C: Precomputed relatedness table (best at 50K+)
  return db.query(
    `SELECT p.slug, p.title, p.h1, p.meta_description, p.canonical_path,
            p.category, p.last_modified
     FROM pages p
     JOIN pages source ON source.slug = $1
     WHERE p.category = source.category
       AND p.slug != $1
       AND p.status = 'published'
     ORDER BY p.last_modified DESC
     LIMIT $2`,
    [slug, limit]
  );
}

Connection Pooling

At build time with parallel page generation, you need connection pooling:

import { Pool } from "pg";

const pool = new Pool({
  max: 10,                    // limit concurrent connections during build
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000,
});

ORM alternative: If using Prisma or Drizzle, configure connection pool limits in the ORM config. Default pool sizes are often too high for build processes.
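
For example, a minimal sketch of capping the pool when running Drizzle over node-postgres (the DATABASE_URL env var and the shared db export here are assumptions for illustration, not part of this skill's API):

import { Pool } from "pg";
import { drizzle } from "drizzle-orm/node-postgres";

// Reuse one capped pool across the whole build process
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,
});

export const db = drizzle(pool);

// With Prisma, the equivalent cap is a connection_limit parameter on the datasource URL,
// e.g. postgresql://...?connection_limit=10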

See references/database-patterns.md for full patterns by database type.

2. Data Sufficiency Gating

At 100K pages, many combinations will produce thin content. Gate page generation BEFORE build time — don’t create pages that will fail quality checks.

Sufficiency Score

Compute a score per potential page based on available data:

function computeSufficiencyScore(record: RawRecord): number {
  let score = 0;
  const weights = {
    hasTitle: 10,
    hasDescription: 10,           // > 50 chars
    hasBodyContent: 20,           // > 200 words
    hasFAQs: 15,                  // >= 3 Q&A pairs
    hasUniqueAttributes: 15,      // >= 3 non-null structured attributes
    hasImage: 5,
    hasCategory: 10,
    hasNumericData: 10,           // stats, ratings, prices — LLM citation signal
    hasSourceCitation: 5,         // data provenance for E-E-A-T
  };

  if (record.title?.length > 10) score += weights.hasTitle;
  if (record.description?.length > 50) score += weights.hasDescription;
  if (wordCount(record.bodyContent) > 200) score += weights.hasBodyContent;
  if (record.faqs?.length >= 3) score += weights.hasFAQs;
  // ... etc

  return score; // 0-100
}

Gating Thresholds

| Score | Action |
| --- | --- |
| 80-100 | Generate page — sufficient data |
| 60-79 | Generate page with enrichment flag — mark for content pipeline |
| 40-59 | Hold — do not generate until data is enriched |
| 0-39 | Reject — insufficient data, do not generate |

Store the score in the database (data_sufficiency_score column) so you can:

  • Query how many pages are gated vs. ready
  • Track enrichment progress over time
  • Re-score after data enrichment
  • Gate at build time: WHERE data_sufficiency_score >= 60 AND status = 'published' (a query sketch follows this list)
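
A minimal sketch of that build-time gate, reusing the db helper from the data layer (getPublishableSlugs is a hypothetical name):

async function getPublishableSlugs(minScore = 60): Promise<string[]> {
  const rows = await db.query(
    `SELECT slug FROM pages
     WHERE data_sufficiency_score >= $1
       AND status = 'published'
     ORDER BY slug`,
    [minScore]
  );
  return rows.map((r) => r.slug);
}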

Combination Gating

For combination pages (service × city, product × use-case), both dimensions must have sufficient data:

function gateCombination(dimA: RawRecord, dimB: RawRecord): boolean {
  // Both dimensions must independently clear a minimum bar
  const scoreA = computeSufficiencyScore(dimA);
  const scoreB = computeSufficiencyScore(dimB);

  // The combination itself must produce enough unique content
  // that it's not just "dimA text + dimB text" pasted together
  const combinationHasUniqueContent =
    dimA.attributes?.length >= 2 &&
    dimB.attributes?.length >= 2;

  return scoreA >= 40 && scoreB >= 40 && combinationHasUniqueContent;
}

This is the most important scale pattern. 500 services × 200 cities = 100K combinations, but maybe only 15K have enough data for a quality page. Generate only the 15K.
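
As a usage sketch, generation enumerates the combination space but only keeps the pairs that pass the gate (the services and cities arrays are hypothetical inputs):

const publishable: Array<{ service: RawRecord; city: RawRecord }> = [];

for (const service of services) {
  for (const city of cities) {
    if (gateCombination(service, city)) {
      publishable.push({ service, city });
    }
  }
}
// e.g. 100,000 candidate pairs in, ~15,000 publishable pages out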

3. Content Enrichment Pipeline

At 100K pages, you can’t manually write intros, FAQs, and descriptions. You need an automated enrichment pipeline with quality controls.

Pipeline Architecture

Raw data (DB/CMS/API)
    │
    ▼
Data sufficiency scoring ──→ Reject/hold insufficient records
    │
    ▼
Automated enrichment ──→ Generate intros, FAQs, summaries from structured data
    │
    ▼
Quality sampling ──→ Human reviews a random sample (5-10%) per batch
    │
    ▼
Publish gate ──→ Only records with score >= 60 and enrichment complete
    │
    ▼
Page generation

Enrichment Sources (non-LLM)

Before reaching for LLM generation, exhaust structured data enrichment:

  • Template composition from multiple fields: Combine 3+ data fields into prose. Not "Best {service} in {city}" but structured sentences using attributes, stats, and category context.
  • Aggregation: Roll up child data into parent summaries (category stats, comparison tables, top-N lists)
  • Cross-referencing: Enrich records by joining data sources (product + reviews, service + location demographics, listing + category averages)
  • FAQ generation from data patterns: Turn common attribute variations into Q&A pairs (“How much does X cost in Y?” → answer from price data); see the sketch after this list
  • Comparison data: Auto-generate comparison sections from sibling records in the same category
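
A minimal sketch of FAQ generation from pricing data (the priceMin/priceMax/currency field names are assumptions about the record shape):

interface FAQ {
  question: string;
  answer: string;
}

function faqsFromPricing(record: {
  name: string;
  city: string;
  priceMin?: number;
  priceMax?: number;
  currency?: string;
}): FAQ[] {
  const faqs: FAQ[] = [];
  if (record.priceMin != null && record.priceMax != null) {
    const cur = record.currency ?? "$";
    faqs.push({
      question: `How much does ${record.name} cost in ${record.city}?`,
      answer:
        `${record.name} in ${record.city} typically costs between ` +
        `${cur}${record.priceMin} and ${cur}${record.priceMax}, based on current listing data.`,
    });
  }
  return faqs;
}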

LLM-Assisted Enrichment

If structured enrichment is insufficient, LLM-assisted generation is acceptable under strict conditions:

  1. Never generate the entire page content — LLM fills gaps in an otherwise data-driven page
  2. Always ground in real data — the LLM prompt includes the record’s actual attributes, stats, and context
  3. Human review sampling — review 5-10% of LLM-generated content per batch before publishing
  4. Store the generation metadata — track which fields were LLM-generated vs. sourced from data (see the sketch after this list)
  5. Apply quality guard after enrichment — run pseo-quality-guard on enriched content before publishing
  6. Regenerate periodically — stale LLM content should be refreshed when underlying data changes
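
A sketch of points 2 and 4 combined; the llm.complete() client is a placeholder rather than a real API, and the generation_meta JSONB column is an assumption on top of the minimum schema:

// Ground the prompt in the record's real attributes; never ask for an entire page
function buildIntroPrompt(record: {
  name: string;
  category: string;
  attributes: Record<string, string>;
}): string {
  const facts = Object.entries(record.attributes)
    .map(([key, value]) => `- ${key}: ${value}`)
    .join("\n");
  return (
    `Write a 2-3 sentence introduction for ${record.name} (${record.category}).\n` +
    `Use only these facts and do not invent any others:\n${facts}`
  );
}

async function enrichIntro(record: {
  slug: string;
  name: string;
  category: string;
  attributes: Record<string, string>;
}) {
  const intro = await llm.complete(buildIntroPrompt(record)); // placeholder LLM client
  await db.query(
    `UPDATE pages
     SET intro_text = $1,
         generation_meta = jsonb_build_object('intro_text', 'llm', 'generated_at', NOW())
     WHERE slug = $2`,
    [intro, record.slug]
  );
}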

Google’s position (2025): AI-generated content is acceptable if it provides genuine value. The risk is not the generation method but the output quality. Scaled LLM generation that produces interchangeable pages will trigger the same penalties as template spam.

4. Incremental Validation

At 100K pages, full-corpus quality checks take hours. Switch to incremental validation.

Delta Validation

Only validate pages that changed since the last validation run:

async function getChangedPages(since: Date): Promise<string[]> {
  const result = await db.query(
    `SELECT slug FROM pages
     WHERE status = 'published'
       AND (last_modified > $1 OR last_validated IS NULL)`,
    [since]
  );
  return result.map(r => r.slug);
}

Content hashing: Store a hash of each page’s rendered content. On validation, only re-check pages whose hash changed:

import { createHash } from "crypto";

function contentHash(page: BaseSEOContent): string {
  const content = `${page.title}|${page.h1}|${page.metaDescription}|${page.bodyContent}`;
  return createHash("sha256").update(content).digest("hex").slice(0, 16);
}
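
Combining the two, a delta validation pass might look like this sketch (loadPage and validatePage stand in for the existing data layer and pseo-quality-guard checks; both are assumptions here):

async function validateDelta(since: Date) {
  const slugs = await getChangedPages(since);

  for (const slug of slugs) {
    const page = await loadPage(slug); // full content, loaded one page at a time
    const hash = contentHash(page);

    const [row] = await db.query(`SELECT content_hash FROM pages WHERE slug = $1`, [slug]);
    if (row?.content_hash === hash) continue; // rendered content unchanged, skip

    await validatePage(page); // run quality checks on this page only
    await db.query(
      `UPDATE pages SET content_hash = $1, last_validated = NOW() WHERE slug = $2`,
      [hash, slug]
    );
  }
}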

Periodic Full Scan

Run a complete validation weekly or before major releases. For the full scan at 100K:

  • Parallelize: Run 4-8 validation workers, each processing a category partition
  • Sample cross-category similarity: Don’t compare all 100K × 100K. Compare each page against 50 random pages from other categories + all pages in the same category. A sampling sketch follows the snippet below.
  • Stream results to disk: Write validation results to a JSONL file, then aggregate. Don’t accumulate all results in memory.
// Parallel validation by category
const categories = await getAllCategories();
const workers = categories.map(cat =>
  validateCategory(cat.slug) // each worker handles one category
);
const results = await Promise.all(workers);
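
Inside each worker, the sampled comparison might look like this sketch (similarityScore stands in for whatever fingerprint comparison pseo-quality-guard provides; it is an assumption here):

async function validateCategory(category: string) {
  const pages = await getPagesByCategory(category, { limit: 50000 });
  const issues: Array<{ a: string; b: string; score: number }> = [];

  for (const page of pages) {
    // All other pages in the same category, plus a small random cross-category sample
    const crossSample = await db.query(
      `SELECT slug, title, h1, meta_description FROM pages
       WHERE category != $1 AND status = 'published'
       ORDER BY random() LIMIT 50`,
      [category]
    );
    const candidates = [...pages.filter((p) => p.slug !== page.slug), ...crossSample];

    for (const other of candidates) {
      const score = similarityScore(page, other);
      if (score > 0.85) issues.push({ a: page.slug, b: other.slug, score });
    }
  }
  return issues;
}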

Validation Budget

| Scale | Delta validation | Full validation |
| --- | --- | --- |
| 10K | Minutes | 30-60 minutes |
| 50K | Minutes | 2-4 hours |
| 100K | Minutes | 4-8 hours |

Optimization: Pre-compute and store fingerprints (MinHash signatures) in the database. Validation then only compares fingerprints, not full content.
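
A minimal MinHash sketch along those lines; treat it as an illustration of the idea, not the pseo-quality-guard implementation:

import { createHash } from "crypto";

function shingles(text: string, k = 3): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out = new Set<string>();
  for (let i = 0; i + k <= words.length; i++) out.add(words.slice(i, i + k).join(" "));
  return out;
}

function minhashSignature(text: string, numHashes = 64): number[] {
  const grams = shingles(text);
  const sig = new Array<number>(numHashes).fill(Number.MAX_SAFE_INTEGER);
  for (const gram of grams) {
    for (let seed = 0; seed < numHashes; seed++) {
      // Cheap seeded hash: first 8 hex chars of sha1(seed + gram)
      const h = parseInt(createHash("sha1").update(`${seed}:${gram}`).digest("hex").slice(0, 8), 16);
      if (h < sig[seed]) sig[seed] = h;
    }
  }
  return sig; // store per page, e.g. as int[] or JSONB
}

// Estimated Jaccard similarity = fraction of matching signature positions
function estimatedSimilarity(a: number[], b: number[]): number {
  let matches = 0;
  for (let i = 0; i < a.length; i++) if (a[i] === b[i]) matches++;
  return matches / a.length;
}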

5. Crawl Budget Management

At 100K pages, Google won’t crawl everything immediately. Crawl budget — the number of pages Google crawls per day — becomes a constraint.

Sitemap Submission Strategy

sitemap-index.xml
├── sitemap-category-a.xml     (≤ 50,000 URLs)
├── sitemap-category-b.xml
├── sitemap-category-c.xml
└── ...

Submission cadence:

  • Submit the sitemap index to both Google Search Console and Bing Webmaster Tools
  • Update individual category sitemaps as pages are added
  • Don’t submit all 100K URLs at once — Google throttles crawling for new sites with sudden URL spikes

Programmatic sitemap submission:

// Use Google Indexing API for high-priority pages (job postings, livestreams)
// For standard pages, rely on sitemap discovery + Search Console

// Batch sitemap updates by category
async function updateCategorySitemap(category: string) {
  const pages = await getPagesByCategory(category, { limit: 50000 });
  const xml = generateSitemapXml(pages);
  await writeFile(`public/sitemap-${category}.xml`, xml);
}
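
generateSitemapXml is assumed above; a minimal version might look like this (SITE_URL is an assumed env var):

function generateSitemapXml(
  pages: Array<{ canonical_path: string; last_modified: string | Date }>
): string {
  const baseUrl = process.env.SITE_URL ?? "https://example.com";
  const urls = pages
    .map(
      (p) =>
        `  <url>\n    <loc>${baseUrl}${p.canonical_path}</loc>\n` +
        `    <lastmod>${new Date(p.last_modified).toISOString()}</lastmod>\n  </url>`
    )
    .join("\n");
  return `<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${urls}\n</urlset>\n`;
}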

Crawl Budget Optimization

  • Prioritize high-value pages: Set priority in sitemap to guide Googlebot (0.8 for hubs, 0.6 for pages with high sufficiency scores, 0.4 for the rest)
  • Remove low-quality pages from sitemap: Pages with sufficiency score < 60 should not be in the sitemap
  • Fix crawl waste: Ensure no soft 404s, no redirect chains, no parameter-based duplicates — all waste crawl budget
  • Server response time: Keep TTFB < 500ms. Slow servers get less crawl budget.
  • Monitor crawl stats: Check Google Search Console → Settings → Crawl stats weekly. If crawl rate drops, investigate.

Indexing Rate Expectations

Google does not guarantee indexing all pages. Realistic expectations:

| Site authority | Pages submitted | Likely indexed (6 months) |
| --- | --- | --- |
| New site | 100K | 10-30% |
| Established (DR 30-50) | 100K | 40-70% |
| High authority (DR 60+) | 100K | 70-90% |

If indexing rate is low:

  1. Improve content quality on indexed pages first
  2. Earn more backlinks to category hubs
  3. Reduce total page count (prune thin pages) — a smaller, higher-quality corpus often indexes better than a large, mediocre one
  4. Ensure internal linking reaches every page within 3 clicks

6. Monitoring at Scale

At 100K pages, you can’t manually check pages. Build automated monitoring.

Key Metrics to Track

| Metric | Source | Alert Threshold |
| --- | --- | --- |
| Pages indexed | Google Search Console API | Drops > 5% week-over-week |
| Crawl rate | Google Search Console API | Drops > 20% |
| Crawl errors (5xx, 404) | Server logs, GSC | > 1% of total pages |
| CWV regressions | CrUX API or RUM | LCP > 4s or CLS > 0.25 on any template |
| Build duration | CI/CD logs | > 2x baseline |
| Build memory peak | CI/CD logs | > 80% of available memory |
| Page count by status | Database | Published count deviates from expected |
| Sufficiency score distribution | Database | > 20% of published pages below threshold |

Automated Monitoring Script

// scripts/monitor-pseo.ts — run daily via cron or CI
// db.query returns rows (as elsewhere in this skill); COUNT(*) comes back as a string
const count = async (sql: string) => Number((await db.query(sql))[0].count);

async function monitor() {
  const metrics = {
    totalPublished: await count("SELECT COUNT(*) FROM pages WHERE status = 'published'"),
    avgSufficiency: Number(
      (await db.query("SELECT AVG(data_sufficiency_score) AS avg FROM pages WHERE status = 'published'"))[0].avg
    ),
    belowThreshold: await count("SELECT COUNT(*) FROM pages WHERE data_sufficiency_score < 60 AND status = 'published'"),
    recentlyModified: await count("SELECT COUNT(*) FROM pages WHERE last_modified > NOW() - INTERVAL '7 days'"),
    neverValidated: await count("SELECT COUNT(*) FROM pages WHERE last_validated IS NULL AND status = 'published'"),
    redirectCount: await count("SELECT COUNT(*) FROM redirects"),
    brokenRedirects: await count(
      "SELECT COUNT(*) FROM redirects r WHERE NOT EXISTS (SELECT 1 FROM pages p WHERE p.canonical_path = r.destination)"
    ),
  };

  // Output report or send to monitoring service
  console.log(JSON.stringify(metrics, null, 2));

  // Alert on critical conditions
  if (metrics.brokenRedirects > 0) console.error("ALERT: Broken redirects found");
  if (metrics.belowThreshold / metrics.totalPublished > 0.2) {
    console.error("ALERT: >20% of published pages below sufficiency threshold");
  }
}

Search Console API Integration

At 100K pages, manual Search Console checks are impractical. Use the API:

  • Indexing status: Query the URL Inspection API in batches to check indexing status of new pages
  • Performance data: Pull clicks, impressions, CTR by page template to identify underperforming page types (a query sketch follows this list)
  • Coverage issues: Monitor for “Crawled — currently not indexed” and “Discovered — currently not indexed” trends
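
A sketch of pulling performance data by page with the googleapis client (the service-account setup and site URL wiring are assumptions):

import { google } from "googleapis";

async function pagePerformance(siteUrl: string, startDate: string, endDate: string) {
  const auth = new google.auth.GoogleAuth({
    scopes: ["https://www.googleapis.com/auth/webmasters.readonly"],
  });
  const searchconsole = google.searchconsole({ version: "v1", auth });

  const res = await searchconsole.searchanalytics.query({
    siteUrl,
    requestBody: {
      startDate,
      endDate,
      dimensions: ["page"],
      rowLimit: 25000, // API maximum per request; paginate with startRow beyond this
    },
  });

  // Aggregate rows by template (e.g. /services/... vs /cities/...) before alerting
  return res.data.rows ?? [];
}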

7. Edge and CDN Architecture

At 100K pages, the origin server can’t handle all traffic directly.

Caching Strategy

Client → CDN Edge → Origin (Next.js/framework)
              ↓
         Cache Layer (Redis/edge KV)
              ↓
           Database

CDN configuration:

  • Cache all pSEO pages at the edge with s-maxage=86400, stale-while-revalidate=3600
  • Use ISR revalidation to refresh cached pages (not full rebuilds)
  • Set longer TTLs for stable pages (30 days), shorter for dynamic data pages (1 day)

Cache invalidation:

  • On data change → invalidate the specific page’s cache via on-demand revalidation API (sketched below)
  • On category change → invalidate all pages in the category
  • On template change → purge the CDN for all pages of that template type
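
With the Next.js App Router, the page-level case might be a small route handler like this sketch (the secret check and path scheme are assumptions):

// app/api/revalidate/route.ts
import { revalidatePath } from "next/cache";
import { NextRequest, NextResponse } from "next/server";

export async function POST(req: NextRequest) {
  const { path, secret } = await req.json();

  if (secret !== process.env.REVALIDATE_SECRET) {
    return NextResponse.json({ error: "Invalid secret" }, { status: 401 });
  }

  revalidatePath(path); // e.g. "/services/plumbing-austin"
  return NextResponse.json({ revalidated: true, path });
}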

Edge rendering (if supported):

  • Deploy to edge runtimes (Vercel Edge, Cloudflare Workers) for < 50ms TTFB globally
  • Not all frameworks support edge rendering — check compatibility
  • Edge functions have memory limits (~128MB) that constrain complex data operations

Database Connection from Edge

Edge functions can’t maintain persistent database connections. Options:

  • HTTP-based database (PlanetScale, Neon serverless driver, Supabase edge functions); a sketch follows this list
  • Edge KV store (Cloudflare KV, Vercel KV) for index-tier data with database as source of truth
  • Pre-generated JSON at build time for index-tier data, database only for full page content
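
For the first option, a sketch using Neon's HTTP driver from an edge function (the connection string env var is an assumption):

import { neon } from "@neondatabase/serverless";

export const runtime = "edge"; // Next.js route segment config, if this lives in a route file

const sql = neon(process.env.DATABASE_URL!);

export async function getPageAtEdge(slug: string) {
  // Each call is a single HTTP round-trip; no persistent connection is held
  const rows = await sql`
    SELECT slug, title, h1, meta_description, body_content, faqs
    FROM pages
    WHERE slug = ${slug} AND status = 'published'
    LIMIT 1
  `;
  return rows[0] ?? null;
}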

8. Scale-Specific Build Strategy

At 100K pages, the build process itself needs architecture.

Don’t Build All Pages

// At 100K, only pre-build the most important pages
export async function generateStaticParams() {
  // Pre-build: hub pages + top 1K pages by traffic/priority
  const hubs = await getAllCategories();
  const topPages = await db.query(
    `SELECT slug, category FROM pages
     WHERE status = 'published' AND data_sufficiency_score >= 80
     ORDER BY priority DESC LIMIT 1000`
  );
  return [
    ...hubs.map(h => ({ category: h.slug })),
    ...topPages.map(p => ({ category: p.category, slug: p.slug })),
  ];
}

// ISR handles the remaining 99K pages on first request
export const dynamicParams = true;
export const revalidate = 86400; // 24 hours

Build Time Budget

| Pages pre-built | Expected build time | Memory |
| --- | --- | --- |
| 1K (hubs + top pages) | 5-15 minutes | 2-4GB |
| 5K | 15-45 minutes | 4-6GB |
| 10K | 30-90 minutes | 6-8GB |
| 100K (DON’T DO THIS) | 5-15 hours | 16GB+ |

Rule: Never pre-build more than 10K pages. Use ISR for everything else.

Warm-Up After Deploy

After deploying, the ISR cache is cold. The first visitor to each page triggers generation. For critical pages:

// scripts/warm-cache.ts — run after deploy
const baseUrl = process.env.SITE_URL ?? "http://localhost:3000"; // assumed env var
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function warmCache() {
  const priorityPages = await db.query(
    `SELECT canonical_path FROM pages
     WHERE data_sufficiency_score >= 80 AND status = 'published'
     ORDER BY priority DESC LIMIT 5000`
  );

  // Hit each page to trigger ISR generation (rate-limited)
  for (const page of priorityPages) {
    await fetch(`${baseUrl}${page.canonical_path}`);
    await sleep(100); // 10 pages/second — don't DDoS yourself
  }
}

Checklist

  • Database is the primary data store (not JSON files or in-memory arrays)
  • Required indexes exist on slug, category, status, last_modified, sufficiency_score
  • Data sufficiency scoring is implemented and stored per page
  • Pages with score < 60 are gated from generation
  • Combination pages are gated on both dimensions
  • Content enrichment pipeline exists (structured data first, LLM-assisted only with review)
  • Incremental validation is implemented (delta + periodic full scan)
  • Content hashes are stored for change detection
  • Sitemap is split by category with index file
  • Sitemap excludes pages below sufficiency threshold
  • Crawl budget is monitored via Search Console
  • Monitoring script runs daily with alerts
  • CDN caching is configured with appropriate TTLs
  • ISR handles the long tail (only top pages pre-built)
  • Cache warm-up script exists for post-deploy
  • Connection pooling is configured for build-time queries
  • No function loads > 10K full page records into memory

Relationship to Other Skills

  • Extends: pseo-data (replaces in-memory patterns with database), pseo-performance (adds CDN/edge and scale-specific build strategy), pseo-quality-guard (adds incremental validation)
  • Depends on: All content and structure skills must be in place before scaling
  • Validated by: pseo-quality-guard (quality doesn’t change — scale does)
  • Works with: pseo-orchestrate (scale considerations at every phase)